Add embedding and MLP support for sparse wide features (#24)
# Description

Currently, DeText has only limited modeling power for sparse features:
1. only a linear model is applied to sparse features
2. there is no interaction between sparse features and dense features (model_score = dense_score + sparse_score)

This PR removes the above limitations by
1. computing dense representations of sparse features
2. allowing interactions between sparse features and dense features

More specifically, the model architecture changes from
```
dense_score = dense_ftrs -> MLP
sparse_score = sparse_ftrs -> Linear
final_score = dense_score + sparse_score
```
to
```
sparse_emb_ftrs = sparse_ftrs -> Dense(sp_emb_size)
all_ftrs = (dense_ftrs, sparse_emb_ftrs) -> Concatenate
final_score = all_ftrs -> MLP
```
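
For readers who prefer code to pseudo-code, the sketch below illustrates the change with `tf.keras`. It is only illustrative, not the actual DeText implementation: the feature sizes (`num_dense`, `num_wide_sp`), hidden size, and activation are made-up placeholders.

```python
import tensorflow as tf

# Illustrative sketch of the scoring-head change (NOT the actual DeText code).
num_dense, num_wide_sp, sp_emb_size, num_hidden = 3, 100, 10, 64

dense_ftrs = tf.keras.Input(shape=(num_dense,), name='dense_ftrs')
sparse_ftrs = tf.keras.Input(shape=(num_wide_sp,), name='sparse_ftrs')  # sparse features in multi-hot form

# Before this PR: sparse features only add a scalar linear score to the dense score.
dense_hidden = tf.keras.layers.Dense(num_hidden, activation='tanh')(dense_ftrs)
dense_score = tf.keras.layers.Dense(1)(dense_hidden)
sparse_score = tf.keras.layers.Dense(1, use_bias=False)(sparse_ftrs)        # Linear
old_score = tf.keras.layers.Add()([dense_score, sparse_score])

# After this PR: sparse features are embedded first, then interact with the
# dense features inside a shared MLP.
sparse_emb_ftrs = tf.keras.layers.Dense(sp_emb_size, use_bias=False)(sparse_ftrs)  # Dense(sp_emb_size)
all_ftrs = tf.keras.layers.Concatenate()([dense_ftrs, sparse_emb_ftrs])
hidden = tf.keras.layers.Dense(num_hidden, activation='tanh')(all_ftrs)
new_score = tf.keras.layers.Dense(1)(hidden)                                # MLP
```

The key difference is that the sparse features now contribute a learned `sp_emb_size`-dimensional representation that passes through the shared MLP together with the dense features, instead of only adding a scalar linear score.
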
## Type of change

- [ ] New feature (non-breaking change which adds functionality)

## List all changes 
* Change sp_linear_model to sp_emb_model and add an option sp_emb_size to allow the sparse matrix to have output dimension > 1
* Change structure of dense & sparse feature interaction as mentioned in the PR description
* Add and restructure unit tests for the sparse embedding model
* Add new data for testing
* Add a sample tfrecord generation helper function in misc_utils.py (an illustrative serialization sketch is shown after this list)
* Add instructions in TRAINING.md
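
The snippet below is a hedged sketch of serializing one query + document list into a TFRecord Example, following the field names described in TRAINING.md. It is not necessarily the helper added in misc_utils.py; the values are made up and `doc_*` fields are omitted for brevity.

```python
import tensorflow as tf

def make_example(query, wide_ftrs, wide_ftrs_sp_idx, wide_ftrs_sp_val, labels):
    """Serialize one query + document list; per-document 2-D arrays are flattened to 1-D lists."""
    feature = {
        'query': tf.train.Feature(bytes_list=tf.train.BytesList(value=[query.encode('utf-8')])),
        'wide_ftrs': tf.train.Feature(float_list=tf.train.FloatList(value=wide_ftrs)),
        'wide_ftrs_sp_idx': tf.train.Feature(int64_list=tf.train.Int64List(value=wide_ftrs_sp_idx)),
        'wide_ftrs_sp_val': tf.train.Feature(float_list=tf.train.FloatList(value=wide_ftrs_sp_val)),
        'label': tf.train.Feature(float_list=tf.train.FloatList(value=labels)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# 2 documents, 3 dense wide features and at most 2 sparse wide features each (0 = padding index).
example = make_example(
    query='how do you delete messages',
    wide_ftrs=[0.305, 0.264, 0.180, 0.192, 0.136, 0.027],
    wide_ftrs_sp_idx=[3, 2, 5, 0],
    wide_ftrs_sp_val=[1.0, 2.5, 0.7, 0.0],
    labels=[0.0, 1.0],
)
with tf.io.TFRecordWriter('sample.tfrecord') as writer:
    writer.write(example.SerializeToString())
```
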

# Testing
- Successfully ran run_detext.sh on data including wide_ftrs_sp_val with sp_emb_size=10
- Successfully ran run_detext_multitask.sh on the data
- Unit tests for the sparse embedding model when sp_emb_size is 1 and > 1
# Checklist

- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published in downstream modules
StarWang authored May 29, 2020
1 parent 9c6b54b commit f0ee982
Showing 14 changed files with 346 additions and 278 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -34,13 +34,13 @@ The DeText framework contains multiple components:

**Word embedding layer**. It converts the sequence of words into a d by n matrix.

**CNN/BERT/LSTM for text embedding layer**. It takes the word embedding matrix as input and maps the text data into a fixed-length embedding. It is worth noting that we adopt representation-based methods over interaction-based methods. The main reason is computational complexity: the time complexity of interaction-based methods is at least O(mnd), which is one order higher than that of representation-based methods, max(O(md), O(nd)).
**CNN/BERT/LSTM for text encoding layer**. It takes the word embedding matrix as input and maps the text data into a fixed-length embedding. It is worth noting that we adopt representation-based methods over interaction-based methods. The main reason is computational complexity: the time complexity of interaction-based methods is at least O(mnd), which is one order higher than that of representation-based methods, max(O(md), O(nd)).

**Interaction layer**. It generates deep features based on the text embeddings. Many options are provided, such as concatenation, cosine similarity, etc.

**Traditional Feature Processing**. We combine the traditional features with the interaction features (deep features) in a wide & deep fashion.
**Wide & Deep Feature Processing**. We combine the traditional features with the interaction features (deep features) in a wide & deep fashion.

**MLP layer**. The MLP layer combines traditional features and deep features.
**MLP layer**. The MLP layer combines wide features and deep features.

It is an end-to-end model where all the parameters are jointly updated to optimize the click probability.
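
As a rough illustration of the interaction layer described above, the sketch below computes inner-product, Hadamard, and concatenation features from a query embedding and per-document embeddings. It is illustrative only; `interaction_features` and its signature are not the actual DeText API.

```python
import tensorflow as tf

def interaction_features(query_emb, doc_emb, funcs=('inner', 'hadamard', 'concat')):
    """query_emb: [batch, d]; doc_emb: [batch, list_size, d]."""
    q = tf.broadcast_to(tf.expand_dims(query_emb, 1), tf.shape(doc_emb))  # [batch, list_size, d]
    ftrs = []
    if 'inner' in funcs:
        ftrs.append(tf.reduce_sum(q * doc_emb, axis=-1, keepdims=True))   # one similarity score per document
    if 'hadamard' in funcs:
        ftrs.append(q * doc_emb)                                          # element-wise product
    if 'concat' in funcs:
        ftrs.append(tf.concat([q, doc_emb], axis=-1))                     # plain concatenation
    return tf.concat(ftrs, axis=-1)                                       # deep features, one row per document

deep_ftrs = interaction_features(tf.random.normal([2, 64]), tf.random.normal([2, 10, 64]))
```
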

41 changes: 36 additions & 5 deletions TRAINING.md
@@ -8,14 +8,43 @@ DeText uses TFRecords format for training data. In general, the input data shou
* One field for "wide features" with name `wide_ftrs`
* Multiple fields for "document fields" with name `doc_<field name>`
* One field for "labels" with name `label`
* [optional] Multiple fields for "user fields" with name `usr_<field name>`
* [optional] One field for "sparse wide features indices" with name `wide_ftrs_sp_idx` and one field for "sparse wide features values" with name `wide_ftrs_sp_val`

We show an example of the prepared training data and explain the data format and shapes.
* `query` (string list containing only 1 string)
* For each training sample, there should be 1 query field.
* e.g. ["how do you del ##ete messages"]
* `wide_ftrs` (float list)
* There could be multiple wide features for each document. Therefore the wide features are really a 2-D array with shape [#documents, #wide features per document]. Since TFRecords support 1-D FloatList, we flatten the wide features in the preparation and transform to grouped features in `train/data_fn.py` by reshaping. Therefore the float list of wide_ftrs in the training data has `#documents * #wide features per document = 10 * 3 = 30` entries.
* [0.305 0.264 0.180 0.192 0.136 0.027 0.273 0.273 0.377 0.233 0.264 0.227 0.119 0.119 0.198 0.212 0.274 0.047 0.320 0.255 0.350 0.000 0.000 0.357 0.301 0.367 0.292 0.000 0.000 0.170]
* There could be multiple dense wide features for each document, so the dense wide features are a 2-D array with shape [#documents, #dense wide features per document]. Since TFRecords supports 1-D FloatList, we flatten the dense wide features during preparation and transform them back to grouped features in `train/data_fn.py` by reshaping. Therefore the float list of wide_ftrs in the training data has `#documents * #dense wide features per document = 4 * 3 = 12` entries. The dense wide features belong to each document sequentially, i.e., the first 3 wide features belong to the first document, the next 3 belong to the second document, etc.
* [0.305 0.264 0.180 0.192 0.136 0.027 0.273 0.273 0.377 0.233 0.264 0.227]
* `wide_ftrs_sp_idx` (int list)
* There could be multiple sparse wide features for each document, so the sparse wide feature indices are a 2-D array with shape [#documents, #max num of sparse wide features among documents]. Since TFRecords supports 1-D IntList, we flatten the sparse wide feature indices during preparation and transform them back to grouped features in `train/data_fn.py` by reshaping. Therefore the int list of wide_ftrs_sp_idx in the training data has `#sum_i(num documents in list i * #max sparse wide features in list i)` entries. Within the same list, if the number of sparse features of document m is smaller than the maximum number of sparse wide features in the list, the sparse feature indices must be padded with 0. The example below shows wide_ftrs_sp_idx for a list with 4 documents where the maximum number of sparse wide features is 2. The sparse wide features belong to each document sequentially, i.e., the first 2 features belong to the first document, the next 2 belong to the second document, etc. Note that **0 should NEVER be used for wide_ftrs_sp_idx except for padding**.
* [3 2 5000 20 1 0 8 0]
* `wide_ftrs_sp_val` (float list)
* Sparse wide feature values have the same shape as, and correspond one-to-one with, the sparse wide feature indices. E.g., if the sparse feature indices of list i are [1, 5, 2], then the sparse feature values [-5.0, 12.0, 11.0] mean that the sparse wide features for this list are [-5.0, 11.0, 0.0, 0.0, 12.0]. If this field is missing, the values corresponding to sparse wide feature indices are set to 1 by default. Values corresponding to padding entries of the sparse wide feature indices must be set to 0. (A reshaping/reconstruction sketch for these fields is shown after this list.)
* [3.0 2.0 5000.0 20.0 1.0 0 8.0 0]
* `label` (float list)
* The labels corresponding to each document. In our example, 0 for documents without any click and 1 for documents with clicks.
* [0 0 1 0 0 0 0 0 0 0]
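
The snippet below sketches how the flattened example lists above can be reshaped, and how the sparse index/value pairs map onto a dense sparse-feature vector. It is illustrative only, not the actual `train/data_fn.py` code: `num_wide_sp = 5001` is a hypothetical vocabulary size chosen to cover the largest index (5000) in the example, the `to_dense` helper is hypothetical, and in practice the model consumes the indices directly via an embedding-style lookup rather than materializing the dense vector.

```python
import numpy as np

num_docs, num_wide, max_sp, num_wide_sp = 4, 3, 2, 5001

# Flattened lists from the examples above, reshaped back to per-document rows.
wide_ftrs = np.array([0.305, 0.264, 0.180, 0.192, 0.136, 0.027,
                      0.273, 0.273, 0.377, 0.233, 0.264, 0.227]).reshape(num_docs, num_wide)
wide_ftrs_sp_idx = np.array([3, 2, 5000, 20, 1, 0, 8, 0]).reshape(num_docs, max_sp)
wide_ftrs_sp_val = np.array([3.0, 2.0, 5000.0, 20.0, 1.0, 0.0, 8.0, 0.0]).reshape(num_docs, max_sp)

def to_dense(idx_row, val_row, size):
    """Expand one document's (index, value) pairs into a dense sparse-feature vector."""
    dense = np.zeros(size)
    for i, v in zip(idx_row, val_row):
        if i != 0:              # 0 is reserved for padding and never fills a slot
            dense[i - 1] = v    # index k maps to dense position k - 1
    return dense

dense_sp_ftrs = np.stack([to_dense(i, v, num_wide_sp)
                          for i, v in zip(wide_ftrs_sp_idx, wide_ftrs_sp_val)])
print(wide_ftrs.shape, dense_sp_ftrs.shape)  # (4, 3) (4, 5001)
```
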
@@ -31,6 +60,8 @@ The following example (from [run_detext.sh](src/detext/resources/run_detext.sh))
The train/dev/test datasets are prepared in the format mentioned in the previous section. More specifically, the following fields are used:
* `query`
* `wide_ftrs`
* `wide_ftrs_sp_idx`
* `wide_ftrs_sp_val`
* `doc_titles`
* `label`

@@ -80,15 +111,15 @@ A complete list of training parameters that DeText provides is given below. User
| Network | ftr_ext | str | cnn, bert, lstm, lstm_lm | | NLP feature extraction module. |
| | num_units | int | | 128 | word embedding size. |
| | num_units_for_id_ftr | int | | 128 | id feature embedding size. |
| | num_usr_fields | int | | 0 | number of user fields. |
| | num_hidden | str | | 0 | hidden size. This could be a number or a list of comma separated numbers for multiple hidden layers. |
| | num_wide | int | | 0 | number of wide features per doc. |
| | ltr_loss_fn | str | | pairwise | learning-to-rank method. |
| | use_deep | str2bool | | TRUE | Whether to use deep features. |
| | elem_rescale | str2bool | | TRUE | Whether to perform elementwise rescaling. |
| | emb_sim_func | str | | inner | The approach to compute query/doc similarity scores: inner/hadamard/concat or any combination of them separated by comma. |
| | num_wide_sp | int | | None | number of sparse wide features per doc |
| | num_classes | int | | 1 | Number of classes for multi-class classification tasks. This should be set to the number of classes in the multiclass classification task. |
| Sparse feature related | num_wide_sp | int | | None | maximum number of sparse wide features|
| | sp_emb_size | int | | 1 | embedding size of sparse wide features|
| | | | | | |
| CNN related | filter_window_sizes | str | | "1,2,3" | CNN filter window sizes. |
| | num_filters | int | | 100 | number of CNN filters. |
@@ -98,7 +129,7 @@ A complete list of training parameters that DeText provides is given below. User
| | bert_config_file | str | | None | bert config. |
| | bert_checkpoint | str | | None | pretrained bert model checkpoint. |
| | | | | | |
| LSTM related | unit_type | str | lstm,gru,layer_norm_lstm | lstm | RNN cell unit type. Support lstm/gru/layer_norm_lstm |
| LSTM related | unit_type | str | lstm | lstm | RNN cell unit type. Currently only supports lstm. |
| | num_layers | int | | 1 | RNN layers |
| | num_residual_layers | int | | 0 | Number of residual layers from top to bottom. For example, if `num_layers=4` and `num_residual_layers=2`, the last 2 RNN cells in the returned list will be wrapped with `ResidualWrapper`. |
| | forget_bias | float | | 1 | Forget bias of RNN cell |
Binary file modified detext_model_architecture.png
2 changes: 1 addition & 1 deletion setup.py
@@ -15,7 +15,7 @@
"Intended Audience :: Developers",
"License :: OSI Approved"],
license='BSD-2-CLAUSE',
version='1.1.5',
version='1.2.0',
package_dir={'': 'src'},
packages=setuptools.find_packages('src'),
include_package_data=True,
