Add embedding and MLP support for sparse wide features (#24)
# Description

Currently, DeText has only limited modeling power for sparse features:
1. only a linear model is applied to sparse features
2. there is no interaction between sparse features and dense features (model_score = dense_score + sparse_score)

This PR removes the above limitations by
1. computing dense representations of sparse features
2. allowing interactions between sparse features and dense features

More specifically, the model architecture changes from
```
dense_score = dense_ftrs -> MLP
sparse_score = sparse_ftrs -> Linear
final_score = dense_score + sparse_score
```
to
```
sparse_emb_ftrs = sparse_ftrs -> Dense(sp_emb_size)
all_ftrs = (dense_ftrs, sparse_emb_ftrs) -> Concatenate
final_score = all_ftrs -> MLP
```
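
For readers who prefer code to pseudo-code, the sketch below illustrates the change with `tf.keras`. It is only illustrative, not the actual DeText implementation: the feature sizes (`num_dense`, `num_wide_sp`), hidden size, and activation are made-up placeholders.

```python
import tensorflow as tf

# Illustrative sketch of the scoring-head change (NOT the actual DeText code).
num_dense, num_wide_sp, sp_emb_size, num_hidden = 3, 100, 10, 64

dense_ftrs = tf.keras.Input(shape=(num_dense,), name='dense_ftrs')
sparse_ftrs = tf.keras.Input(shape=(num_wide_sp,), name='sparse_ftrs')  # sparse features in multi-hot form

# Before this PR: sparse features only add a scalar linear score to the dense score.
dense_hidden = tf.keras.layers.Dense(num_hidden, activation='tanh')(dense_ftrs)
dense_score = tf.keras.layers.Dense(1)(dense_hidden)
sparse_score = tf.keras.layers.Dense(1, use_bias=False)(sparse_ftrs)        # Linear
old_score = tf.keras.layers.Add()([dense_score, sparse_score])

# After this PR: sparse features are embedded first, then interact with the
# dense features inside a shared MLP.
sparse_emb_ftrs = tf.keras.layers.Dense(sp_emb_size, use_bias=False)(sparse_ftrs)  # Dense(sp_emb_size)
all_ftrs = tf.keras.layers.Concatenate()([dense_ftrs, sparse_emb_ftrs])
hidden = tf.keras.layers.Dense(num_hidden, activation='tanh')(all_ftrs)
new_score = tf.keras.layers.Dense(1)(hidden)                                # MLP
```

The key difference is that the sparse features now contribute a learned `sp_emb_size`-dimensional representation that passes through the shared MLP together with the dense features, instead of only adding a scalar linear score.
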
## Type of change

- [ ] New feature (non-breaking change which adds functionality)

## List all changes 
* Change sp_linear_model to sp_emb_model and add an option sp_emb_size to allow the sparse matrix to have output dimension > 1
* Change structure of dense & sparse feature interaction as mentioned in the PR description
* Add and restructure unit tests for the sparse embedding model
* Add new data for testing
* Add a sample tfrecord generation helper function in misc_utils.py (an illustrative serialization sketch is shown after this list)
* Add instructions in TRAINING.md
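
The snippet below is a hedged sketch of serializing one query + document list into a TFRecord Example, following the field names described in TRAINING.md. It is not necessarily the helper added in misc_utils.py; the values are made up and `doc_*` fields are omitted for brevity.

```python
import tensorflow as tf

def make_example(query, wide_ftrs, wide_ftrs_sp_idx, wide_ftrs_sp_val, labels):
    """Serialize one query + document list; per-document 2-D arrays are flattened to 1-D lists."""
    feature = {
        'query': tf.train.Feature(bytes_list=tf.train.BytesList(value=[query.encode('utf-8')])),
        'wide_ftrs': tf.train.Feature(float_list=tf.train.FloatList(value=wide_ftrs)),
        'wide_ftrs_sp_idx': tf.train.Feature(int64_list=tf.train.Int64List(value=wide_ftrs_sp_idx)),
        'wide_ftrs_sp_val': tf.train.Feature(float_list=tf.train.FloatList(value=wide_ftrs_sp_val)),
        'label': tf.train.Feature(float_list=tf.train.FloatList(value=labels)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# 2 documents, 3 dense wide features and at most 2 sparse wide features each (0 = padding index).
example = make_example(
    query='how do you delete messages',
    wide_ftrs=[0.305, 0.264, 0.180, 0.192, 0.136, 0.027],
    wide_ftrs_sp_idx=[3, 2, 5, 0],
    wide_ftrs_sp_val=[1.0, 2.5, 0.7, 0.0],
    labels=[0.0, 1.0],
)
with tf.io.TFRecordWriter('sample.tfrecord') as writer:
    writer.write(example.SerializeToString())
```
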

# Testing
- Successfully ran run_detext.sh on data including wide_ftrs_sp_val with sp_emb_size=10
- Successfully ran run_detext_multitask.sh on the data
- Unit tests for the sparse embedding model when sp_emb_size is 1 and > 1
# Checklist

- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published in downstream modules
StarWang authored May 29, 2020
1 parent 9c6b54b commit f0ee982
Showing 14 changed files with 346 additions and 278 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -34,13 +34,13 @@ The DeText framework contains multiple components:

**Word embedding layer**. It converts the sequence of words into a d by n matrix.

**CNN/BERT/LSTM for text embedding layer**. It takes the word embedding matrix as input and maps the text data into a fixed-length embedding. It is worth noting that we adopt representation-based methods over interaction-based methods. The main reason is computational complexity: the time complexity of interaction-based methods is at least O(mnd), which is one order higher than that of representation-based methods, max(O(md), O(nd)).
**CNN/BERT/LSTM for text encoding layer**. It takes the word embedding matrix as input and maps the text data into a fixed-length embedding. It is worth noting that we adopt representation-based methods over interaction-based methods. The main reason is computational complexity: the time complexity of interaction-based methods is at least O(mnd), which is one order higher than that of representation-based methods, max(O(md), O(nd)).

**Interaction layer**. It generates deep features based on the text embeddings. Many options are provided, such as concatenation, cosine similarity, etc.

**Traditional Feature Processing**. We combine the traditional features with the interaction features (deep features) in a wide & deep fashion.
**Wide & Deep Feature Processing**. We combine the traditional features with the interaction features (deep features) in a wide & deep fashion.

**MLP layer**. The MLP layer combines traditional features and deep features.
**MLP layer**. The MLP layer combines wide features and deep features.

It is an end-to-end model where all the parameters are jointly updated to optimize the click probability.
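
As a rough illustration of the interaction layer described above, the sketch below computes inner-product, Hadamard, and concatenation features from a query embedding and per-document embeddings. It is illustrative only; `interaction_features` and its signature are not the actual DeText API.

```python
import tensorflow as tf

def interaction_features(query_emb, doc_emb, funcs=('inner', 'hadamard', 'concat')):
    """query_emb: [batch, d]; doc_emb: [batch, list_size, d]."""
    q = tf.broadcast_to(tf.expand_dims(query_emb, 1), tf.shape(doc_emb))  # [batch, list_size, d]
    ftrs = []
    if 'inner' in funcs:
        ftrs.append(tf.reduce_sum(q * doc_emb, axis=-1, keepdims=True))   # one similarity score per document
    if 'hadamard' in funcs:
        ftrs.append(q * doc_emb)                                          # element-wise product
    if 'concat' in funcs:
        ftrs.append(tf.concat([q, doc_emb], axis=-1))                     # plain concatenation
    return tf.concat(ftrs, axis=-1)                                       # deep features, one row per document

deep_ftrs = interaction_features(tf.random.normal([2, 64]), tf.random.normal([2, 10, 64]))
```
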

41 changes: 36 additions & 5 deletions TRAINING.md
@@ -8,14 +8,43 @@ DeText uses TFRecords format for training data. In general, the input data shou
* One field for "wide features" with name `wide_ftrs`
* Multiple fields for "document fields" with name `doc_<field name>`
* One field for "labels" with name `label`
* [optional] Multiple fields for "user fields" with name `usr_<field name>`
* [optional] One field for "sparse wide features indices" with name `wide_ftrs_sp_idx` and one field for "sparse wide features values" with name `wide_ftrs_sp_val`

We show an example of the prepared training data and explain the data format and shapes.
* `query` (string list containing only 1 string)
* For each training sample, there should be 1 query field.
* e.g. ["how do you del ##ete messages"]
* `wide_ftrs` (float list)
* There could be multiple wide features for each document. Therefore the wide features are really a 2-D array with shape [#documents, #wide features per document]. Since TFRecords support 1-D FloatList, we flatten the wide features in the preparation and transform to grouped features in `train/data_fn.py` by reshaping. Therefore the float list of wide_ftrs in the training data has `#documents * #wide features per document = 10 * 3 = 30` entries.
* [0.305 0.264 0.180 0.192 0.136 0.027 0.273 0.273 0.377 0.233 0.264 0.227 0.119 0.119 0.198 0.212 0.274 0.047 0.320 0.255 0.350 0.000 0.000 0.357 0.301 0.367 0.292 0.000 0.000 0.170]
* There could be multiple dense wide features for each document, so the dense wide features are a 2-D array with shape [#documents, #dense wide features per document]. Since TFRecords supports 1-D FloatList, we flatten the dense wide features during preparation and transform them back to grouped features in `train/data_fn.py` by reshaping. Therefore the float list of wide_ftrs in the training data has `#documents * #dense wide features per document = 4 * 3 = 12` entries. The dense wide features belong to each document sequentially, i.e., the first 3 wide features belong to the first document, the next 3 belong to the second document, etc.
* [0.305 0.264 0.180 0.192 0.136 0.027 0.273 0.273 0.377 0.233 0.264 0.227]
* `wide_ftrs_sp_idx` (int list)
* There could be multiple sparse wide features for each document, so the sparse wide feature indices are a 2-D array with shape [#documents, #max num of sparse wide features among documents]. Since TFRecords supports 1-D IntList, we flatten the sparse wide feature indices during preparation and transform them back to grouped features in `train/data_fn.py` by reshaping. Therefore the int list of wide_ftrs_sp_idx in the training data has `#sum_i(num documents in list i * #max sparse wide features in list i)` entries. Within the same list, if the number of sparse features of document m is smaller than the maximum number of sparse wide features in the list, the sparse feature indices must be padded with 0. The example below shows wide_ftrs_sp_idx for a list with 4 documents where the maximum number of sparse wide features is 2. The sparse wide features belong to each document sequentially, i.e., the first 2 features belong to the first document, the next 2 belong to the second document, etc. Note that **0 should NEVER be used for wide_ftrs_sp_idx except for padding**.
* [3 2 5000 20 1 0 8 0]
* `wide_ftrs_sp_val` (float list)
* Sparse wide feature values have the same shape as, and correspond one-to-one with, the sparse wide feature indices. E.g., if the sparse feature indices of list i are [1, 5, 2], then the sparse feature values [-5.0, 12.0, 11.0] mean that the sparse wide features for this list are [-5.0, 11.0, 0.0, 0.0, 12.0]. If this field is missing, the values corresponding to sparse wide feature indices are set to 1 by default. Values corresponding to padding entries of the sparse wide feature indices must be set to 0. (A reshaping/reconstruction sketch for these fields is shown after this list.)
* [3.0 2.0 5000.0 20.0 1.0 0 8.0 0]
* `label` (float list)
* The labels corresponding to each document. In our example, 0 for documents without any click and 1 for documents with clicks.
* [0 0 1 0 0 0 0 0 0 0]
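
The snippet below sketches how the flattened example lists above can be reshaped, and how the sparse index/value pairs map onto a dense sparse-feature vector. It is illustrative only, not the actual `train/data_fn.py` code: `num_wide_sp = 5001` is a hypothetical vocabulary size chosen to cover the largest index (5000) in the example, the `to_dense` helper is hypothetical, and in practice the model consumes the indices directly via an embedding-style lookup rather than materializing the dense vector.

```python
import numpy as np

num_docs, num_wide, max_sp, num_wide_sp = 4, 3, 2, 5001

# Flattened lists from the examples above, reshaped back to per-document rows.
wide_ftrs = np.array([0.305, 0.264, 0.180, 0.192, 0.136, 0.027,
                      0.273, 0.273, 0.377, 0.233, 0.264, 0.227]).reshape(num_docs, num_wide)
wide_ftrs_sp_idx = np.array([3, 2, 5000, 20, 1, 0, 8, 0]).reshape(num_docs, max_sp)
wide_ftrs_sp_val = np.array([3.0, 2.0, 5000.0, 20.0, 1.0, 0.0, 8.0, 0.0]).reshape(num_docs, max_sp)

def to_dense(idx_row, val_row, size):
    """Expand one document's (index, value) pairs into a dense sparse-feature vector."""
    dense = np.zeros(size)
    for i, v in zip(idx_row, val_row):
        if i != 0:              # 0 is reserved for padding and never fills a slot
            dense[i - 1] = v    # index k maps to dense position k - 1
    return dense

dense_sp_ftrs = np.stack([to_dense(i, v, num_wide_sp)
                          for i, v in zip(wide_ftrs_sp_idx, wide_ftrs_sp_val)])
print(wide_ftrs.shape, dense_sp_ftrs.shape)  # (4, 3) (4, 5001)
```
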
@@ -31,6 +60,8 @@ The following example (from [run_detext.sh](src/detext/resources/run_detext.sh))
The train/dev/test datasets are prepared in the format mentioned in the previous section. More specifically, the following fields are used:
* `query`
* `wide_ftrs`
* `wide_ftrs_sp_idx`
* `wide_ftrs_sp_val`
* `doc_titles`
* `label`

@@ -80,15 +111,15 @@ A complete list of training parameters that DeText provides is given below. User
| Network | ftr_ext | str | cnn, bert, lstm, lstm_lm | | NLP feature extraction module. |
| | num_units | int | | 128 | word embedding size. |
| | num_units_for_id_ftr | int | | 128 | id feature embedding size. |
| | num_usr_fields | int | | 0 | number of user fields. |
| | num_hidden | str | | 0 | hidden size. This could be a number or a list of comma separated numbers for multiple hidden layers. |
| | num_wide | int | | 0 | number of wide features per doc. |
| | ltr_loss_fn | str | | pairwise | learning-to-rank method. |
| | use_deep | str2bool | | TRUE | Whether to use deep features. |
| | elem_rescale | str2bool | | TRUE | Whether to perform elementwise rescaling. |
| | emb_sim_func | str | | inner | The approach to compute query/doc similarity scores: inner/hadamard/concat or any combination of them separated by comma. |
| | num_wide_sp | int | | None | number of sparse wide features per doc |
| | num_classes | int | | 1 | Number of classes for multi-class classification tasks. This should be set to the number of classes in the multiclass classification task. |
| Sparse feature related | num_wide_sp | int | | None | maximum number of sparse wide features|
| | sp_emb_size | int | | 1 | embedding size of sparse wide features|
| | | | | | |
| CNN related | filter_window_sizes | str | | "1,2,3" | CNN filter window sizes. |
| | num_filters | int | | 100 | number of CNN filters. |
@@ -98,7 +129,7 @@ A complete list of training parameters that DeText provides is given below. User
| | bert_config_file | str | | None | bert config. |
| | bert_checkpoint | str | | None | pretrained bert model checkpoint. |
| | | | | | |
| LSTM related | unit_type | str | lstm,gru,layer_norm_lstm | lstm | RNN cell unit type. Support lstm/gru/layer_norm_lstm |
| LSTM related | unit_type | str | lstm | lstm | RNN cell unit type. Currently only supports lstm. |
| | num_layers | int | | 1 | RNN layers |
| | num_residual_layers | int | | 0 | Number of residual layers from top to bottom. For example, if `num_layers=4` and `num_residual_layers=2`, the last 2 RNN cells in the returned list will be wrapped with `ResidualWrapper`. |
| | forget_bias | float | | 1 | Forget bias of RNN cell |
Binary file modified detext_model_architecture.png
2 changes: 1 addition & 1 deletion setup.py
@@ -15,7 +15,7 @@
"Intended Audience :: Developers",
"License :: OSI Approved"],
license='BSD-2-CLAUSE',
version='1.1.5',
version='1.2.0',
package_dir={'': 'src'},
packages=setuptools.find_packages('src'),
include_package_data=True,
