# TensorFlow Wide & Deep Learning Tutorial

In the previous [TensorFlow Linear Model Tutorial](../wide/),
we trained a logistic regression model to predict the probability that an
individual has an annual income of over 50,000 dollars using the [Census Income
Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). TensorFlow is
great for training deep neural networks too, and you might be wondering which
one you should choose. Well, why not both? Would it be possible to combine the
strengths of both in one model?

In this tutorial, we'll introduce how to use the TF.Learn API to jointly train a
wide linear model and a deep feed-forward neural network. This approach combines
the strengths of memorization and generalization. It's useful for generic
large-scale regression and classification problems with sparse input features
(e.g., categorical features with a large number of possible feature values). If
you're interested in learning more about how Wide & Deep Learning works, please
check out our [research paper](http://arxiv.org/abs/1606.07792).
![Wide & Deep Spectrum of Models](../../images/wide_n_deep.svg "Wide & Deep")
The figure above shows a comparison of a wide model (logistic regression with
sparse features and transformations), a deep model (feed-forward neural network
with an embedding layer and several hidden layers), and a Wide & Deep model
(joint training of both). At a high level, there are only 3 steps to configure a
wide, deep, or Wide & Deep model using the TF.Learn API:

1.  Select features for the wide part: Choose the sparse base columns and
    crossed columns you want to use.
1.  Select features for the deep part: Choose the continuous columns, the
    embedding dimension for each categorical column, and the hidden layer sizes.
1.  Put them all together in a Wide & Deep model
    (`DNNLinearCombinedClassifier`).

And that's it! Let's go through a simple example.
## Setup

To try the code for this tutorial:

1.  [Install TensorFlow](../../get_started/os_setup.md) if you haven't already.

2.  Download [the tutorial code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py).

3.  Install the pandas data analysis library. tf.learn doesn't require pandas,
    but it does support it, and this tutorial uses pandas. To install pandas:

    1.  Get `pip`:

        ```shell
        # Ubuntu/Linux 64-bit
        $ sudo apt-get install python-pip python-dev

        # Mac OS X
        $ sudo easy_install pip
        $ sudo easy_install --upgrade six
        ```

    2.  Use `pip` to install pandas:

        ```shell
        $ sudo pip install pandas
        ```

    If you have trouble installing pandas, consult the
    [instructions](http://pandas.pydata.org/pandas-docs/stable/install.html)
    on the pandas site.

4.  Execute the tutorial code with the following command to train the Wide &
    Deep model described in this tutorial:

    ```shell
    $ python wide_n_deep_tutorial.py --model_type=wide_n_deep
    ```

Read on to find out how this code builds its Wide & Deep model.
## Define Base Feature Columns

First, let's define the base categorical and continuous feature columns that
we'll use. These base columns will be the building blocks used by both the wide
part and the deep part of the model.
```python
import tensorflow as tf

# Categorical base columns.
gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["female", "male"])
race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=[
    "Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
marital_status = tf.contrib.layers.sparse_column_with_hash_bucket("marital_status", hash_bucket_size=100)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

# Continuous base columns.
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
```
## The Wide Model: Linear Model with Crossed Feature Columns

The wide model is a linear model with a wide set of sparse and crossed feature
columns:

```python
wide_columns = [
    gender, native_country, education, occupation, workclass, marital_status, relationship, age_buckets,
    tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
    tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4)),
    tf.contrib.layers.crossed_column([age_buckets, race, occupation], hash_bucket_size=int(1e6))]
```
Wide models with crossed feature columns can memorize sparse interactions
between features effectively. That being said, one limitation of crossed feature
columns is that they do not generalize to feature combinations that have not
appeared in the training data. Let's add a deep model with embeddings to fix
that.
## The Deep Model: Neural Network with Embeddings

The deep model is a feed-forward neural network, as shown in the previous
figure. Each of the sparse, high-dimensional categorical features is first
converted into a low-dimensional and dense real-valued vector, often referred to
as an embedding vector. These low-dimensional dense embedding vectors are
concatenated with the continuous features, and then fed into the hidden layers
of a neural network in the forward pass. The embedding values are initialized
randomly, and are trained along with all other model parameters to minimize the
training loss. If you're interested in learning more about embeddings, check out
the TensorFlow tutorial on [Vector Representations of Words](https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html),
or [Word Embedding](https://en.wikipedia.org/wiki/Word_embedding) on Wikipedia.

We'll configure the embeddings for the categorical columns using
`embedding_column`, and concatenate them with the continuous columns:
```python
deep_columns = [
    tf.contrib.layers.embedding_column(workclass, dimension=8),
    tf.contrib.layers.embedding_column(education, dimension=8),
    tf.contrib.layers.embedding_column(marital_status, dimension=8),
    tf.contrib.layers.embedding_column(gender, dimension=8),
    tf.contrib.layers.embedding_column(relationship, dimension=8),
    tf.contrib.layers.embedding_column(race, dimension=8),
    tf.contrib.layers.embedding_column(native_country, dimension=8),
    tf.contrib.layers.embedding_column(occupation, dimension=8),
    age, education_num, capital_gain, capital_loss, hours_per_week]
```
The higher the `dimension` of the embedding is, the more degrees of freedom the
model will have to learn the representations of the features. For simplicity, we
set the dimension to 8 for all feature columns here. Empirically, a more
informed decision for the number of dimensions is to start with a value on the
order of $$k\log_2(n)$$ or $$k\sqrt[4]{n}$$, where $$n$$ is the number of unique
features in a feature column and $$k$$ is a small constant (usually smaller than
10).
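
As a concrete illustration of the $$k\log_2(n)$$ rule of thumb, here's a small
helper you could use as a starting point. The helper name and the choice of
$$k$$ are our own for this sketch; they are not part of the tutorial code:

```python
import math

# Illustrative helper (not part of the tutorial code): apply the
# k * log2(n) rule of thumb for choosing an embedding dimension.
def embedding_dimension(n, k=4):
  return int(k * math.log(n, 2))

# For a column hashed into 1000 buckets (like `occupation` above), this
# suggests roughly 4 * log2(1000) ~= 39 dimensions; the tutorial simply
# uses 8 for every column to keep the example small.
print embedding_dimension(1000)  # prints 39
```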
Through dense embeddings, deep models can generalize better and make predictions
on feature pairs that were previously unseen in the training data. However, it
is difficult to learn effective low-dimensional representations for feature
columns when the underlying interaction matrix between two feature columns is
sparse and high-rank. In such cases, the interaction between most feature pairs
should be zero except for a few, but dense embeddings will lead to nonzero
predictions for all feature pairs, and thus can over-generalize. On the other
hand, linear models with crossed features can memorize these exception rules
effectively with fewer model parameters.

Now, let's see how to jointly train wide and deep models and allow them to
complement each other's strengths and weaknesses.
## Combining Wide and Deep Models into One

The wide models and deep models are combined by summing up their final output
log odds as the prediction, then feeding the prediction to a logistic loss
function. All the graph definition and variable allocations have already been
handled for you under the hood, so you simply need to create a
`DNNLinearCombinedClassifier`:

```python
import tempfile
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
```
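
As formalized in the [research paper](http://arxiv.org/abs/1606.07792), the
combined model's prediction for binary classification is
$$P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}_{wide}^T[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^T a^{(l_f)} + b)$$,
where $$\sigma$$ is the sigmoid function, $$\phi(\mathbf{x})$$ are the cross
transformations of the raw features, and $$a^{(l_f)}$$ is the final hidden
layer activation. If you want more control over training, the classifier also
accepts optional per-part optimizers; the sketch below uses illustrative,
untuned values, and the exact set of supported arguments may vary across
TF.Learn versions:

```python
# A sketch of optional arguments (illustrative values, not tuned; the
# available arguments may vary across tf.contrib.learn versions).
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.train.FtrlOptimizer(learning_rate=0.1),
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50],
    dnn_optimizer=tf.train.AdagradOptimizer(learning_rate=0.1))
```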
## Training and Evaluating the Model

Before we train the model, let's read in the Census dataset as we did in the
[TensorFlow Linear Model Tutorial](../wide/). The code for
input data processing is provided here again for your convenience:

```python
import pandas as pd
import urllib

# Define the column names for the data sets.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]
LABEL_COLUMN = "label"
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]

# Download the training and test data to temporary files.
# Alternatively, you can download them yourself and change train_file and
# test_file to your own paths.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

# Read the training and test data sets into Pandas dataframes.
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)

# Convert the income bracket into a 0/1 label column.
df_train[LABEL_COLUMN] = (df_train["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test["income_bracket"].apply(lambda x: ">50K" in x)).astype(int)

def input_fn(df):
  # Creates a dictionary mapping from each continuous feature column name (k)
  # to the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values)
                     for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      shape=[df[k].size, 1])
      for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols.items() + categorical_cols.items())
  # Converts the label column into a constant Tensor.
  label = tf.constant(df[LABEL_COLUMN].values)
  # Returns the feature columns and the label.
  return feature_cols, label

def train_input_fn():
  return input_fn(df_train)

def eval_input_fn():
  return input_fn(df_test)
```
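
Because `input_fn` builds its tensors from in-memory constants, you can
optionally sanity-check its output in a session before training. This quick
check is our own addition, not part of the tutorial code:

```python
# Optional sanity check (not part of the tutorial code): materialize the
# features and labels that input_fn produces and inspect a few values.
with tf.Session() as sess:
  feature_cols, label = input_fn(df_train)
  print sess.run(label)[:5]                # first five 0/1 labels
  print sess.run(feature_cols["age"])[:5]  # first five ages
```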
After reading in the data, you can train and evaluate the model:

```python
m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
  print "%s: %s" % (key, results[key])
```
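
Once the model is trained, you can also generate predictions on new data. As a
rough sketch (the exact return type of `predict` differs across TF.Learn
versions, so treat this as illustrative):

```python
# Illustrative: predicted classes (0 or 1) for the evaluation set. The
# return type of predict() varies between tf.contrib.learn versions
# (array vs. generator), so we wrap it in list() to be safe.
predictions = list(m.predict(input_fn=eval_input_fn))
print predictions[:5]
```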
The first line of the output should be something like `accuracy: 0.84429705`. We
can see that the accuracy was improved from about 83.6% using a wide-only linear
model to about 84.4% using a Wide & Deep model. If you'd like to see a working
end-to-end example, you can download our
[example code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/wide_n_deep_tutorial.py).

Note that this tutorial is just a quick example on a small dataset to get you
familiar with the API. Wide & Deep Learning will be even more powerful if you
try it on a large dataset with many sparse feature columns that have a large
number of possible feature values. Again, feel free to take a look at our
[research paper](http://arxiv.org/abs/1606.07792) for more ideas about how to
apply Wide & Deep Learning in real-world large-scale machine learning problems.