Sutton’s Temporal-Difference Learning

A critical aspect of research is the reproduction of previously published results. Yet most will find reproduction of research challenging since important parameters needed to reproduce results are often not stated in the papers. I’ve noticed nn the past 5 years there has been a sort of catharsis regarding the lack of reproducibility [1][2][3]. This isn’t an issue for wetlab science alone [4]. The obvious benefit of reproduction is to aid in your own understanding of the results. This then enables one to extend and compare new contributions to existing publications.

As a grad student working on Reinforcement Learning problems I struggled with the intuition behind Sutton’s Learning to Predict by the Methods of Temporal Differences paper. In this post we’ll reproduce the paper’s  “A Random-Walk” example (Page 19 Section 3.2). I recommend at least skimming the paper and/or ch. 6 & 12 of Sutton’s latest Reinforcement Learning textbook (available for free). At minimum it would help if the reader has a basic understanding of Value-Functions and Q-Learning.

Temporal-Difference (TD) algorithms work without an explicit model and learn from the experience and outcomes of iterating over multiple episodes (or sequences). TD Learning is similar to Monte Carlo methods, however TD can learn from individual steps without needing the final result. These are among other differences:

Monte Carlo methods Temporal Difference learning
MC must wait until the end of the episode before the return is known. TD can learn online after every step and does not need to wait until the end of episode.
MC has high variance and low bias. TD has low variance and some decent bias.
MC does not exploit the Markov property. TD exploits the Markov property.

Rather than computing the estimate of a next state, TD can estimate n-steps into future. In the case of TD-λ , we use lambda to set the myopicness of our reward emphasis.  The value of λ can be optimized to for a performance/speed tradeoff. The λ parameter is also called the trace decay parameter, with 0 ≤ λ ≤ 1. The higher the value, the longer lasting the traces. In this case, a larger proportion of credit from a reward can be given to more distant states and actions. λ = 1 is essentially Monte Carlo.

Topic Modeling Amazon Reviews

Adapted from Biel 2011

I found Professor Julian McAuley’s work at UCSD when I was searching for academic work identifying the ontology and utility of products on Amazon. Professor McAuley and his students have accomplished impressive work inferring networks of substitutable and complementary items. They constructed a browseable product graph of related products and discovered topics or ‘microcategories’ that are associated with product relationships to infer networks of substitutable and complementary products. Much of this work utilizes topic modeling, and as I’ve never applied it in academia or work, this blog will be a practical intro to Latent Dirichlet Allocation (LDA) through code.

More broadly what can we do with and what do we need to know about LDA?

  • It is an Unsupervised Learning Technique that assumes documents are produced from a mixture of topics
  • LDA extracts key topics and themes from a large corpus of text
  • Each topic is a ordered list of representative words (Order is based on importance of word to a Topic)
  • LDA describes each document in the corpus based on allocation to the extracted topics
  • Many domain specific methods to create training datasets
  • It is easy to use for exploratory analysis

We’ll be using a subset (reviews_Automotive_5.json.gz) of the 142.8 million reviews spanning May 1996 – July 2014 that Julian and his team have compiled and provided in a very convenient manner on their site.

A wild dataset has appeared! Now what?


Where do we start when we stumble across a dataset we don’t know much about? Lets say one where we don’t necessarily understand the underlying generative process for some or all of the variables. Lets assume for now we’re sure there aren’t one off interventions or level shifts in the data, and we don’t know anything about the distribution of the features, trends, seasonality, model parameters, variance, etc.

I tend to start with the simplest, most interpretable models first, regardless if the problem requires classification, regression, or causality modeling. This allows me to assess how difficult the problem is before wasting time applying a complex solution.

The IPython notebook below will outline exploratory analysis in terms of 1) Histograms and Aggregation, 2) Correlation Structure , 3) Dimensional Reduction. Note this isn’t meant to be an exhaustive effort to enumerate all types of imputation and pre-processing, but a quick examination of some best practices.