Sutton’s Temporal-Difference Learning

A critical aspect of research is the reproduction of previously published results. Yet most will find reproduction of research challenging since important parameters needed to reproduce results are often not stated in the papers. I’ve noticed nn the past 5 years there has been a sort of catharsis regarding the lack of reproducibility [1][2][3]. This isn’t an issue for wetlab science alone [4]. The obvious benefit of reproduction is to aid in your own understanding of the results. This then enables one to extend and compare new contributions to existing publications.

As a grad student working on Reinforcement Learning problems I struggled with the intuition behind Sutton’s Learning to Predict by the Methods of Temporal Differences paper. In this post we’ll reproduce the paper’s  “A Random-Walk” example (Page 19 Section 3.2). I recommend at least skimming the paper and/or ch. 6 & 12 of Sutton’s latest Reinforcement Learning textbook (available for free). At minimum it would help if the reader has a basic understanding of Value-Functions and Q-Learning.

Temporal-Difference (TD) algorithms work without an explicit model and learn from the experience and outcomes of iterating over multiple episodes (or sequences). TD Learning is similar to Monte Carlo methods, however TD can learn from individual steps without needing the final result. These are among other differences:

Monte Carlo methods Temporal Difference learning
MC must wait until the end of the episode before the return is known. TD can learn online after every step and does not need to wait until the end of episode.
MC has high variance and low bias. TD has low variance and some decent bias.
MC does not exploit the Markov property. TD exploits the Markov property.

Rather than computing the estimate of a next state, TD can estimate n-steps into future. In the case of TD-λ , we use lambda to set the myopicness of our reward emphasis.  The value of λ can be optimized to for a performance/speed tradeoff. The λ parameter is also called the trace decay parameter, with 0 ≤ λ ≤ 1. The higher the value, the longer lasting the traces. In this case, a larger proportion of credit from a reward can be given to more distant states and actions. λ = 1 is essentially Monte Carlo.