6 Temporal-Difference Learning

6.1 TD Prediction

Tabular TD(0) for estimating $v_\pi$
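A minimal Python sketch of tabular TD(0) prediction, assuming an episodic environment object with reset() and step(action) -> (next_state, reward, done) methods and a fixed policy given as a function policy(state) -> action; these interfaces, names, and default parameters are illustrative assumptions, not from the text.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    # env, policy, and the step() return signature are assumed interfaces.
    V = defaultdict(float)  # V(s), arbitrarily initialized to 0; V(terminal) stays 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

The update bootstraps from the current estimate V(S') immediately after each transition, rather than waiting for the episode's return as a Monte Carlo method would.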

Example 6.1: Driving Home

6.2 Advantages of TD Prediction Methods

Example 6.2: Random Walk

6.3 Optimality of TD(0)

Example 6.3: Random walk under batch updating

Example 6.4: You are the Predictor

6.4 Sarsa: On-policy TD Control

Sarsa (on-policy TD control) for estimating $Q \approx q_*$
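A sketch of Sarsa under the same assumed environment interface as the TD(0) sketch above, plus an explicit finite action list and an epsilon-greedy behavior policy; all names and parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon explore uniformly, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q(s, a), initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # On-policy target uses the action actually taken next:
            # Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```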

Example 6.5: Windy Gridworld

6.5 Q-learning: Off-policy TD Control

Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$
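A sketch of Q-learning under the same assumed interfaces, reusing the epsilon_greedy helper and defaultdict action-value table from the Sarsa sketch above; names and defaults remain illustrative assumptions.

```python
def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)  # Q(s, a), initialized to 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # Off-policy target bootstraps from the greedy action in S',
            # regardless of what the epsilon-greedy behavior does next:
            # Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a Q(S',a) - Q(S,A)]
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The only change from Sarsa is the target: the max over next actions rather than the value of the action the behavior policy actually selects.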

Example 6.6: Cliff Walking

6.6 Expected Sarsa
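A sketch of a single Expected Sarsa update, assuming the target policy is the same epsilon-greedy policy used for behavior and the same (state, action)-keyed table as above; the function name and parameters are illustrative.

```python
def expected_sarsa_update(Q, state, action, reward, next_state, done,
                          actions, alpha=0.5, gamma=1.0, epsilon=0.1):
    # Replace the sampled Q(S', A') in the Sarsa target with its expectation
    # under the policy: R + gamma * sum_a pi(a|S') Q(S', a).
    if done:
        expected = 0.0
    else:
        greedy = max(actions, key=lambda a: Q[(next_state, a)])
        probs = {a: epsilon / len(actions) for a in actions}
        probs[greedy] += 1.0 - epsilon
        expected = sum(probs[a] * Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * expected - Q[(state, action)])
```

Averaging over the next action removes the variance due to its random selection, at the cost of computing the expectation over all actions.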

6.7 Maximization Bias and Double Learning

Example 6.7: Maximization Bias Example

Double Q-learning
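A sketch of Double Q-learning under the same assumed environment and action-list interfaces; the two-table scheme and the 0.5 coin flip follow the idea named above, while function names and parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to the sum Q1 + Q2.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5 update Q1, selecting the argmax with Q1 but
            # evaluating it with Q2 (and vice versa); decoupling selection from
            # evaluation is what removes the maximization bias.
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = reward
            else:
                a_star = max(actions, key=lambda a: A[(next_state, a)])
                target = reward + gamma * B[(next_state, a_star)]
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```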

6.8 Games, Afterstates, and Other Special Cases

6.9 Summary