6 Temporal-Difference Learning

6.1 TD Prediction

Tabular TD(0) for estimating $v_\pi$
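The tabular TD(0) update is \[V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]\] Below is a minimal Python sketch of the algorithm, assuming a hypothetical environment interface with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`, and a policy given as a function from states to actions; these interfaces are not from the text.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): estimate v_pi for a fixed policy.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done),
    a hypothetical interface used only for this sketch.
    """
    V = defaultdict(float)  # value of terminal states stays 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Update toward the one-step bootstrapped target R + gamma * V(S')
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```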

Example 6.1: Driving Home

6.2 Advantages of TD Prediction Methods

Example 6.2: Random Walk

6.3 Optimality of TD(0)

Example 6.3: Random walk under batch updating

Example 6.4: You are the Predictor

6.4 Sarsa: On-policy TD Control

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]\]

Sarsa (on-policy TD control) for estimating $Q \approx q_*$
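A minimal sketch of the Sarsa control loop, again assuming the hypothetical `reset()`/`step()` environment interface above and an $\varepsilon$-greedy behavior policy:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # Behave greedily w.r.t. Q with probability 1 - epsilon, else explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """On-policy TD control: Q is updated toward R + gamma * Q(S', A'),
    where A' is the action actually taken next (hence the name Sarsa)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```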

Example 6.5: Windy Gridworld

6.5 Q-learning: Off-policy TD Control

Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]\]
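A minimal sketch of Q-learning under the same hypothetical environment interface; the only difference from Sarsa is that the target bootstraps from $\max_a Q(S_{t+1}, a)$ rather than from the action the behavior policy actually takes next:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Off-policy TD control: the target uses max_a Q(S', a) regardless of
    which action the epsilon-greedy behavior policy selects next."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```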

Example 6.6: Cliff Walking

6.6 Expected Sarsa

\[Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \mathbb{E}_\pi[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t)] = Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \textstyle\sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)]\]
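The only change relative to Sarsa is the target, which averages over the target policy's action probabilities instead of sampling the next action. A sketch of that target computation, assuming the target policy is taken to be $\varepsilon$-greedy with respect to Q (one common choice; not mandated by the text):

```python
def expected_sarsa_target(Q, next_state, actions, reward, done, gamma=1.0, epsilon=0.1):
    """Expected Sarsa target: R + gamma * sum_a pi(a|S') Q(S', a),
    with pi assumed here to be epsilon-greedy w.r.t. Q."""
    if done:
        return reward
    q_vals = [Q[(next_state, a)] for a in actions]
    best = max(range(len(actions)), key=lambda i: q_vals[i])
    # epsilon-greedy probabilities: epsilon/|A| on every action,
    # plus (1 - epsilon) extra mass on the greedy action
    expected_q = sum((epsilon / len(actions)) * q for q in q_vals)
    expected_q += (1.0 - epsilon) * q_vals[best]
    return reward + gamma * expected_q
```

The update step itself is identical to Sarsa's, with this target substituted in.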

6.7 Maximization Bias and Double Learning

Example 6.7: Maximization Bias Example

Double Q-learning
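Double Q-learning keeps two independent estimates, $Q_1$ and $Q_2$; on each step, with probability 0.5 it updates \[Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t) + \alpha [R_{t+1} + \gamma Q_2(S_{t+1}, \arg\max_a Q_1(S_{t+1}, a)) - Q_1(S_t, A_t)]\] and otherwise the same update with the roles of $Q_1$ and $Q_2$ swapped. A sketch under the same hypothetical environment interface as above:

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """One estimate selects the argmax action, the other evaluates it,
    which removes the maximization bias of single Q-learning."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5 update Q1 using Q2's evaluation, else swap roles
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = reward
            else:
                a_star = max(actions, key=lambda a: A[(next_state, a)])
                target = reward + gamma * B[(next_state, a_star)]
            A[(state, action)] += alpha * (target - A[(state, action)])
            state = next_state
    return Q1, Q2
```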

6.8 Games, Afterstates, and Other Special Cases

6.9 Summary