# Reinforcement Learning: Chapter 12

# 12 Eligibility Traces

## 12.1 The $\lambda$-return

The $\lambda$-return averages the updates of the different $n$-step returns, weighting the $n$-step return $G_{t:t+n}$ proportionally to $\lambda^{n-1}$ (normalized by $1 - \lambda$ so the weights sum to 1):

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$$

The off-line $\lambda$-return algorithm waits until the end of the episode and then makes a semi-gradient update toward $G_t^\lambda$ for every step of the episode.
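As a sketch of this averaging (helper names are mine, not from the text), the episodic $\lambda$-return can be computed directly from the $n$-step returns:

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: the n-step return from time t, bootstrapping from values[t+n].

    rewards[i] holds R_{i+1}; values[i] holds the estimate v(S_i),
    with values[T] = 0 for the terminal state.
    """
    T = len(rewards)
    n = min(n, T - t)                               # returns past the end are clipped
    g = sum(gamma**i * rewards[t + i] for i in range(n))
    if t + n < T:                                   # bootstrap only if not terminal
        g += gamma**n * values[t + n]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """G_t^lambda: weight G_{t:t+n} by (1-lam)*lam^(n-1); the full return gets lam^(T-t-1)."""
    T = len(rewards)
    g = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    return g + lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
```

With $\lambda = 1$ this reduces to the Monte Carlo return $G_t$; with $\lambda = 0$, to the one-step TD target $R_{t+1} + \gamma v(S_{t+1})$.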

## 12.2 TD($\lambda$)

TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways.

- First, it updates the weight vector on every step of an episode rather than only at the end, and thus its estimates may be better sooner.
- Second, its computations are distributed equally in time rather than all at the end of the episode.
- Third, it can be applied to continuing problems rather than just episodic problems.

With function approximation, the eligibility trace is a vector $z_t \in R^d$ with the same number of components as the weight vector $w_t$. Whereas the weight vector is a long-term memory, accumulating over the lifetime of the system, the eligibility trace is a short-term memory, typically lasting less time than the length of an episode.

**Semi-gradient TD($\lambda$) for estimating $v \approx v_{\pi}$**

Input: the policy $\pi$ to be evaluated

Input: a differentiable function $v : S \times R^d \rightarrow R$ such that $v(\text{terminal}, \cdot) = 0$

Initialize value-function weights $w$ arbitrarily (e.g., $w = 0$)

Repeat (for each episode):

- Initialize $S$
- $z \leftarrow 0$
- Repeat (for each step of episode):
- Choose $A \sim \pi(a \mid S)$

- Take action $A$, observe $R, S'$
- $z \leftarrow \gamma \lambda z + \nabla v(S, w)$
- $\delta \leftarrow R + \gamma v(S', w) - v(S, w)$
- $w \leftarrow w + \alpha \delta z$
- $S \leftarrow S'$

- Until $S'$ is terminal
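A minimal sketch of this pseudocode in Python; the environment (a hypothetical 5-state random walk) and the step-size choices are my own, not from the text. With one-hot features, $\nabla v(S, w)$ is just the one-hot vector for $S$:

```python
import numpy as np

def semi_gradient_td_lambda(episodes=3000, alpha=0.05, lam=0.8, gamma=1.0, seed=0):
    """Semi-gradient TD(lambda) with accumulating traces on a 5-state random walk.

    States 0..4, start in state 2; stepping off the right end gives reward +1,
    off the left end reward 0.  True values are 1/6, 2/6, ..., 5/6.
    One-hot features make v(S, w) = w[S] and its gradient the one-hot vector.
    """
    rng = np.random.default_rng(seed)
    n = 5
    w = np.zeros(n)
    for _ in range(episodes):
        s = 2
        z = np.zeros(n)                        # eligibility trace, reset per episode
        while True:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s2 == n else 0.0
            v_next = w[s2] if 0 <= s2 < n else 0.0   # v(terminal, .) = 0
            z *= gamma * lam                   # z <- gamma*lam*z + grad v(S, w)
            z[s] += 1.0
            delta = r + gamma * v_next - w[s]  # TD error
            w += alpha * delta * z
            if not (0 <= s2 < n):
                break
            s = s2
    return w
```

With these settings the estimates settle near the true values $1/6, \dots, 5/6$, while the trace lets each TD error update every recently visited state at once.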

## 12.3 n-step Truncated $\lambda$-return Methods

## 12.4 Redoing Updates: The Online $\lambda$-return Algorithm

The idea is that on each time step, as you gather a new increment of data, you go back and redo all the updates since the beginning of the current episode.
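The targets for those redone updates are truncated $\lambda$-returns $G_{t:h}^\lambda$, where $h$ is the current horizon; a minimal sketch (the function name is mine, not from the text):

```python
def truncated_lambda_return(rewards, values, t, h, gamma, lam):
    """G_{t:h}^lambda: the lambda-return truncated at horizon h (h > t).

    rewards[i] holds R_{i+1}; values[i] holds the current estimate v(S_i).
    The n-step returns with n < h-t get weight (1-lam)*lam^(n-1); the
    longest available return, G_{t:h}, gets the residual weight lam^(h-t-1).
    """
    def n_step(n):                  # G_{t:t+n}, bootstrapping from values[t+n]
        return (sum(gamma**i * rewards[t + i] for i in range(n))
                + gamma**n * values[t + n])
    g = sum((1 - lam) * lam**(n - 1) * n_step(n) for n in range(1, h - t))
    return g + lam**(h - t - 1) * n_step(h - t)
```

At each new horizon $h$, every update from $t = 0$ to $h - 1$ is recomputed against $G_{t:h}^\lambda$, which costs $O(t)$ per step; that cost is what the true online TD($\lambda$) of the next section avoids.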

## 12.5 True Online TD($\lambda$)

The on-line $\lambda$-return algorithm is currently the best performing temporal-difference algorithm.

Let $w_t = w_t^t$. The strategy is to find a compact, efficient way of computing each $w_t$ from $w_{t-1}$.

For the linear approximation, $v(s, w) = w^T x(s)$, the true online TD($\lambda$) update is

$$w_{t+1} = w_t + \alpha \delta_t z_t + \alpha (w_t^T x_t - w_{t-1}^T x_t)(z_t - x_t)$$

where $x_t = x(S_t)$, $\delta_t = R_{t+1} + \gamma w_t^T x_{t+1} - w_t^T x_t$, and

$$z_t = \gamma \lambda z_{t-1} + (1 - \alpha \gamma \lambda z_{t-1}^T x_t) x_t$$

**True Online TD($\lambda$) for estimating $w^T x \approx v_{\pi}$**

Input: the policy $\pi$ to be evaluated

Initialize value function weights $w$ arbitrarily

Repeat (for each episode):

- Initialize state and obtain initial feature vector $x$
- $z \leftarrow 0$
- $V_{old} \leftarrow 0$
- Repeat (for each step of episode):
- Choose $A \sim \pi$
- Take action $A$, observe $R, x'$ (feature vector of the next state)
- $V \leftarrow w^T x$
- $V' \leftarrow w^T x'$
- $\delta \leftarrow R + \gamma V' - V$
- $z \leftarrow \gamma \lambda z + (1 - \alpha \gamma \lambda z^T x) x$
- $w \leftarrow w + \alpha (\delta + V - V_{old}) z - \alpha (V - V_{old}) x$
- $V_{old} \leftarrow V'$
- $x \leftarrow x'$

- Until $x' = 0$ (signaling arrival at a terminal state)
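The box above translates directly to numpy. A sketch, again on a hypothetical 5-state random walk with one-hot features (the environment and parameters are my own choices, not from the text):

```python
import numpy as np

def true_online_td_lambda(episodes=3000, alpha=0.05, lam=0.8, gamma=1.0, seed=0):
    """True online TD(lambda) with dutch traces on a 5-state random walk.

    States 0..4, start in state 2; stepping off the right end gives reward +1,
    off the left end reward 0.  True values are 1/6, 2/6, ..., 5/6.
    """
    rng = np.random.default_rng(seed)
    n = 5
    w = np.zeros(n)
    onehot = np.eye(n)
    terminal = np.zeros(n)                     # x' = 0 at a terminal state
    for _ in range(episodes):
        s = 2
        x = onehot[s]
        z = np.zeros(n)
        v_old = 0.0
        while True:
            s2 = s + (1 if rng.random() < 0.5 else -1)
            r = 1.0 if s2 == n else 0.0
            x2 = onehot[s2] if 0 <= s2 < n else terminal
            v = w @ x
            v2 = w @ x2
            delta = r + gamma * v2 - v
            z = gamma * lam * z + (1 - alpha * gamma * lam * (z @ x)) * x  # dutch trace
            w += alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
            v_old = v2
            x = x2
            if not (0 <= s2 < n):
                break
            s = s2
    return w
```

Note that with $\lambda = 0$ the trace collapses to $z = x$ and the extra $(V - V_{old})$ terms cancel, so the algorithm reduces exactly to semi-gradient TD(0).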

## 12.6 Dutch Traces in Monte Carlo Learning

The linear version of the gradient Monte Carlo prediction algorithm makes the following sequence of updates at the end of the episode, one for each time step $t = 0, \dots, T-1$:

$$w_{t+1} = w_t + \alpha (G - w_t^T x_t) x_t$$

where $G$ is the return of the episode.

Two improvements:

- do some computation during each step and less at the end.
- remove the need to store the feature vectors at each step for use later at the end.

Each of these updates can be rewritten as

$$w_{t+1} = (I - \alpha x_t x_t^T) w_t + \alpha G x_t = F_t w_t + \alpha G x_t$$

where $F_t = I - \alpha x_t x_t^T$ is a forgetting, or fading, matrix. Now, recursing, the final weight vector can be written

$$w_T = a_{T-1} + \alpha G z_{T-1}$$

where $a_t$ and $z_t$ are two auxiliary memory vectors that can be updated incrementally without knowledge of $G$ and with $O(d)$ complexity per time step; $z_t$ is exactly the dutch trace for $\gamma = \lambda = 1$.
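As a hypothetical numerical check (the variable names and the random episode are my own), the incremental recursions $a_t = F_t a_{t-1}$ and $z_t = F_t z_{t-1} + x_t$ reproduce the end-of-episode sequence of Monte Carlo updates exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, d, T = 0.1, 4, 6
G = 2.5                                        # the episode's Monte Carlo return
xs = [rng.normal(size=d) for _ in range(T)]    # feature vectors x_0 .. x_{T-1}
w0 = rng.normal(size=d)

# Batch version: at the end of the episode, sweep the stored features,
# applying the linear gradient-MC update w <- w + alpha*(G - w.x_t)*x_t.
w = w0.copy()
for x in xs:
    w = w + alpha * (G - w @ x) * x

# Incremental version: per-step O(d) updates of two auxiliary vectors,
# needing neither the stored features nor G until the end.
a, z = w0.copy(), np.zeros(d)
for x in xs:
    z = z - alpha * x * (x @ z) + x            # z_t = F_t z_{t-1} + x_t  (dutch trace)
    a = a - alpha * x * (x @ a)                # a_t = F_t a_{t-1}
w_incremental = a + alpha * G * z              # w_T = a_{T-1} + alpha*G*z_{T-1}

assert np.allclose(w, w_incremental)
```

The incremental loop touches only $O(d)$ numbers per step and never stores past feature vectors; the return $G$ enters only once, at the end.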

## 12.7 Sarsa($\lambda$)

**Sarsa($\lambda$) with binary features and linear function approximation for estimating $q \approx q_\pi$ or $q \approx q_*$**
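The box referenced above (not reproduced in these notes) exploits binary features: only the active features' components enter the value and trace updates. A sketch of the per-step update with replacing traces (the function and variable names are my own):

```python
import numpy as np

def sarsa_lambda_step(w, z, active, r, active_next, alpha, gamma, lam, done=False):
    """One Sarsa(lambda) update for binary features.

    `active` / `active_next` are the index lists of the features that are 1
    for (S, A) and (S', A'), so q(S, A, w) = sum of w[i] over i in `active`.
    Uses replacing traces: each active component is reset to 1.
    """
    q = w[active].sum()
    q_next = 0.0 if done else w[active_next].sum()
    delta = r + gamma * q_next - q             # TD error
    z *= gamma * lam                           # decay all traces
    z[active] = 1.0                            # replacing traces for active features
    w += alpha * delta * z
    return w, z
```

Accumulating traces would use `z[active] += 1.0` instead; with tile-coded features several components are active at once, so each TD error updates all of them through the trace.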

**Example 12.1 Traces in Gridworld**

**Example 12.2 Sarsa($\lambda$) on Mountain Car**

**True Online Sarsa($\lambda$) for estimating $w^T x \approx q_\pi$ or $q_*$**

## 12.8 Variable $\lambda$ and $\gamma$

Each time step has a different $\lambda$ and $\gamma$, denoted $\lambda_t$ and $\gamma_t$.

We change notation so that $\lambda: S \times A \rightarrow [0, 1]$, with $\lambda_t = \lambda(S_t, A_t)$, and $\gamma: S \rightarrow [0, 1]$, with $\gamma_t = \gamma(S_t)$.
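With these generalizations the $\lambda$-return is defined recursively; reconstructing the state-based form from the definitions above, each step applies its own discount $\gamma_{t+1}$ and mixes in the bootstrapped value with weight $1 - \lambda_{t+1}$:

$$G_t^{\lambda s} = R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1})\, v(S_{t+1}, w_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s} \right)$$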

## 12.9 Off-policy Eligibility Traces

## 12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)

## 12.11 Stable Off-policy Methods with Traces

GTD($\lambda$): the eligibility-trace algorithm analogous to TDC, the better of the two state-value Gradient-TD prediction algorithms.

GQ($\lambda$): the Gradient-TD algorithm for action values with eligibility traces.

HTD($\lambda$): a hybrid state-value algorithm combining aspects of GTD($\lambda$) and TD($\lambda$).

Emphatic TD($\lambda$): an extension of the one-step Emphatic-TD algorithm.