# 12 Eligibility Traces

## 12.1 The $\lambda$-return

The off-line $\lambda$-return algorithm waits until the end of the episode and then, for each step $t$, updates toward the $\lambda$-return, a weighted average of all the $n$-step returns:

$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$

Each $n$-step return is weighted proportionally to $\lambda^{n-1}$; with $\lambda = 0$ the update reduces to a one-step TD update, and with $\lambda = 1$ to a Monte Carlo update toward $G_t$.
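As a concrete illustration (the function and its interface are mine, not from the text), the $\lambda$-return at time 0 can be computed from one episode's rewards and value estimates by mixing the $n$-step returns with weights $(1-\lambda)\lambda^{n-1}$, with the residual weight $\lambda^{T-1}$ going to the full return:

```python
def lambda_return(rewards, values, gamma, lam):
    """Lambda-return G_0^lambda for the start of one full episode.

    rewards[k] is R_{k+1}; values[n-1] is the current estimate v(S_n),
    with v at the terminal state taken to be 0.
    """
    T = len(rewards)
    # n-step returns G_{0:n} for n = 1..T
    n_step = []
    for n in range(1, T + 1):
        G = sum(gamma ** k * rewards[k] for k in range(n))
        if n < T:
            G += gamma ** n * values[n - 1]  # bootstrap from v(S_n)
        n_step.append(G)
    # weights (1 - lam) * lam^(n-1); residual weight lam^(T-1) on the full return
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step[n - 1] for n in range(1, T))
    return G_lam + lam ** (T - 1) * n_step[T - 1]
```

With $\lambda = 0$ this returns the one-step TD target; with $\lambda = 1$, the Monte Carlo return.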

## 12.2 TD($\lambda$)

TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways.

• First, it updates the weight vector on every step of an episode rather than only at the end, and thus its estimates may be better sooner.
• Second, its computations are equally distributed in time rather than concentrated all at the end of the episode.
• Third, it can be applied to continuing problems rather than just episodic problems.

With function approximation, the eligibility trace is a vector $z_t \in R^d$ with the same number of components as the weight vector $w_t$. Whereas the weight vector is a long-term memory, accumulating over the lifetime of the system, the eligibility trace is a short-term memory, typically lasting less time than the length of an episode.

Semi-gradient TD($\lambda$) for estimating $v \approx v_{\pi}$

Input: the policy $\pi$ to be evaluated

Input: a differentiable function $v : S \times R^d \rightarrow R$ such that $v(terminal, \cdot) = 0$

Initialize value-function weights $w$ arbitrarily (e.g., $w = 0$)

Repeat (for each episode):

• Initialize $S$
• $z \leftarrow 0$
• Repeat (for each step of episode):
• Choose $A \sim \pi( a | S)$
• Take action $A$, observe $R, S'$
• $z \leftarrow \gamma \lambda z + \nabla v(S, w)$
• $\delta \leftarrow R + \gamma v(S', w) - v(S, w)$
• $w \leftarrow w + \alpha \delta z$
• $S \leftarrow S'$
• Until $S'$ is terminal
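The pseudocode above can be sketched in Python for the linear case (the `env_step` and `x` interfaces are hypothetical placeholders; the policy's action choice is folded into `env_step`):

```python
import numpy as np

def semi_gradient_td_lambda(env_step, x, w, alpha, gamma, lam, s0, max_steps=1000):
    """One episode of semi-gradient TD(lambda) with linear v(s, w) = w @ x(s).

    env_step(s) -> (reward, next_state, done) is an assumed interface;
    x(s) returns the feature vector of state s.
    """
    z = np.zeros_like(w)  # eligibility trace: short-term, fading memory
    s = s0
    for _ in range(max_steps):
        r, s_next, done = env_step(s)
        z = gamma * lam * z + x(s)               # for linear v, grad v(S, w) = x(S)
        v_next = 0.0 if done else w @ x(s_next)  # v(terminal, .) = 0
        delta = r + gamma * v_next - w @ x(s)    # TD error
        w = w + alpha * delta * z
        if done:
            break
        s = s_next
    return w
```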

## 12.4 Redoing Updates: The Online $\lambda$-return Algorithm

The idea is that, on each time step, as you gather a new increment of data, you go back and redo all the updates since the beginning of the current episode.

## 12.5 True Online TD($\lambda$)

The on-line $\lambda$-return algorithm is currently the best performing temporal-difference algorithm.

Let $w_t = w_t^t$, the final weight vector produced by the online $\lambda$-return algorithm with horizon $t$. The strategy is to find a compact, efficient way of computing each $w_t$ from $w_{t-1}$.

For the linear approximation, $v(s, w) = w^T x(s)$, the resulting update is

$w_{t+1} = w_t + \alpha \delta_t z_t + \alpha (w_t^T x_t - w_{t-1}^T x_t)(z_t - x_t)$

where $x_t = x(S_t)$, and $z_t$ is a dutch trace:

$z_t = \gamma \lambda z_{t-1} + (1 - \alpha \gamma \lambda z_{t-1}^T x_t) x_t$

True Online TD($\lambda$) for estimating $w^T x \approx v_{\pi}$

Input: the policy $\pi$ to be evaluated

Initialize value function weights $w$ arbitrarily

Repeat (for each episode):

• Initialize state and obtain initial feature vector $x$
• $z \leftarrow 0$
• $V_{old} \leftarrow 0$
• Repeat (for each step of episode):
• Choose $A \sim \pi$
• Take action $A$, observe $R, x'$ (feature vector of the next state)
• $V \leftarrow w^T x$
• $V' \leftarrow w^T x'$
• $\delta \leftarrow R + \gamma V' - V$
• $z \leftarrow \gamma \lambda z + (1 - \alpha \gamma \lambda z^T x) x$
• $w \leftarrow w + \alpha (\delta + V - V_{old}) z - \alpha (V - V_{old}) x$
• $V_{old} \leftarrow V'$
• $x \leftarrow x'$
• Until $x' = 0$ (signaling arrival at a terminal state)
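A minimal Python rendering of the pseudocode above, assuming the episode is supplied as a list of $(x, R, x')$ transitions with $x'$ equal to the zero vector at termination:

```python
import numpy as np

def true_online_td_lambda_episode(transitions, d, alpha, gamma, lam, w=None):
    """True online TD(lambda), linear case, over one episode.

    transitions is a list of (x, r, x_next) triples, with x_next the zero
    vector at the terminal state (matching the pseudocode's stopping test).
    """
    w = np.zeros(d) if w is None else w
    z = np.zeros(d)
    v_old = 0.0
    for x, r, x_next in transitions:
        v = w @ x
        v_next = w @ x_next
        delta = r + gamma * v_next - v
        z = gamma * lam * z + (1 - alpha * gamma * lam * (z @ x)) * x  # dutch trace
        w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
        v_old = v_next
    return w
```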

## 12.6 Dutch Traces in Monte Carlo Learning

The linear version of the gradient Monte Carlo prediction algorithm makes the following sequence of updates, one for each step of the episode, all performed at the episode's end:

$w_{t+1} = w_t + \alpha [G - w_t^T x_t] x_t, \qquad 0 \le t < T$

where $G$ is the return and $x_t = x(S_t)$.

Two improvements:

• do some computation during each step and less at the end.
• remove the need to store the feature vectors at each step for use later at the end.

Each such update can be rewritten as

$w_{t+1} = (I - \alpha x_t x_t^T) w_t + \alpha G x_t = F_t w_t + \alpha G x_t$

where $F_t = I - \alpha x_t x_t^T$ is a forgetting, or fading, matrix. Now, recursing from $w_0$, the final weight vector can be written

$w_T = a_{T-1} + \alpha G z_{T-1}$

where $a_t$ and $z_t$ are two auxiliary memory vectors that can be updated incrementally, without knowledge of $G$ and with $O(d)$ complexity per time step: $z_t = z_{t-1} + (1 - \alpha z_{t-1}^T x_t) x_t$ with $z_0 = x_0$, and $a_t = a_{t-1} - \alpha x_t (x_t^T a_{t-1})$ with $a_0 = F_0 w_0$.
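A sketch of the equivalence (function names are mine): maintaining $a$ and $z$ incrementally, with the return $G$ used only once at the very end, reproduces the naive end-of-episode Monte Carlo updates exactly:

```python
import numpy as np

def mc_weights_batch(xs, G, w0, alpha):
    """Naive linear gradient MC: store every feature vector, update at the end."""
    w = w0.copy()
    for x in xs:
        w = w + alpha * (G - w @ x) * x
    return w

def mc_weights_dutch(xs, G, w0, alpha):
    """Equivalent incremental form: O(d) per step, no stored feature vectors,
    and the return G is needed only once, at the end."""
    a = w0 - alpha * xs[0] * (xs[0] @ w0)  # a_0 = F_0 w_0
    z = xs[0].copy()                       # z_0 = x_0
    for x in xs[1:]:
        z = z + (1 - alpha * (z @ x)) * x  # dutch-trace update
        a = a - alpha * x * (x @ a)        # a_t = F_t a_{t-1}
    return a + alpha * G * z               # w_T = a_{T-1} + alpha G z_{T-1}
```

Both functions produce identical final weights, which is the point of the derivation.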

## 12.7 Sarsa($\lambda$)

Sarsa($\lambda$) with binary features and linear function approximation for estimating $q \approx q_\pi$ or $q \approx q_*$
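With binary features (e.g., from tile coding), only the active features need individual attention on each step. A sketch of a single accumulating-trace update (the set-of-active-indices interface is my own, not the book's pseudocode):

```python
import numpy as np

def sarsa_lambda_update(w, z, active, reward, next_active, alpha, gamma, lam, done):
    """One accumulating-trace Sarsa(lambda) update with binary features.

    `active` / `next_active` are sets of indices of the features equal to 1
    for (S, A) and (S', A'); q(s, a, w) is then just the sum of the
    corresponding weights.
    """
    q = w[list(active)].sum()
    q_next = 0.0 if done else w[list(next_active)].sum()
    delta = reward + gamma * q_next - q
    z *= gamma * lam          # decay every trace component
    for i in active:          # bump only the active features' traces
        z[i] += 1.0
    w += alpha * delta * z
    return w, z
```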

Example 12.1 Traces in Gridworld

Example 12.2 Sarsa($\lambda$) on Mountain Car

True Online Sarsa($\lambda$) for estimating $w^T x \approx q_\pi$ or $q_*$

## 12.8 Variable $\lambda$ and $\gamma$

Each time step may have a different $\lambda$ and $\gamma$, denoted $\lambda_t$ and $\gamma_t$.

We change notation so that $\lambda: S \times A \rightarrow [0, 1]$, such that $\lambda_t = \lambda(S_t, A_t)$, and $\gamma: S \times A \rightarrow [0, 1]$, such that $\gamma_t = \gamma(S_t, A_t)$.

$\bar{Q}_t = \sum_a \pi(a | S_t) q(S_t, a, w_{t-1})$
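A trivial sketch of this expectation with plain arrays (names are mine): $\bar{Q}_t$ is just the dot product of the policy's action probabilities at $S_t$ with the approximate action values there.

```python
import numpy as np

def expected_q(pi_probs, q_values):
    """Expected approximate action value at S_t under pi:
    Q_bar_t = sum_a pi(a | S_t) q(S_t, a, w)."""
    return float(np.dot(pi_probs, q_values))
```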

## 12.11 Stable Off-policy Methods with Traces

GTD($\lambda$): the eligibility-trace algorithm analogous to TDC, the better of the two state-value Gradient-TD prediction algorithms.

GQ($\lambda$): the Gradient-TD algorithm for action values with eligibility traces.

HTD($\lambda$): a hybrid state-value algorithm combining aspects of GTD($\lambda$) and TD($\lambda$).

Emphatic TD($\lambda$): an extension of the one-step Emphatic-TD algorithm.