12 Eligibility Traces

12.1 The $\lambda$-return

Updates can be made toward any average of different $n$-step returns, not only a single one. The $\lambda$-return averages all the $n$-step returns, weighting the $n$-step return $G_{t:t+n}$ by $\lambda^{n-1}$ (normalized by $1-\lambda$ so the weights sum to one): $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}$.

The off-line $\lambda$-return algorithm makes no changes to the weight vector during the episode; at the end of the episode, a whole sequence of updates is made toward the $\lambda$-return: $w_{t+1} = w_t + \alpha \big[ G_t^\lambda - v(S_t, w_t) \big] \nabla v(S_t, w_t)$, for $t = 0, \dots, T-1$.
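
As a concrete illustration, here is a minimal Python sketch of how the weighting combines the $n$-step returns for a single time step $t$; the function name and the assumption that the $n$-step returns $G_{t:t+1}, \dots, G_{t:T-1}$ have already been computed are illustrative, not from the text.

```python
def lambda_return(n_step_returns, full_return, lam):
    """Combine n-step returns into the lambda-return for one time step t.

    n_step_returns: list [G_{t:t+1}, ..., G_{t:T-1}] of n-step returns,
        assumed precomputed, for an episode terminating at time T.
    full_return: the complete (Monte Carlo) return G_t.
    lam: the trace-decay parameter lambda in [0, 1].
    """
    g = 0.0
    for n, g_n in enumerate(n_step_returns, start=1):
        g += (1.0 - lam) * lam ** (n - 1) * g_n
    # all remaining weight, lambda^(T-t-1), falls on the full return
    g += lam ** len(n_step_returns) * full_return
    return g
```

With $\lambda = 0$ this reduces to the one-step return $G_{t:t+1}$; with $\lambda = 1$ all the weight falls on the full return, recovering the Monte Carlo target.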

12.2 TD($\lambda$)

TD($\lambda$) improves over the off-line $\lambda$-return algorithm in three ways.

  • First, it updates the weight vector on every step of an episode rather than only at the end, and thus its estimates may be better sooner.
  • Second, its computations are equally distributed in time rather than all at the end of the episode.
  • Third, it can be applied to continuing problems rather than just episodic problems.

With function approximation, the eligibility trace is a vector $z_t \in R^d$ with the same number of components as the weight vector $w_t$. Whereas the weight vector is a long-term memory, accumulating over the lifetime of the system, the eligibility trace is a short-term memory, typically lasting less time than the length of an episode.

Semi-gradient TD($\lambda$) for estimating $v \approx v_{\pi}$

Input: the policy $\pi$ to be evaluated

Input: a differentiable function $v : S \times R^d \rightarrow R$ such that $v(\text{terminal}, \cdot) = 0$

Initialize value-function weights $w$ arbitrarily (e.g., $w = 0$)

Repeat (for each episode):

  • Initialize $S$
  • $z \leftarrow 0$
  • Repeat (for each step of episode):
    • Choose $A \sim \pi(\cdot | S)$
    • Take action $A$, observe $R, S'$
    • $z \leftarrow \gamma \lambda z + \nabla v(S, w)$
    • $\delta \leftarrow R + \gamma v(S', w) - v(S, w)$
    • $w \leftarrow w + \alpha \delta z$
    • $S \leftarrow S'$
  • Until $S'$ is terminal
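
A minimal runnable sketch of the box above in Python, specialized to linear function approximation so that $\nabla v(S, w)$ is just the feature vector $x(S)$. The `env`, `policy`, and `features` interfaces are assumptions for illustration, not part of the text.

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, features, d, alpha, gamma, lam,
                            num_episodes):
    """Semi-gradient TD(lambda) with a linear value function v(s, w) = w @ x(s).

    Assumed interfaces: env.reset() -> initial state;
    env.step(a) -> (reward, next_state, done); policy(s) -> action from pi(.|s);
    features(s) -> d-dimensional feature vector x(s).
    """
    w = np.zeros(d)                        # weight vector: long-term memory
    for _ in range(num_episodes):
        s = env.reset()
        z = np.zeros(d)                    # eligibility trace: short-term memory
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env.step(a)
            x = features(s)
            v = w @ x
            v_next = 0.0 if done else w @ features(s_next)   # v(terminal, w) = 0
            z = gamma * lam * z + x        # gradient of the linear v is x(s)
            delta = r + gamma * v_next - v
            w += alpha * delta * z
            s = s_next
    return w
```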

12.3 n-step Truncated $\lambda$-return Methods

12.4 Redoing Updates: The Online $\lambda$-return Algorithm

The idea is that, on each time step, as you gather a new increment of data, you go back and redo all the updates since the beginning of the current episode.
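
A conceptual Python sketch of this redoing pass: whenever the data horizon advances to $h$, all updates for $t = 0, \dots, h-1$ are redone, starting again from the weights the episode began with. The `trajectory` record and the `truncated_lambda_return` helper (which would compute the truncated $\lambda$-return $G_{t:h}^\lambda$ from the data gathered so far) are assumptions for illustration; the point is only the shape of the computation, which makes $h$ updates on step $h$.

```python
def online_lambda_return_pass(trajectory, w_start, alpha, features,
                              truncated_lambda_return):
    """One pass of the online lambda-return algorithm at horizon h.

    trajectory: list of recorded steps up to the current time (assumed to be
        records with a .state field); w_start: NumPy weight vector from the
        start of the episode; truncated_lambda_return(trajectory, t, h, w):
        assumed helper returning the truncated lambda-return G_{t:h}.
    """
    h = len(trajectory)                        # current horizon
    w = w_start.copy()                         # restart from episode-start weights
    for t in range(h):                         # redo every update since time 0
        x_t = features(trajectory[t].state)
        g = truncated_lambda_return(trajectory, t, h, w)
        w = w + alpha * (g - w @ x_t) * x_t
    return w                                   # the new current weights, w_h^h
```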

12.5 True Online TD($\lambda$)

The online $\lambda$-return algorithm is currently the best performing temporal-difference algorithm.

Let $w_t = w_t^t$, the final weight vector produced by the online $\lambda$-return algorithm at each step. The strategy is to find a compact, efficient way of computing each $w_t$ from $w_{t-1}$.

For the linear approximation, $v(s, w) = w^T x(s)$, the resulting update for the weight vector is

$w_{t+1} = w_t + \alpha \delta_t z_t + \alpha (w_t^T x_t - w_{t-1}^T x_t)(z_t - x_t),$

where $x_t = x(S_t)$, $\delta_t$ is the usual TD error, and $z_t$ is a dutch-style eligibility trace,

$z_t = \gamma \lambda z_{t-1} + (1 - \alpha \gamma \lambda z_{t-1}^T x_t) x_t.$

True Online TD($\lambda$) for estimating $w^T x \approx v_{\pi}$

Input: the policy $\pi$ to be evaluated

Initialize value function weights $w$ arbitrarily

Repeat (for each episode):

  • Initialize state and obtain initial feature vector $x$
  • $z \leftarrow 0$
  • $V_{old} \leftarrow 0$
  • Repeat (for each step of episode):
    • Choose $A \sim \pi$
    • Take action $A$, observe $R, x'$ (feature vector of the next state)
    • $V \leftarrow w^T x$
    • $V' \leftarrow w^T x'$
    • $\delta \leftarrow R + \gamma V' - V$
    • $z \leftarrow \gamma \lambda z + (1 - \alpha \gamma \lambda z^T x) x$
    • $w \leftarrow w + \alpha (\delta + V - V_{old}) z - \alpha (V - V_{old}) x$
    • $V_{old} \leftarrow V'$
    • $x \leftarrow x'$
  • Until $x' = 0$ (signaling arrival at a terminal state)
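
The same box as a runnable Python sketch, with the same assumed `env`/`policy`/`features` interfaces as in the earlier TD($\lambda$) sketch; terminal states are represented by an all-zero feature vector, as in the pseudocode.

```python
import numpy as np

def true_online_td_lambda(env, policy, features, d, alpha, gamma, lam,
                          num_episodes):
    """True online TD(lambda) with linear values v(s, w) = w @ x(s)."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        x = features(s)
        z = np.zeros(d)                        # dutch-style eligibility trace
        v_old = 0.0
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env.step(a)
            x_next = np.zeros(d) if done else features(s_next)
            v = w @ x
            v_next = w @ x_next
            delta = r + gamma * v_next - v
            z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x
            w += alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
            v_old = v_next
            x = x_next
            s = s_next
    return w
```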

12.6 Dutch Traces in Monte Carlo Learning

The linear version of the gradient Monte Carlo prediction algorithm makes the following sequence of updates, one for each time step of the episode, all at the episode's end:

$w_{t+1} = w_t + \alpha \big[ G - w_t^T x_t \big] x_t, \qquad 0 \le t < T,$

where $G$ is the return of the episode and $x_t = x(S_t)$.

Two improvements:

  • do some of the computation during each step of the episode, so that less remains to be done at the end;
  • remove the need to store the feature vectors from every step for use at the end.

The update can be rewritten as

$w_{t+1} = (I - \alpha x_t x_t^T) w_t + \alpha G x_t = F_t w_t + \alpha G x_t,$

where $F_t = I - \alpha x_t x_t^T$ is a forgetting, or fading, matrix. Now, recursing,

$w_T = F_{T-1} w_{T-1} + \alpha G x_{T-1} = \dots = a_{T-1} + \alpha G z_{T-1},$

where $a_{T-1}$ and $z_{T-1}$ are the values at time $T-1$ of two auxiliary memory vectors that can be updated incrementally, without knowledge of $G$, and with $O(d)$ complexity per time step:

$z_t = F_t z_{t-1} + x_t = z_{t-1} + (1 - \alpha z_{t-1}^T x_t) x_t, \qquad z_0 = x_0,$

which is a dutch-style eligibility trace with $\gamma\lambda = 1$, and

$a_t = F_t a_{t-1} = a_{t-1} - \alpha x_t (x_t^T a_{t-1}), \qquad a_0 = F_0 w_0.$

Only the final update, $w_T = a_{T-1} + \alpha G z_{T-1}$, must wait until the end of the episode, when $G$ is known.
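
A Python sketch of the resulting incremental computation, following the derivation above: the per-step work needs no knowledge of $G$, and a single $O(d)$ update at the end of the episode produces the same $w_T$ as the forward-view Monte Carlo updates. Names here are illustrative.

```python
import numpy as np

def dutch_trace_mc(feature_seq, G, w0, alpha):
    """Incremental (dutch-trace) form of the linear gradient MC update sequence.

    feature_seq: feature vectors x_0, ..., x_{T-1} of one episode, in order.
    G: the episode's return, needed only for the final line.
    w0: NumPy weight vector at the start of the episode.
    Returns w_T, the result of the forward-view updates applied in order.
    """
    it = iter(feature_seq)
    x0 = next(it)
    a = w0 - alpha * x0 * (x0 @ w0)      # a_0 = F_0 w_0
    z = np.array(x0, dtype=float)        # z_0 = x_0
    for x in it:                         # per-step work: no G required
        z = z + (1.0 - alpha * (z @ x)) * x      # z_t = F_t z_{t-1} + x_t
        a = a - alpha * x * (x @ a)              # a_t = F_t a_{t-1}
    return a + alpha * G * z             # w_T = a_{T-1} + alpha * G * z_{T-1}
```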

12.7 Sarsa($\lambda$)

Sarsa($\lambda$) with binary features and linear function approximation for estimating $q \approx q_\pi$ or $q \approx q_*$
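
A Python sketch of the algorithm named in the box above. With binary features, each state–action pair activates a set of feature indices whose values are 1, so $q(s, a, w)$ is just a sum of the corresponding weights. The `env` and `active_features` interfaces and the $\varepsilon$-greedy policy are assumptions for illustration; both accumulating and replacing traces are shown.

```python
import numpy as np

def sarsa_lambda_binary(env, active_features, n_features, actions,
                        alpha, gamma, lam, epsilon, n_episodes,
                        replacing_traces=True):
    """Sarsa(lambda) with binary features and a linear action-value function.

    active_features(s, a) -> indices of the features equal to 1 for (s, a);
    env.reset() -> state; env.step(a) -> (reward, next_state, done);
    actions: list of discrete actions. All interfaces are assumed.
    """
    w = np.zeros(n_features)
    z = np.zeros(n_features)

    def q(s, a):
        return w[list(active_features(s, a))].sum()

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        z[:] = 0.0
        done = False
        while not done:
            r, s_next, done = env.step(a)
            delta = r - q(s, a)
            for i in active_features(s, a):
                z[i] = 1.0 if replacing_traces else z[i] + 1.0
            if not done:
                a_next = epsilon_greedy(s_next)
                delta += gamma * q(s_next, a_next)
                s, a = s_next, a_next
            w += alpha * delta * z       # applies the whole trace at once
            z *= gamma * lam
    return w
```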

Example 12.1 Traces in Gridworld

Example 12.2 Sarsa($\lambda$) on Mountain Car

True Online Sarsa($\lambda$) for estimating $w^T x \approx q_\pi$ or $q_*$

12.8 Variable $\lambda$ and $\gamma$

The degree of bootstrapping and discounting is generalized from constant parameters to functions of the state (and action), so each time step can have a different $\lambda$ and $\gamma$, denoted $\lambda_t$ and $\gamma_t$.

We change notation so that $\lambda: S \times A \rightarrow [0, 1]$, such that $\lambda_t = \lambda(S_t, A_t)$, and $\gamma: S \rightarrow [0, 1]$, such that $\gamma_t = \gamma(S_t)$.
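
A small sketch of how a single semi-gradient TD($\lambda$) step changes under this generalization, following the note's convention that $\lambda$ may depend on the state–action pair and $\gamma$ on the state. The indexing convention used here (the trace decays by the current step's $\gamma_t \lambda_t$, while the TD error discounts by $\gamma_{t+1}$) and all interfaces are assumptions for illustration.

```python
def td_step_variable(w, z, x, x_next, r, s, a, s_next,
                     alpha, gamma_fn, lambda_fn):
    """One TD(lambda) step with gamma_t = gamma_fn(S_t), lambda_t = lambda_fn(S_t, A_t)."""
    gamma_t = gamma_fn(s)               # discount/termination at S_t
    gamma_next = gamma_fn(s_next)       # ... and at S_{t+1}
    lam_t = lambda_fn(s, a)             # degree of bootstrapping at (S_t, A_t)
    z = gamma_t * lam_t * z + x         # generalized accumulating trace
    delta = r + gamma_next * (w @ x_next) - (w @ x)
    w = w + alpha * delta * z
    return w, z
```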

12.9 Off-policy Eligibility Traces

12.10 Watkins’s Q($\lambda$) to Tree-Backup($\lambda$)

12.11 Stable Off-policy Methods with Traces

GTD($\lambda$): the eligibility-trace algorithm analogous to TDC, the better of the two state-value Gradient-TD prediction algorithms.

GQ($\lambda$): the Gradient-TD algorithm for action values with eligibility traces.

HTD($\lambda$): a hybrid state-value algorithm combining aspects of GTD($\lambda$) and TD($\lambda$).

Emphatic TD($\lambda$): an extension of the one-step Emphatic-TD algorithm.

12.12 Implementation Issues

12.13 Conclusions