# 9 On-policy Prediction with Approximation

## 9.2 The Prediction Objective

The *Mean Squared Value Error* objective:

$\overline{VE}(w) = \sum_{s \in S} \mu(s)\,[v_\pi(s) - \hat{v}(s,w)]^2$

where $\mu(s) \geq 0$, $\sum_s \mu(s) = 1$, is a state distribution specifying how much we care about the error in each state. Often $\mu(s)$ is chosen to be the fraction of time spent in $s$.
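As a concrete check, the objective can be computed directly when $v_\pi$, $\hat{v}$, and $\mu$ are known; all numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical 3-state example: true values under pi, approximate values,
# and mu as the fraction of time spent in each state (all values assumed).
v_pi  = np.array([1.0, 2.0, 3.0])      # v_pi(s)
v_hat = np.array([1.1, 1.8, 3.2])      # v_hat(s, w)
mu    = np.array([0.5, 0.3, 0.2])      # on-policy state distribution

# VE(w): mu-weighted sum of squared value errors
ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(ve)  # → 0.025
```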

Under on-policy training, $\mu$ is the *on-policy distribution*. In episodic tasks it depends on the initial-state distribution $h(s)$ and the policy: letting $\eta(s)$ be the expected number of time steps spent in $s$ per episode,

$\eta(s) = h(s) + \sum_{\bar{s}} \eta(\bar{s}) \sum_a \pi(a \mid \bar{s})\, p(s \mid \bar{s}, a), \qquad \mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}$

$w \leftarrow w + \alpha [G_t - \hat{v}(S_t, w)] \nabla \hat{v}(S_t, w)$

If $G_t$ is an unbiased estimate of $v_\pi(S_t)$ (e.g. the Monte Carlo return), this is a true stochastic-gradient method, and $w$ converges to a local optimum of $\overline{VE}$. If the target is a biased, bootstrapping estimate (e.g. the TD(0) target $R_{t+1} + \gamma \hat{v}(S_{t+1}, w)$, which itself depends on $w$), the update includes only part of the true gradient and is called a semi-gradient method.
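The two cases can be sketched for a linear approximator $\hat{v}(s,w) = w^\top x(s)$, whose gradient with respect to $w$ is just the feature vector $x(s)$ (function names here are my own):

```python
import numpy as np

def gradient_mc_update(w, x, G, alpha):
    """Gradient Monte Carlo: G is the full return, an unbiased target,
    so this is true stochastic gradient descent on VE."""
    # v_hat(s, w) = w . x(s); its gradient w.r.t. w is x(s)
    return w + alpha * (G - w @ x) * x

def semi_gradient_td0_update(w, x, r, x_next, alpha, gamma=1.0):
    """Semi-gradient TD(0): the bootstrapped target r + gamma * w.x_next
    depends on w, but that dependence is ignored when differentiating."""
    target = r + gamma * (w @ x_next)
    return w + alpha * (target - w @ x) * x
```

The only difference between the two is the target; the `x` multiplying the error is the (semi-)gradient in both.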

Example 9.1: State Aggregation on the 1000-state Random Walk
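A minimal sketch of this example under the setup described in the book (1000 states, start at 500, jumps of up to 100 in either direction, terminal rewards $-1$ on the left and $+1$ on the right), using gradient Monte Carlo with one-hot group features. The step size and episode count here are chosen for a quick run, not to reproduce the book's figure:

```python
import numpy as np

N_STATES, N_GROUPS = 1000, 10
GROUP_SIZE = N_STATES // N_GROUPS      # states per aggregate group

def features(s):
    """State aggregation: one-hot indicator of the group containing s."""
    x = np.zeros(N_GROUPS)
    x[(s - 1) // GROUP_SIZE] = 1.0
    return x

def episode(rng):
    """One episode of the random walk; returns (visited states, return G)."""
    s, visited = 500, []
    while 1 <= s <= N_STATES:
        visited.append(s)
        s += rng.integers(1, 101) * rng.choice([-1, 1])
    return visited, (-1.0 if s < 1 else 1.0)

rng = np.random.default_rng(0)
w, alpha = np.zeros(N_GROUPS), 0.01
for _ in range(2000):
    visited, G = episode(rng)          # undiscounted: every state's return is G
    for s in visited:
        x = features(s)
        w += alpha * (G - w @ x) * x   # gradient Monte Carlo update
```

After training, `w` holds one learned value per group, forming the staircase approximation to the near-linear true value function (negative on the left, positive on the right).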

## 9.7 Least-Squares TD

LSTD for estimating $\hat{v} \approx v_{\pi}$ ($O(d^2)$ version)
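A sketch of the $O(d^2)$ version for linear features, maintaining $\hat{A}^{-1}$ incrementally via the Sherman-Morrison identity rather than inverting $\hat{A}$ each step; the class name and interface are my own, and $\varepsilon > 0$ initializes the inverse:

```python
import numpy as np

class LSTD:
    """O(d^2)-per-step LSTD for linear value estimation: track A^-1 and b
    incrementally, where A accumulates x (x - gamma x')^T and b accumulates r x."""

    def __init__(self, d, gamma=1.0, epsilon=1.0):
        self.gamma = gamma
        self.a_inv = np.eye(d) / epsilon   # A^-1 estimate, starts at eps^-1 I
        self.b = np.zeros(d)

    def update(self, x, r, x_next):
        # Sherman-Morrison rank-1 update of A^-1 for A += x (x - gamma x')^T
        v = self.a_inv.T @ (x - self.gamma * x_next)
        self.a_inv -= np.outer(self.a_inv @ x, v) / (1.0 + v @ x)
        self.b += r * x

    def weights(self):
        # w = A^-1 b, the least-squares TD fixed point estimate
        return self.a_inv @ self.b
```

Because $\hat{A}^{-1}$ is updated directly, each step costs $O(d^2)$ instead of the $O(d^3)$ of a fresh matrix inversion.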