# Reinforcement Learning: Chapter 16

# 16 Applications and Case Studies

## 16.1 TD-Gammon

Backgammon

- TD($\lambda$) + NN (neural network) as the value function approximator
- The TD error is backpropagated to update the network weights

198 input units -> 40-80 hidden units -> predicted probability of winning; the weights are adjusted by backpropagating the TD error.
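A minimal sketch of this update, assuming a one-hidden-layer sigmoid network with per-weight eligibility traces; the step size, trace decay, and random "board features" below are illustrative stand-ins, not Tesauro's actual settings or encoding:

```python
import numpy as np

# Minimal TD(lambda) update for a one-hidden-layer sigmoid value network, in the
# spirit of TD-Gammon. Hyperparameters and inputs are illustrative placeholders.
rng = np.random.default_rng(0)
n_in, n_hid = 198, 40
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
W2 = rng.normal(scale=0.1, size=(1, n_hid))
e1, e2 = np.zeros_like(W1), np.zeros_like(W2)   # one eligibility trace per weight
alpha, lam = 0.1, 0.7

def value(x):
    """Hidden activations and the predicted probability of winning for features x."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))
    v = 1.0 / (1.0 + np.exp(-(W2 @ h)))
    return h, float(v[0])

def td_lambda_step(x, x_next, terminal_reward=None):
    """One move: accumulate traces with the gradient, then backprop the TD error."""
    global W1, W2, e1, e2
    h, v = value(x)
    # Gradient of the scalar output with respect to each weight matrix (backprop).
    dv = v * (1.0 - v)
    gW2 = dv * h[None, :]
    dh = dv * W2.ravel() * h * (1.0 - h)
    gW1 = dh[:, None] * x[None, :]
    e1, e2 = lam * e1 + gW1, lam * e2 + gW2     # decay and accumulate traces
    # Target: the next prediction, or the game outcome (1 = win, 0 = loss) at the end.
    target = terminal_reward if terminal_reward is not None else value(x_next)[1]
    delta = target - v                          # TD error
    W1 += alpha * delta * e1
    W2 += alpha * delta * e2

# Example: one non-terminal move followed by a terminal winning move.
x0, x1 = rng.random(n_in), rng.random(n_in)
td_lambda_step(x0, x1)
td_lambda_step(x1, None, terminal_reward=1.0)
```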

## 16.2 Samuel’s Checkers Player

Linear function approximation + TD + minimax search with alpha-beta cutoffs.
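For reference, a toy sketch of minimax search with alpha-beta cutoffs; the game tree and leaf scores below are invented for illustration, whereas Samuel's program scored checkers positions with its learned linear evaluation function:

```python
import math

# Toy minimax with alpha-beta cutoffs over a nested-list game tree whose leaves
# are evaluation scores.
def alphabeta(node, alpha, beta, maximizing):
    if not isinstance(node, list):              # leaf: return its evaluation
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                   # cutoff: the minimizer will avoid this branch
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:                       # cutoff: the maximizer will avoid this branch
            break
    return value

tree = [[3, 5], [2, 9]]                         # root (max) -> two min nodes -> leaves
print(alphabeta(tree, -math.inf, math.inf, True))   # -> 3
```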

## 16.3 Watson’s Daily-Double Wagering

## 16.4 Optimizing Memory Control

A reinforcement learning memory controller

MDP actions: precharge, activate, read, write, no-op.

Sarsa was used to learn an action-value function.

States were represented by six integer-valued features.

The linear function approximation was implemented by tile coding with hashing.
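A rough sketch of Sarsa with linear function approximation over hashed tile-coded integer features; the action names match the case study, but the tiling scheme, table size, and step size are simplified placeholders rather than the controller's actual design:

```python
import random

# Sarsa with hashed tile coding: each (state, action) pair activates one tile per
# tiling, and Q(s, a) is the sum of the corresponding weights.
ACTIONS = ["precharge", "activate", "read", "write", "noop"]
NUM_TILINGS, TABLE_SIZE = 8, 4096
alpha, gamma, epsilon = 0.1 / NUM_TILINGS, 0.95, 0.05
w = [0.0] * TABLE_SIZE                          # hashed weight table

def active_tiles(state, action):
    """Hash one tile per tiling for the (state, action) pair into the weight table."""
    tiles = []
    for t in range(NUM_TILINGS):
        # Each tiling groups the integer features into tiles of width NUM_TILINGS,
        # shifted by a different offset t (a simplified offsetting scheme).
        coords = tuple((f + t) // NUM_TILINGS for f in state)
        tiles.append(hash((t, coords, action)) % TABLE_SIZE)
    return tiles

def q(state, action):
    return sum(w[i] for i in active_tiles(state, action))

def choose(state):
    """Epsilon-greedy action selection over the approximate action values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q(state, a))

def sarsa_update(state, action, reward, next_state, next_action):
    """One Sarsa step: the gradient is 1 on each active tile, 0 elsewhere."""
    delta = reward + gamma * q(next_state, next_action) - q(state, action)
    for i in active_tiles(state, action):
        w[i] += alpha * delta
```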

## 16.5 Human-level Video Game Play

DQN: Q-learning with a deep convolutional network, modified in three ways (a minimal update sketch follows the list):

- experience replay
- a separate target network, held fixed for the next C updates, provides the Q-learning targets
- the TD error is clipped to [-1, 1]
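A minimal sketch of the resulting update, using a small fully connected network in place of the Atari convolutional net; the layer sizes, C, and other hyperparameters are illustrative, not the published settings:

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

# Replay buffer + frozen target network + error clipping, on a toy-sized Q-network.
n_obs, n_actions, C, gamma = 4, 2, 1_000, 0.99
q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
replay = deque(maxlen=100_000)      # experience replay buffer of (s, a, r, s2, done) tuples
updates = 0

def learn(batch_size=32):
    global updates
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)       # uniform sampling breaks correlations
    s, a, r, s2, done = (torch.tensor(x) for x in zip(*batch))
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                           # targets come from the frozen network
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())
    # Clipping the TD error to [-1, 1] corresponds to a Huber (smooth L1) loss.
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    updates += 1
    if updates % C == 0:                            # refresh the target network every C updates
        target_net.load_state_dict(q_net.state_dict())
```

Copying the weights only every C updates keeps the bootstrapping target from chasing the network that is currently being trained.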

## 16.6 Mastering the Game of Go

MCTS guided by deep policy and value networks (in AlphaGo Zero, a single residual network with policy and value heads).
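Inside the search, moves are chosen by a PUCT-style rule: the mean action value Q plus an exploration bonus scaled by the network's prior probability P. A toy sketch; the node layout, c_puct, and the example statistics are illustrative:

```python
import math

# PUCT-style selection at one tree node: argmax over Q + c_puct * P * sqrt(sum N) / (1 + N).
def select_action(node, c_puct=1.0):
    """node maps action -> {"N": visit count, "W": total value, "P": prior probability}."""
    total_n = sum(stats["N"] for stats in node.values())
    def score(action):
        stats = node[action]
        q = stats["W"] / stats["N"] if stats["N"] > 0 else 0.0
        u = c_puct * stats["P"] * math.sqrt(total_n) / (1 + stats["N"])
        return q + u
    return max(node, key=score)

# Example: one expanded node with three candidate moves.
node = {
    "a": {"N": 10, "W": 6.0, "P": 0.5},
    "b": {"N": 2,  "W": 1.5, "P": 0.3},
    "c": {"N": 0,  "W": 0.0, "P": 0.2},
}
print(select_action(node))      # -> "b": good value estimate plus a still-large bonus
```

Rarely visited moves with a nonzero prior keep getting tried because the bonus term dominates while N is small.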

### 16.6.1 AlphaGo

### 16.6.2 AlphaGo Zero

## 16.7 Personalized Web Services

The objective is to maximize the click-through rate; this can be framed as a contextual bandit problem. Two formulations are contrasted:

- Greedy optimization: maximize only the probability of immediate clicks (sketched below).
- Life-time value (LTV) optimization: improve the total number of clicks a user makes over multiple visits to the website.
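A toy sketch of the greedy (immediate-click) formulation as an epsilon-greedy contextual bandit with one linear click model per action; the feature dimension, epsilon, and learning rate are arbitrary choices:

```python
import numpy as np

# Epsilon-greedy contextual bandit: recommend the action with the highest predicted
# click score for the current user context, then update that action's model online.
rng = np.random.default_rng(0)
n_actions, n_features, epsilon, lr = 5, 10, 0.1, 0.05
weights = np.zeros((n_actions, n_features))

def choose(context):
    """Exploit the best predicted action, exploring uniformly with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(weights @ context))

def update(action, context, clicked):
    """Logistic-regression-style SGD step toward the observed click (1) or no click (0)."""
    p = 1.0 / (1.0 + np.exp(-(weights[action] @ context)))
    weights[action] += lr * (clicked - p) * context

# One simulated interaction: observe a user context, recommend, then learn from the click.
context = rng.normal(size=n_features)
a = choose(context)
update(a, context, clicked=1)
```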