# Reinforcement Learning 5: Monte Carlo Methods

# 5 Monte Carlo Methods

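Monte Carlo methods do not assume complete knowledge of the environment; they learn from experience, i.e. sampled sequences of states, actions, and rewards.

When learning from simulated experience, the model need only generate sample transitions, not the complete probability distributions of all possible transitions.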

## 5.1 Monte Carlo Prediction

**The first-visit MC method**

The every-visit MC method
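A minimal sketch of both prediction variants, assuming each episode is stored as a list of `(state, reward)` pairs where `reward` is the reward received on leaving `state`. The function name `mc_prediction` and this episode encoding are illustrative choices, not part of the original notes.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate state values by averaging sampled returns.

    episodes: list of episodes, each a list of (state, reward) pairs
    (an assumed encoding for this sketch).
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Time index of the first occurrence of each state in this episode.
        first_index = {}
        for t, (state, _) in enumerate(episode):
            first_index.setdefault(state, t)
        g = 0.0
        # Walk backwards so that g is the return following time t.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if first_visit and first_index[state] != t:
                continue  # first-visit: skip later visits to the same state
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# One episode that visits state "A" twice.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
v_first = mc_prediction(episodes, first_visit=True)
v_every = mc_prediction(episodes, first_visit=False)
```

The two variants differ only in whether later visits to a state within the same episode contribute a return to the average.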

Example 5.1: Blackjack

Example 5.2: Soap Bubble

## 5.2 Monte Carlo Estimation of Action Values

maintaining exploration -> exploring starts

## 5.3 Monte Carlo Control

**Monte Carlo with Exploring Starts**

Example 5.3: Solving Blackjack

## 5.4 Monte Carlo Control without Exploring Starts

on-policy methods and off-policy methods

$\epsilon$-greedy policy

For any $\epsilon$-soft policy, $\pi$, any $\epsilon$-greedy policy with respect to $q_\pi$ is guaranteed to be better than or equal to $\pi$.
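A short sketch of the action probabilities of an $\epsilon$-greedy policy in a single state, assuming the action values are given as a plain list (the function name `epsilon_greedy_probs` is hypothetical). Every action receives probability $\epsilon / |\mathcal{A}(s)|$ and the greedy action receives the remaining $1 - \epsilon$ on top, so the result is itself $\epsilon$-soft.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state.

    q_values: list of action-value estimates, one per action.
    Ties are broken in favour of the lowest action index.
    """
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n        # exploration mass, shared equally
    probs[greedy] += 1.0 - epsilon   # remaining mass goes to the greedy action
    return probs


probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.3)
```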

**On-policy first-visit MC control (for $\epsilon$-soft policies)**
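The control loop might be sketched as follows. The environment interface `step(state, action) -> (next_state, reward, done)`, the function name, and the toy task are all assumptions made for illustration; the $\epsilon$-greedy policy is kept implicit in the action-value table rather than stored separately.

```python
import random
from collections import defaultdict


def on_policy_mc_control(step, start_state, actions, episodes=500,
                         epsilon=0.1, gamma=1.0, seed=0):
    """On-policy first-visit MC control with an epsilon-greedy policy.

    step: hypothetical environment function
          (state, action) -> (next_state, reward, done).
    """
    rng = random.Random(seed)
    q = defaultdict(float)      # action-value estimates
    counts = defaultdict(int)   # number of first-visit returns averaged

    def choose(state):
        # Epsilon-greedy with respect to the current estimates.
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    for _ in range(episodes):
        # Generate one episode by following the current policy.
        trajectory = []
        state, done = start_state, False
        while not done:
            action = choose(state)
            next_state, reward, done = step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
        # Record the first time each state-action pair occurs.
        first_index = {}
        for t, (s, a, _) in enumerate(trajectory):
            first_index.setdefault((s, a), t)
        # Backward pass: incremental average of first-visit returns.
        g = 0.0
        for t in range(len(trajectory) - 1, -1, -1):
            s, a, r = trajectory[t]
            g = gamma * g + r
            if first_index[(s, a)] == t:
                counts[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
    return q


def toy_step(state, action):
    # Hypothetical one-step task: action 1 pays 1, action 0 pays 0.
    return None, float(action), True


q = on_policy_mc_control(toy_step, start_state=0, actions=[0, 1])
```

Because exploration never stops, the learned policy is the best among $\epsilon$-soft policies rather than the optimal deterministic one.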

## 5.5 Off-policy Prediction via Importance Sampling

The dilemma: learn action values for the optimal policy while behaving non-optimally in order to explore.

A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior.

The policy being learned about is called the *target policy*, and the policy used to generate behavior is called the *behavior policy*.

In this case we say that learning is from data “off” the target policy, and the overall process is termed *off-policy learning*.

*importance sampling*

*importance-sampling ratio*
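For reference, the ratio can be written out in the standard notation, with target policy $\pi$ and behavior policy $b$ (the symbol $b$ and the episode end time $T$ are notation assumed here, since the notes do not define them). It is the relative probability of the trajectory from time $t$ onward under the two policies:

$$
\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
$$

The environment's transition probabilities cancel, so the ratio depends only on the two policies. It is well defined under the coverage assumption that $b(a \mid s) > 0$ whenever $\pi(a \mid s) > 0$.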

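*ordinary importance sampling*: a simple average of the ratio-scaled returns; unbiased, but its variance can be large (in general unbounded)

*weighted importance sampling*: a weighted average of the returns, with the ratios as weights; biased (the bias converges to zero as more returns are observed), but its variance is typically much lower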
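A small sketch contrasting the two estimators for a single state, assuming the data is given as `(rho, g)` pairs of importance-sampling ratio and observed return (this encoding and the function name are illustrative).

```python
def importance_sampling_estimates(samples):
    """Return (ordinary, weighted) importance-sampling value estimates.

    samples: list of (rho, g) pairs, where rho is the importance-sampling
    ratio and g is the return observed under the behavior policy.
    """
    scaled_sum = sum(rho * g for rho, g in samples)
    # Ordinary: divide by the number of returns.
    ordinary = scaled_sum / len(samples)
    # Weighted: divide by the sum of the ratios (defined as 0 if that sum is 0).
    total_weight = sum(rho for rho, _ in samples)
    weighted = scaled_sum / total_weight if total_weight > 0 else 0.0
    return ordinary, weighted


# Two returns: one with ratio 4 and return 1, one with ratio 0.
ordinary, weighted = importance_sampling_estimates([(4.0, 1.0), (0.0, 0.0)])
```

In this example the ordinary estimate is 2.0, larger than any return actually observed, while the weighted estimate stays at 1.0; a weighted average can never leave the range of the observed returns.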

**Example 5.4: Off-policy Estimation of a Blackjack State Value**

**Example 5.5: Infinite Variance**

## 5.6 Incremental Implementation

**Off-policy MC prediction**
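A sketch of the incremental weighted-importance-sampling update for action values, assuming logged episodes of `(state, action, reward)` triples and two hypothetical callables, `target_prob(s, a)` and `behavior_prob(s, a)`, that return the action probabilities under each policy. With cumulative weight $C$ and current ratio $W$, each step applies $C \leftarrow C + W$ and $Q \leftarrow Q + \frac{W}{C}(G - Q)$.

```python
from collections import defaultdict


def off_policy_mc_prediction(episodes, target_prob, behavior_prob, gamma=1.0):
    """Estimate action values of the target policy from behavior-policy data."""
    q = defaultdict(float)  # action-value estimates
    c = defaultdict(float)  # cumulative sum of weights per state-action pair
    for episode in episodes:
        g, w = 0.0, 1.0
        for state, action, reward in reversed(episode):
            g = gamma * g + reward
            c[(state, action)] += w
            q[(state, action)] += (w / c[(state, action)]) * (g - q[(state, action)])
            # Extend the ratio to cover this step for the earlier time steps.
            w *= target_prob(state, action) / behavior_prob(state, action)
            if w == 0.0:
                break  # earlier steps would receive zero weight
    return q


# Target policy always picks action 1; behavior policy is uniform over {0, 1}.
episodes = [
    [("s", 1, 0.0), ("t", 1, 2.0)],
    [("s", 1, 0.0), ("t", 0, 10.0)],
    [("s", 1, 1.0), ("t", 1, 0.0)],
]
q = off_policy_mc_prediction(
    episodes,
    target_prob=lambda s, a: 1.0 if a == 1 else 0.0,
    behavior_prob=lambda s, a: 0.5,
)
```

The second episode ends with an action the target policy never takes, so it contributes nothing to the estimate for `("s", 1)`.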

## 5.7 Off-policy Monte Carlo Control

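**Off-policy MC control**

Why not sample from policy $\pi$ directly? Because the target policy $\pi$ is greedy and therefore deterministic, it would not explore; a soft behavior policy keeps every state-action pair reachable.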
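A sketch of the control variant, assuming logged episodes of `(state, action, reward, b_prob)` tuples, where `b_prob` is the probability with which the behavior policy chose that action (this logging format and the function name are assumptions). The target policy is greedy with respect to the current estimates, so the ratio for a matching action is $1 / b(A_t \mid S_t)$, and the backward pass stops as soon as the behavior action differs from the greedy one.

```python
from collections import defaultdict


def off_policy_mc_control(episodes, actions, gamma=1.0):
    """Learn a greedy target policy from behavior-policy episodes."""
    q = defaultdict(float)  # action-value estimates
    c = defaultdict(float)  # cumulative sum of weights
    policy = {}             # greedy target policy: state -> action
    for episode in episodes:
        g, w = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            g = gamma * g + reward
            c[(state, action)] += w
            q[(state, action)] += (w / c[(state, action)]) * (g - q[(state, action)])
            # Ties are broken in favour of the earliest action in `actions`.
            policy[state] = max(actions, key=lambda a: q[(state, a)])
            if action != policy[state]:
                break  # target policy would not take this action: ratio is 0
            w /= b_prob  # target probability of the greedy action is 1
    return q, policy


episodes = [
    [("s", 0, 1.0, 0.5)],
    [("s", 1, 3.0, 0.5)],
    [("s", 0, 0.0, 0.5), ("u", 1, 2.0, 0.5)],
]
q, policy = off_policy_mc_control(episodes, actions=[0, 1])
```

One consequence of the early exit is that learning only uses the tail of each episode after the last non-greedy action, which can make learning slow when such actions are common.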

**Exercise 5.8: Racetrack**

## 5.8 *Discounting-aware Importance Sampling

rewrite formula with discounting

## 5.9 *Per-reward Importance Sampling

rewrite formula with discounting

## 5.10 Summary

learning without a model

do not bootstrap