Monte Carlo Policy Improvement in RL: Epsilon‑Greedy, On‑Policy vs Off‑Policy, and Incremental Updates

This tutorial explains how Monte Carlo methods are enhanced in reinforcement learning through epsilon‑greedy and epsilon‑soft policies, Monte Carlo control, a Blackjack Q‑function example, the distinction between on‑policy and off‑policy learning, importance sampling, and efficient incremental update techniques.


Soft and ε‑soft policies

A soft policy assigns non‑zero probability to every action in every state: π(a|s) > 0 for all s ∈ S, a ∈ A(s). An ε‑soft policy guarantees a minimum probability ε/|A(s)| for each action, i.e., π(a|s) ≥ ε/|A(s)|.

ε‑greedy policy

An ε‑greedy policy selects the greedy (highest‑value) action with probability 1‑ε and chooses uniformly among all actions with the remaining probability ε. Consequently each action receives at least ε/|A(s)| probability.
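As a concrete illustration, here is a minimal Python sketch of ε‑greedy selection; the function name `epsilon_greedy_action` and the tabular `q_values` array are illustrative assumptions, not code from the article:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """Select an action epsilon-greedily from q_values, a 1-D array of Q(s, a) for the current state."""
    n_actions = len(q_values)
    if rng.random() < epsilon:
        # Explore: uniform over all actions, so every action keeps probability >= epsilon / |A(s)|.
        return int(rng.integers(n_actions))
    # Exploit: greedy action under the current estimates (ties broken by the first maximal index).
    return int(np.argmax(q_values))

# Example: with epsilon = 0.1 and two actions, the greedy action is chosen with probability 0.95.
rng = np.random.default_rng(0)
action = epsilon_greedy_action(np.array([0.2, -0.4]), epsilon=0.1, rng=rng)
```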

Monte Carlo control

The control algorithm seeks the optimal policy by repeatedly performing:

Policy evaluation: estimate the action‑value function Qπ(s,a) as the average return G observed from many sampled episodes generated under the current policy π.

Policy improvement: replace π with an ε‑greedy policy derived from the updated Qπ.

In practice a single evaluation step per improvement iteration is often sufficient. An alternative is to update Q and π after each complete trajectory, affecting only the visited state‑action pairs.
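The loop above can be sketched in Python as follows; this is a minimal on‑policy, every‑visit variant with a constant step size, assuming a Gymnasium‑style environment API (`reset()` returning `(obs, info)`, `step()` returning a 5‑tuple) rather than the article's exact implementation:

```python
import numpy as np
from collections import defaultdict

def mc_control(env, n_episodes, epsilon=0.1, alpha=0.005, gamma=1.0, seed=0):
    """On-policy Monte Carlo control with an epsilon-greedy behavior policy."""
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))

    for _ in range(n_episodes):
        # Policy evaluation data: generate one episode under the current epsilon-greedy policy.
        episode, done = [], False
        state = env.reset()[0]
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # Policy improvement: update Q (and hence the epsilon-greedy policy) for visited pairs only.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            Q[state][action] += alpha * (G - Q[state][action])  # constant-alpha MC update
    return Q
```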

Blackjack example

Using the classic 21‑point Blackjack task, the learned Q‑value for the state where the player holds 21 (no usable ace) and the dealer shows 6, with action “hit”, converges to approximately –0.88 when the step size is α = 0.005. Hitting on 21 always busts, so the true value is –1; the estimate sits above it because actions are chosen largely at random early in training, and once a large Q‑value gap has opened up, this low‑reward action is rarely sampled again, so the early, inaccurate average persists. When the dealer shows 7 the same Q‑value does converge to –1, illustrating the impact of sampling randomness and step‑size choice on convergence. With α = 0.005 convergence is slow; more than five million iterations, or a larger α, would bring the estimate closer to the true value.
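To reproduce this kind of measurement, one could run the `mc_control` sketch above on Gymnasium's `Blackjack-v1` environment, whose observation is (player sum, dealer's showing card, usable ace) and whose actions are 0 = stick, 1 = hit; the reported value comes from the article, while the code below (including ε = 0.1) is only an assumed setup:

```python
import gymnasium as gym

env = gym.make("Blackjack-v1")
Q = mc_control(env, n_episodes=5_000_000, epsilon=0.1, alpha=0.005)

# State (player sum 21, dealer shows 6, no usable ace); some environment versions encode
# the usable-ace flag as False rather than 0, so fall back if the first key is absent.
state = (21, 6, 0)
if state not in Q:
    state = (21, 6, False)
print("Q(21, dealer 6, hit) =", Q[state][1])  # expected to settle near -0.88 per the article
```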

Deriving the optimal policy

For each state, compare the Q‑values of “hit” and “stick”. The action with the higher Q‑value is selected as the optimal action. Experiments show that the policy extracted from the learned Q‑function matches the true optimal Blackjack strategy, confirming that the chosen ε and α hyper‑parameters are appropriate for this problem.
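In code, this extraction is a single pass over the learned table; the sketch below reuses the hypothetical `Q` dictionary from the examples above:

```python
import numpy as np

# Greedy policy extraction: for each state pick the action with the larger Q-value (0 = stick, 1 = hit).
policy = {state: int(np.argmax(action_values)) for state, action_values in Q.items()}
```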

On‑policy vs off‑policy

On‑policy methods (e.g., the Monte Carlo control described above) use the same policy for generating data and for policy improvement. Off‑policy methods separate a behavior policy b, which generates the samples, from a target policy π, which is being learned. The coverage condition requires that whenever π(a|s) > 0, the behavior policy also satisfies b(a|s) > 0; otherwise the action cannot be evaluated.
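A small sketch of the coverage check, with `pi` and `b` as hypothetical tabular policies mapping each state to an array of action probabilities:

```python
def has_coverage(pi, b):
    """Return True if b(a|s) > 0 for every (s, a) with pi(a|s) > 0."""
    return all(
        b[s][a] > 0
        for s in pi
        for a in range(len(pi[s]))
        if pi[s][a] > 0
    )
```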

Importance sampling

To estimate expectations under the target policy π while sampling from b, importance sampling re‑weights each return G by the likelihood ratio of the trajectory that produced it. For a single step the ratio is ρ = π(a|s) / b(a|s); for a multi‑step return starting at time t it is the product ρ_{t:T−1} = ∏_{k=t}^{T−1} π(a_k|s_k) / b(a_k|s_k) over the rest of the episode. Formally, Eπ[G] = E_b[ρ G], so averaging the re‑weighted returns gives an unbiased estimate of the Q‑function under π.
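A minimal sketch of ordinary importance sampling for full‑episode returns; `pi` and `b` are hypothetical tabular policies (state → array of action probabilities), and each episode is assumed to be a list of (state, action, reward) triples generated under b:

```python
import numpy as np

def ordinary_is_estimate(episodes, pi, b, gamma=1.0):
    """Estimate E_pi[G] from episodes sampled under the behavior policy b."""
    weighted_returns = []
    for episode in episodes:
        # Trajectory likelihood ratio: product of per-step ratios pi(a|s) / b(a|s).
        rho = 1.0
        for state, action, _ in episode:
            rho *= pi[state][action] / b[state][action]
        # Discounted return of the episode.
        G = 0.0
        for _, _, reward in reversed(episode):
            G = gamma * G + reward
        weighted_returns.append(rho * G)
    return float(np.mean(weighted_returns))
```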

Incremental update mechanism

Storing all returns for each state‑action pair leads to O(n) update cost. A recursive formula reduces this to O(1):

Q_new = Q_old + (1 / N) * (G - Q_old), where N is the visit count.

Introducing a constant step size α yields the constant‑α MC update:

Q_new = Q_old + α * (G - Q_old)

A smaller α slows convergence but reduces variance near the true value; a larger α speeds convergence at the cost of higher variance.
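The two update rules can be written as one small helper; the class name `IncrementalQ` is illustrative, not from the article:

```python
class IncrementalQ:
    """O(1) return averaging: sample-average (1/N) steps, or constant-alpha MC if alpha is given."""

    def __init__(self, alpha=None):
        self.alpha = alpha   # None -> sample-average; a float -> constant step size
        self.value = 0.0
        self.count = 0

    def update(self, G):
        self.count += 1
        step = self.alpha if self.alpha is not None else 1.0 / self.count
        self.value += step * (G - self.value)  # Q_new = Q_old + step * (G - Q_old)
        return self.value
```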

Conclusion

The article clarifies how ε‑soft and ε‑greedy policies enable stable policy improvement within Monte Carlo control, distinguishes on‑policy from off‑policy learning, derives the importance‑sampling correction needed for off‑policy estimation, and presents efficient incremental update formulas for practical implementation.

Reference: Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.). http://incompleteideas.net/book/RLbook2020.pdf

Tags: reinforcement learning, importance sampling, incremental update, epsilon-greedy, Monte Carlo, off-policy, on-policy
Written by AI Algorithm Path, a public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
