Why Temporal Difference Beats Monte Carlo: Mastering the Bellman Equation

Explore how the Bellman equation underpins reinforcement learning, comparing Dynamic Programming, Monte Carlo, and Temporal‑Difference methods, and discover why TD’s low‑variance, online updates make it a powerful bridge between model‑based planning and sample‑based learning.

Didi Tech

Introduction

This article explains the core of reinforcement‑learning problems – the Bellman equation – and shows how three families of algorithms (Dynamic Programming, Monte Carlo and Temporal Difference) solve it in different ways.

Bellman Equation

For a given policy π, the state‑value function satisfies the Bellman expectation equation:

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ v_π(s') ]

The equation reveals a recursive structure: the value of a state equals the immediate reward plus the discounted value of the next state.
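To make the recursion concrete, here is a minimal sketch that solves the Bellman expectation equation directly as the linear system v = r_π + γ P_π v. The three-state transition matrix, rewards, and γ = 0.9 are invented for illustration; they are not from any particular problem.

```python
import numpy as np

# Hypothetical 3-state process under a fixed policy pi:
# P_pi[s, s'] is the transition matrix induced by pi,
# r_pi[s] is the expected immediate reward from state s under pi.
P_pi = np.array([[0.0, 0.9, 0.1],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])   # state 2 is absorbing
r_pi = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Bellman expectation equation in matrix form: v = r_pi + gamma * P_pi @ v
# Rearranged: (I - gamma * P_pi) v = r_pi, a plain linear system.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
```

Solving the system directly like this is only feasible for tiny state spaces; the iterative methods below exist precisely because real problems are too large for it.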

[Figure: Bellman equation]

Dynamic Programming (DP)

DP assumes a complete model p(s',r|s,a). With full knowledge of the dynamics, the Bellman expectation equation becomes a linear system that can be solved by sweeping over all actions and successor states – a process called a full backup. The iterative policy‑evaluation update is:

V(s) ← Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ V(s') ]

DP enjoys convergence guarantees but suffers from two engineering limits: the model is often unavailable, and the state space can be huge (the “curse of dimensionality”).
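The full-backup sweep can be sketched as follows. The environment here is a hypothetical four-state chain (state 3 terminal, reward −1 per step) evaluated under a uniform random policy; all of these choices are toy assumptions for illustration.

```python
import numpy as np

# Iterative policy evaluation with full backups on a hypothetical
# 1D chain: states 0..3, state 3 terminal, reward -1 per step.
N_STATES, GAMMA, THETA = 4, 1.0, 1e-8

def model(s, a):
    """Assumed known model: deterministic (next_state, reward)."""
    if s == 3:                         # terminal: self-loop, no reward
        return s, 0.0
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, 3)
    return s2, -1.0

V = np.zeros(N_STATES)
policy = [[0.5, 0.5]] * N_STATES       # uniform random policy pi(a|s)
while True:
    delta = 0.0
    for s in range(N_STATES):
        v_new = 0.0
        for a, pa in enumerate(policy[s]):      # full backup: all actions...
            s2, r = model(s, a)
            v_new += pa * (r + GAMMA * V[s2])   # ...and all successors
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < THETA:                  # stop when the sweep barely moves V
        break
```

For this toy chain the sweeps converge to V ≈ [−12, −10, −6, 0]: minus the expected number of steps to reach the terminal state.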

[Figure: DP full backup]

Monte Carlo (MC)

MC discards the model and estimates values by averaging the returns

G_t = R_{t+1} + γ R_{t+2} + … + γ^{T−t−1} R_T

observed at the end of each episode. The incremental update rule is:

V(S_t) ← V(S_t) + α ( G_t − V(S_t) )

Because it must wait for complete episodes, MC has high variance and cannot be applied to continuing tasks, but it requires no knowledge of the transition probabilities or rewards.
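A first-visit MC sketch on the same kind of toy chain (states 0..3, state 3 terminal, reward −1 per step, uniform random policy; all invented for illustration). Note that no model is consulted: the agent only samples whole episodes.

```python
import random

# First-visit Monte Carlo prediction on a hypothetical random walk.
random.seed(0)
GAMMA, ALPHA = 1.0, 0.05
V = [0.0] * 4                          # V[3] stays 0 (terminal)

def run_episode(start=0):
    """Sample one complete episode under the uniform random policy."""
    s, traj = start, []
    while s != 3:
        a = random.choice([0, 1])      # left or right, 50/50
        traj.append((s, -1.0))
        s = max(s - 1, 0) if a == 0 else s + 1
    return traj

for _ in range(20000):
    traj = run_episode()
    G, first_visit_return = 0.0, {}
    for s, r in reversed(traj):        # backward pass accumulates G_t
        G = r + GAMMA * G
        first_visit_return[s] = G      # ends holding the earliest visit's G
    for s, G in first_visit_return.items():
        V[s] += ALPHA * (G - V[s])     # V(s) <- V(s) + alpha (G - V(s))
```

The backward pass overwrites each state's entry until its earliest occurrence, which is exactly the first-visit return; the estimates drift toward roughly −12, −10, −6, but noisily, reflecting MC's high variance.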

[Figure: Monte Carlo return]

Temporal Difference (TD)

TD combines the bootstrapping of DP with the model‑free sampling of MC. It updates after every step using the TD target:

R_{t+1} + γ V(S_{t+1})

The TD error is

δ_t = R_{t+1} + γ V(S_{t+1}) – V(S_t)

and the TD(0) update is

V(S_t) ← V(S_t) + α δ_t

TD has lower variance than MC and can learn online, but it introduces bias because it relies on the current estimate V(S_{t+1}). The bias‑variance trade‑off can be tuned with n‑step TD:

G_t^{(n)} = R_{t+1}+γR_{t+2}+…+γ^{n-1}R_{t+n}+γ^n V(S_{t+n})

When n=1 we obtain TD(0); as n→∞ we recover MC.
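A TD(0) sketch on the same hypothetical chain used above (states 0..3, state 3 terminal, reward −1 per step, uniform random policy; toy assumptions throughout). The key contrast with MC is that V changes after every single step, inside the episode.

```python
import random

# TD(0) prediction on a hypothetical random walk.
random.seed(0)
GAMMA, ALPHA = 1.0, 0.05
V = [0.0] * 4                          # V[3] stays 0 (terminal)

for _ in range(20000):
    s = 0
    while s != 3:
        a = random.choice([0, 1])      # uniform random policy
        s2 = max(s - 1, 0) if a == 0 else s + 1
        # TD error: delta_t = R + gamma * V(S') - V(S)
        delta = -1.0 + GAMMA * V[s2] - V[s]
        V[s] += ALPHA * delta          # online, bootstrapped update
        s = s2
```

With enough steps the estimates approach the same true values as MC would (about −12, −10, −6 here), but each individual update has much lower variance because it looks only one reward ahead.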

[Figure: TD backup diagram]

SARSA vs. Q‑Learning

Both are TD control algorithms. SARSA is on‑policy: it updates using the action actually taken (A_{t+1}), so the update target is R_{t+1} + γ Q(S_{t+1}, A_{t+1}). Because the target reflects the exploration the agent actually performs, SARSA is more conservative and better suited to high‑risk environments.

Q‑Learning is off‑policy: it always uses the maximal next‑action value, max_a Q(S_{t+1},a), regardless of the behavior policy. This yields faster convergence to the optimal greedy policy but can be overly aggressive.
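The two algorithms can be sketched with a single training loop that differs only in the bootstrap target. The environment is again a hypothetical four-state chain (terminal state 3, reward −1 per step), and the ε-greedy behavior policy, α, and episode count are invented toy settings.

```python
import random

# SARSA vs. Q-learning on a hypothetical chain; only the target differs.
random.seed(0)
GAMMA, ALPHA, EPS = 1.0, 0.1, 0.1

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.choice([0, 1])           # explore
    return max((0, 1), key=lambda a: Q[s][a])  # exploit

def env_step(s, a):
    return (max(s - 1, 0) if a == 0 else s + 1), -1.0

def train(off_policy, episodes=5000):
    Q = [[0.0, 0.0] for _ in range(4)]         # Q[3] stays 0 (terminal)
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s)
        while s != 3:
            s2, r = env_step(s, a)
            a2 = eps_greedy(Q, s2)
            if off_policy:                     # Q-learning: max_a Q(S', a)
                target = r + GAMMA * max(Q[s2])
            else:                              # SARSA: Q(S', A') actually taken
                target = r + GAMMA * Q[s2][a2]
            Q[s][a] += ALPHA * (target - Q[s][a])
            s, a = s2, a2
    return Q

q_sarsa = train(off_policy=False)
q_qlearn = train(off_policy=True)
```

Both learners end up preferring "right" in every non-terminal state; Q-learning's greedy target makes its estimates track the optimal values (e.g. Q(2, right) → −1) regardless of how much the behavior policy explores.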

[Figure: SARSA update]
[Figure: Q‑Learning update]

Conclusion

Temporal‑Difference learning is a fundamental building block of modern reinforcement learning. By blending bootstrapping with sampling, it offers a flexible bias‑variance trade‑off, supports n‑step extensions, and underlies popular deep RL algorithms such as DQN and A3C. Understanding the Bellman equation, DP, MC, and TD – and the distinction between on‑policy (SARSA) and off‑policy (Q‑Learning) control – is essential for mastering advanced AI techniques.

[Figure: TD as a core RL component]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: dynamic programming, Monte Carlo, Q-learning, SARSA, Bellman equation, temporal difference
Written by

Didi Tech

Official Didi technology account
