Understanding Temporal‑Difference Algorithms in Reinforcement Learning
This tutorial explains temporal‑difference (TD) learning: it compares TD with dynamic programming and Monte‑Carlo methods, walks through concrete soccer‑match examples, contrasts one‑step TD with constant‑α Monte‑Carlo updates, discusses convergence and bias, and introduces popular TD variants such as Sarsa, Q‑learning, Expected Sarsa, and double learning.
Reinforcement learning (RL) studies how an agent can learn optimal policies by interacting with a complex environment and receiving rewards. The article begins by recalling that dynamic programming (DP) updates state values using other states’ estimates (bootstrapping) and that Monte‑Carlo (MC) methods wait until an episode ends before updating.
Temporal‑Difference (TD) Learning
TD combines the strengths of DP and MC: like DP it bootstraps, updating a state's value from estimates n time steps later, and like MC it needs no model of the environment. The simplest case, TD(0), updates the state immediately after the next step (one‑step TD). The TD error is defined as δ = R + γ·V(s') – V(s), where γ is the discount factor.
“The TD error measures the difference between the observed reward plus the discounted next‑state value and the current estimate.”
Because TD updates after every step rather than waiting for an episode to end, it typically converges faster than MC in practice, especially when episodes are long.
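The one‑step update can be sketched as follows. This is a minimal illustration of the TD(0) rule defined above; the two‑state value table and the α, γ values are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update: V(s) <- V(s) + alpha * (R + gamma*V(s') - V(s))."""
    td_error = r + gamma * V[s_next] - V[s]   # delta = R + gamma*V(s') - V(s)
    V[s] += alpha * td_error
    return td_error

# Hypothetical two-state value table for illustration.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=1.0)
# delta = 1.0 + 1.0 - 0.0 = 2.0, so V["A"] becomes 0.0 + 0.5 * 2.0 = 1.0
```

Note that the update touches only the visited state and needs only one observed transition, which is why no environment model is required.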
Illustrative Soccer Example
The article uses a simplified Copa América tournament to compare constant‑α MC and one‑step TD. Initial value estimates for each team are shown as yellow points; the algorithm’s predictions are the yellow line, while actual cumulative goal differences are the black line. After each match, TD updates the prediction using the observed reward (goal difference) and the change in value between consecutive states.
- Constant‑α MC update for Chile: initial value v = 1.2; cumulative reward (return) after the third match G = 3; update step Δ = α·(G – v) = 0.5·(3 – 1.2) = 0.9; new value v = 1.2 + 0.9 = 2.1.
- One‑step TD update for the fourth match (Chile vs. Ecuador): the article computes the TD error as δ = R + (v[t+1] – v[t]) = 5 + (–0.6) = 4.4, multiplies by the learning rate α = 0.5 to obtain the update α·δ = 2.2, and updates the state value to v = 1.5 + 2.2 = 3.7.
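The arithmetic above can be checked directly. The numbers are taken from the article; the first update is the constant‑α MC rule and the second is the one‑step TD rule with the TD error already computed.

```python
alpha = 0.5  # learning rate used in the article

# Constant-alpha MC update for Chile: v <- v + alpha * (G - v)
v, G = 1.2, 3.0
delta_mc = alpha * (G - v)        # 0.5 * (3 - 1.2) = 0.9
v_new = v + delta_mc              # 1.2 + 0.9 = 2.1

# One-step TD update for the fourth match, with the article's TD error.
v4 = 1.5
td_error = 4.4                    # delta = R + (v[t+1] - v[t]) = 5 + (-0.6)
v4_new = v4 + alpha * td_error    # 1.5 + 0.5 * 4.4 = 3.7
```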
Algorithm Comparison
Monte‑Carlo adjusts the estimate toward the total episode return, while one‑step TD uses bootstrapping to incorporate the next‑state value and immediate reward, leading to faster adaptation. The article notes that tabular TD methods are guaranteed to converge to the correct value function under standard step‑size conditions.
State‑update strategies differ: MC updates only after an episode ends, whereas one‑step TD updates immediately after each reward, which is advantageous for long episodes.
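The timing difference can be made concrete on a toy episode. This sketch uses a hypothetical three‑step episode; what matters is that MC writes to the value table only after the loop over the full episode, while TD writes after every transition.

```python
# Hypothetical episode: (state, reward) pairs, followed by termination.
episode = [("s0", 1.0), ("s1", 0.0), ("s2", 2.0)]
alpha, gamma = 0.5, 1.0

# Constant-alpha MC: wait for the episode to end, then update toward the return G.
V_mc = {s: 0.0 for s, _ in episode}
G = 0.0
for s, r in reversed(episode):          # accumulate the return backwards
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])

# One-step TD: update immediately after each transition.
V_td = {s: 0.0 for s, _ in episode}
V_td["terminal"] = 0.0                  # terminal state has value 0
states = [s for s, _ in episode] + ["terminal"]
for i, (s, r) in enumerate(episode):
    target = r + gamma * V_td[states[i + 1]]
    V_td[s] += alpha * (target - V_td[s])
```

For a long episode, the TD loop has already refined many estimates before the MC loop can run at all.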
TD Variants
The tutorial introduces three widely used TD‑based control algorithms:
Sarsa – an on‑policy method that updates the action‑value Q(s,a) using the next state‑action pair (s',a') observed under the current policy.
Q‑learning – an off‑policy method that replaces the next‑state action value with the maximum over all actions, i.e., max_a Q(s',a), yielding faster learning in many problems.
Expected Sarsa – computes the expected next‑state value under the current policy, reducing variance at the cost of extra computation.
All three share the same TD update formula but differ in how the target value is estimated.
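The three targets can be written side by side. This is a hedged sketch: the Q‑table is stored as a dict keyed by (state, action) pairs, and the ε‑greedy policy assumed for Expected Sarsa is an illustrative choice, not something the article fixes.

```python
def sarsa_target(Q, s_next, a_next, r, gamma):
    # On-policy: use the action actually taken in the next state.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, s_next, actions, r, gamma):
    # Off-policy: replace the next action's value with the greedy max.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def expected_sarsa_target(Q, s_next, actions, r, gamma, eps):
    # Expectation of Q(s', a') under an (assumed) epsilon-greedy policy.
    qs = [Q[(s_next, a)] for a in actions]
    best = max(qs)
    n = len(actions)
    exp_q = sum((eps / n) * q for q in qs) + (1 - eps) * best
    return r + gamma * exp_q

# Illustrative Q-table with two actions in the next state.
Q = {("s1", "a"): 1.0, ("s1", "b"): 3.0}
t_sarsa = sarsa_target(Q, "s1", "a", r=1.0, gamma=1.0)            # 2.0
t_qlearn = q_learning_target(Q, "s1", ["a", "b"], r=1.0, gamma=1.0)  # 4.0
t_exp = expected_sarsa_target(Q, "s1", ["a", "b"], r=1.0, gamma=1.0, eps=0.5)
```

In each case the update itself is the same: Q(s,a) ← Q(s,a) + α·(target – Q(s,a)); only the target function changes.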
Maximization Bias
When the max operator is applied to noisy value estimates, it produces an upward bias, especially early in learning when Q‑values are still dominated by noise. The article reproduces the classic Sutton–Barto example in which a state with two actions leads the agent to over‑estimate the left action's value, causing sub‑optimal greedy choices.
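The bias is easy to demonstrate numerically. In this assumed setup, every action's true value is exactly 0, but each is estimated from a handful of noisy samples; the max over the estimates is nevertheless clearly positive on average.

```python
import random

random.seed(0)  # seed only so the demonstration is reproducible

def noisy_estimate(n_samples=5):
    # Sample-average estimate of a true value of 0, with uniform noise.
    return sum(random.uniform(-1.0, 1.0) for _ in range(n_samples)) / n_samples

n_trials, n_actions = 10_000, 10
avg_max = sum(max(noisy_estimate() for _ in range(n_actions))
              for _ in range(n_trials)) / n_trials
# avg_max comes out clearly positive even though every true value is 0:
# max(noisy estimates) systematically over-estimates max(true values).
```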
Double Learning
To mitigate maximization bias, double learning maintains two independent Q‑functions (Q₁ and Q₂). The action with the highest value according to Q₁ is selected, but its value is evaluated using Q₂, providing an unbiased estimate. The article walks through a maze example, showing how the update equations are applied step by step.
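A minimal sketch of the double Q‑learning update follows. The 50/50 choice of which table to update, and the select‑with‑one/evaluate‑with‑the‑other pattern, follow the Sutton–Barto formulation; the states, actions, and Q‑values below are illustrative.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One double Q-learning step on dict-based Q-tables keyed by (state, action)."""
    if random.random() < 0.5:
        # Select the greedy action with Q1, evaluate it with Q2.
        a_star = max(actions, key=lambda x: Q1[(s_next, x)])
        target = r + gamma * Q2[(s_next, a_star)]
        Q1[(s, a)] += alpha * (target - Q1[(s, a)])
    else:
        # Symmetric case: select with Q2, evaluate with Q1.
        a_star = max(actions, key=lambda x: Q2[(s_next, x)])
        target = r + gamma * Q1[(s_next, a_star)]
        Q2[(s, a)] += alpha * (target - Q2[(s, a)])

# Illustrative tables and a single update.
random.seed(1)  # seed only to make the 50/50 table choice reproducible
actions = ["l", "r"]
Q1 = {("s0", "l"): 0.0, ("s1", "l"): 1.0, ("s1", "r"): 0.0}
Q2 = {("s0", "l"): 0.0, ("s1", "l"): 0.5, ("s1", "r"): 2.0}
double_q_update(Q1, Q2, "s0", "l", r=0.0, s_next="s1",
                actions=actions, alpha=1.0, gamma=1.0)
```

Because the table used for selection never evaluates its own chosen action, the upward bias of the max operator cancels out in expectation.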
Conclusion
Temporal‑difference methods are among the most widely used techniques in modern RL and also appear in other prediction tasks such as time‑series forecasting. The article covered one‑step TD; higher‑order TD( n ) methods are explored in later installments. Actor‑critic approaches, which combine policy and value learning, are mentioned as future topics.
Reference: Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd ed. (2020). http://incompleteideas.net/book/RLbook2020.pdf
AI Algorithm Path
A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering neural networks, pattern recognition, related hardware and software configurations, and open‑source projects.