Understanding Temporal‑Difference Algorithms in Reinforcement Learning
This tutorial explains temporal‑difference (TD) learning: it compares TD with dynamic programming and Monte‑Carlo methods, walks through concrete soccer‑match examples, contrasts one‑step TD with constant‑α Monte‑Carlo updates, discusses convergence and bias, and introduces popular TD variants such as Sarsa, Q‑learning, Expected Sarsa, and double learning.
Reinforcement learning (RL) studies how an agent can learn optimal policies by interacting with a complex environment and receiving rewards. The article begins by recalling that dynamic programming (DP) updates state values using other states’ estimates (bootstrapping) and that Monte‑Carlo (MC) methods wait until an episode ends before updating.
Temporal‑Difference (TD) Learning
TD combines the strengths of DP and MC: like DP it bootstraps, updating a state's value from estimates n time steps later, and like MC it needs no model of the environment. The simplest case, TD(0), updates the state immediately after the next step (one‑step TD). The TD error is defined as δ = R + γ·V(s') – V(s), where γ is the discount factor.
“The TD error measures the difference between the observed reward plus the discounted next‑state value and the current estimate.”
Because TD updates after every step rather than waiting for an episode to end, it typically converges faster than MC in practice, especially when episodes are long.
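The one‑step update can be sketched as follows. This is a minimal illustration of the TD(0) rule defined above; the two‑state value table and the α, γ values are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update: V(s) <- V(s) + alpha * (R + gamma*V(s') - V(s))."""
    td_error = r + gamma * V[s_next] - V[s]   # delta = R + gamma*V(s') - V(s)
    V[s] += alpha * td_error
    return td_error

# Hypothetical two-state value table for illustration.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=1.0)
# delta = 1.0 + 1.0 - 0.0 = 2.0, so V["A"] becomes 0.0 + 0.5 * 2.0 = 1.0
```

Note that the update touches only the visited state and needs only one observed transition, which is why no environment model is required.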
Illustrative Soccer Example
The article uses a simplified Copa América tournament to compare constant‑α MC and one‑step TD. Initial value estimates for each team are shown as yellow points; the algorithm’s predictions are the yellow line, while actual cumulative goal differences are the black line. After each match, TD updates the prediction using the observed reward (goal difference) and the change in value between consecutive states.
- Constant‑α MC update for Chile: initial value v = 1.2; cumulative reward (return) after the third match G = 3; update step Δ = α·(G – v) = 0.5·(3 – 1.2) = 0.9; new value v = 1.2 + 0.9 = 2.1.
- One‑step TD update for the fourth match (Chile vs. Ecuador): the article computes the TD error as δ = R + (v[t+1] – v[t]) = 5 + (–0.6) = 4.4, multiplies by the learning rate α = 0.5 to obtain the update α·δ = 2.2, and updates the state value to v = 1.5 + 2.2 = 3.7.
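The arithmetic above can be checked directly. The numbers are taken from the article; the first update is the constant‑α MC rule and the second is the one‑step TD rule with the TD error already computed.

```python
alpha = 0.5  # learning rate used in the article

# Constant-alpha MC update for Chile: v <- v + alpha * (G - v)
v, G = 1.2, 3.0
delta_mc = alpha * (G - v)        # 0.5 * (3 - 1.2) = 0.9
v_new = v + delta_mc              # 1.2 + 0.9 = 2.1

# One-step TD update for the fourth match, with the article's TD error.
v4 = 1.5
td_error = 4.4                    # delta = R + (v[t+1] - v[t]) = 5 + (-0.6)
v4_new = v4 + alpha * td_error    # 1.5 + 0.5 * 4.4 = 3.7
```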
Algorithm Comparison
Monte‑Carlo adjusts the estimate toward the total episode return, while one‑step TD uses bootstrapping to incorporate the next‑state value and immediate reward, leading to faster adaptation. The article notes that tabular TD methods are guaranteed to converge to the correct value function under standard step‑size conditions.
State‑update strategies differ: MC updates only after an episode ends, whereas one‑step TD updates immediately after each reward, which is advantageous for long episodes.
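The timing difference can be made concrete on a toy episode. This sketch uses a hypothetical three‑step episode; what matters is that MC writes to the value table only after the loop over the full episode, while TD writes after every transition.

```python
# Hypothetical episode: (state, reward) pairs, followed by termination.
episode = [("s0", 1.0), ("s1", 0.0), ("s2", 2.0)]
alpha, gamma = 0.5, 1.0

# Constant-alpha MC: wait for the episode to end, then update toward the return G.
V_mc = {s: 0.0 for s, _ in episode}
G = 0.0
for s, r in reversed(episode):          # accumulate the return backwards
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])

# One-step TD: update immediately after each transition.
V_td = {s: 0.0 for s, _ in episode}
V_td["terminal"] = 0.0                  # terminal state has value 0
states = [s for s, _ in episode] + ["terminal"]
for i, (s, r) in enumerate(episode):
    target = r + gamma * V_td[states[i + 1]]
    V_td[s] += alpha * (target - V_td[s])
```

For a long episode, the TD loop has already refined many estimates before the MC loop can run at all.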
TD Variants
The tutorial introduces three widely used TD‑based control algorithms:
Sarsa – an on‑policy method that updates the action‑value Q(s,a) using the next state‑action pair (s',a') observed under the current policy.
Q‑learning – an off‑policy method that replaces the next‑state action value with the maximum over all actions, i.e., max_a Q(s',a), yielding faster learning in many problems.
Expected Sarsa – computes the expected next‑state value under the current policy, reducing variance at the cost of extra computation.
All three share the same TD update formula but differ in how the target value is estimated.
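The three targets can be written side by side. This is a hedged sketch: the Q‑table is stored as a dict keyed by (state, action) pairs, and the ε‑greedy policy assumed for Expected Sarsa is an illustrative choice, not something the article fixes.

```python
def sarsa_target(Q, s_next, a_next, r, gamma):
    # On-policy: use the action actually taken in the next state.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, s_next, actions, r, gamma):
    # Off-policy: replace the next action's value with the greedy max.
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def expected_sarsa_target(Q, s_next, actions, r, gamma, eps):
    # Expectation of Q(s', a') under an (assumed) epsilon-greedy policy.
    qs = [Q[(s_next, a)] for a in actions]
    best = max(qs)
    n = len(actions)
    exp_q = sum((eps / n) * q for q in qs) + (1 - eps) * best
    return r + gamma * exp_q

# Illustrative Q-table with two actions in the next state.
Q = {("s1", "a"): 1.0, ("s1", "b"): 3.0}
t_sarsa = sarsa_target(Q, "s1", "a", r=1.0, gamma=1.0)            # 2.0
t_qlearn = q_learning_target(Q, "s1", ["a", "b"], r=1.0, gamma=1.0)  # 4.0
t_exp = expected_sarsa_target(Q, "s1", ["a", "b"], r=1.0, gamma=1.0, eps=0.5)
```

In each case the update itself is the same: Q(s,a) ← Q(s,a) + α·(target – Q(s,a)); only the target function changes.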
Maximization Bias
When the max operator is applied to noisy value estimates, it produces an upward bias, especially early in learning when Q‑values are still dominated by noise. The article reproduces the classic Sutton–Barto example in which a state with two actions leads the agent to over‑estimate the left action's value, causing sub‑optimal greedy choices.
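The bias is easy to demonstrate numerically. In this assumed setup, every action's true value is exactly 0, but each is estimated from a handful of noisy samples; the max over the estimates is nevertheless clearly positive on average.

```python
import random

random.seed(0)  # seed only so the demonstration is reproducible

def noisy_estimate(n_samples=5):
    # Sample-average estimate of a true value of 0, with uniform noise.
    return sum(random.uniform(-1.0, 1.0) for _ in range(n_samples)) / n_samples

n_trials, n_actions = 10_000, 10
avg_max = sum(max(noisy_estimate() for _ in range(n_actions))
              for _ in range(n_trials)) / n_trials
# avg_max comes out clearly positive even though every true value is 0:
# max(noisy estimates) systematically over-estimates max(true values).
```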
Double Learning
To mitigate maximization bias, double learning maintains two independent Q‑functions (Q₁ and Q₂). The action with the highest value according to Q₁ is selected, but its value is evaluated using Q₂, providing an unbiased estimate. The article walks through a maze example, showing how the update equations are applied step by step.
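A minimal sketch of the double Q‑learning update follows. The 50/50 choice of which table to update, and the select‑with‑one/evaluate‑with‑the‑other pattern, follow the Sutton–Barto formulation; the states, actions, and Q‑values below are illustrative.

```python
import random

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One double Q-learning step on dict-based Q-tables keyed by (state, action)."""
    if random.random() < 0.5:
        # Select the greedy action with Q1, evaluate it with Q2.
        a_star = max(actions, key=lambda x: Q1[(s_next, x)])
        target = r + gamma * Q2[(s_next, a_star)]
        Q1[(s, a)] += alpha * (target - Q1[(s, a)])
    else:
        # Symmetric case: select with Q2, evaluate with Q1.
        a_star = max(actions, key=lambda x: Q2[(s_next, x)])
        target = r + gamma * Q1[(s_next, a_star)]
        Q2[(s, a)] += alpha * (target - Q2[(s, a)])

# Illustrative tables and a single update.
random.seed(1)  # seed only to make the 50/50 table choice reproducible
actions = ["l", "r"]
Q1 = {("s0", "l"): 0.0, ("s1", "l"): 1.0, ("s1", "r"): 0.0}
Q2 = {("s0", "l"): 0.0, ("s1", "l"): 0.5, ("s1", "r"): 2.0}
double_q_update(Q1, Q2, "s0", "l", r=0.0, s_next="s1",
                actions=actions, alpha=1.0, gamma=1.0)
```

Because the table used for selection never evaluates its own chosen action, the upward bias of the max operator cancels out in expectation.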
Conclusion
Temporal‑difference methods are among the most widely used techniques in modern RL and also appear in other prediction tasks such as time‑series forecasting. The article covered one‑step TD; higher‑order TD( n ) methods are explored in later installments. Actor‑critic approaches, which combine policy and value learning, are mentioned as future topics.
Reference: Sutton, R. S., & Barto, A. G., *Reinforcement Learning: An Introduction*, 2nd ed. (2020). http://incompleteideas.net/book/RLbook2020.pdf
AI Algorithm Path
A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering neural networks, pattern recognition, related hardware and software configurations, and open‑source projects.