Why DQN Overestimates Q‑Values and How Double DQN Fixes It
The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN separates action selection from value evaluation to eliminate this bias, improving stability and performance in Atari benchmarks.
Source of Maximization Bias in DQN
DQN uses a single Q‑network both to select the greedy action (via argmax_a Q(s,a)) and to evaluate its value. Because the Q‑estimates contain zero‑mean noise ε_a, the maximization step systematically picks the highest‑biased estimate. Formally, if \hat{Q}(s,a)=Q^*(s,a)+\varepsilon_a with E[\varepsilon_a]=0, then E[\max_a \hat{Q}(s,a)] \ge \max_a Q^*(s,a). This overestimation propagates to the DQN target and makes the agent over‑confident in certain actions.
The bias inflates the target value y = r + \gamma \max_a Q(s',a;\theta) , leading to unstable learning.
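The inequality E[max_a Q̂(s,a)] ≥ max_a Q*(s,a) can be checked with a quick Monte Carlo simulation. In this sketch the true values and noise scale (`true_q`, `noise_std`) are invented for illustration, not taken from the article:

```python
import numpy as np

# Hypothetical setup: two actions with equal true value 1.0, plus
# zero-mean Gaussian estimation noise on each Q-estimate.
rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0])   # max_a Q*(s,a) = 1.0
noise_std = 0.5

# Draw many noisy estimate pairs and average the max over draws.
samples = true_q + rng.normal(0.0, noise_std, size=(100_000, 2))
mean_max = samples.max(axis=1).mean()

print(f"max_a Q*       = {true_q.max():.3f}")
print(f"E[max_a Q_hat] = {mean_max:.3f}")   # noticeably above 1.0
```

Even though each individual estimate is unbiased, the average of the max sits well above the true maximum, which is exactly the bias the DQN target inherits.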
Core Idea of Double Q‑Learning
Double Q‑learning removes the bias by decoupling action selection from value evaluation. One network selects the greedy action, while a separate network evaluates the value of that action.
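The effect of this decoupling can be illustrated numerically. The sketch below assumes two actions with equal true value and independent zero-mean Gaussian noise on the two networks' estimates; every number in it (true values, noise scale, reward, discount) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
r, gamma = 1.0, 0.9
true_q = np.array([5.5, 5.5])          # equal true action values
y_true = r + gamma * true_q.max()      # the ideal target, 5.95

n = 200_000
q_online = true_q + rng.normal(0, 0.5, size=(n, 2))
q_target = true_q + rng.normal(0, 0.5, size=(n, 2))

# DQN: one set of noisy estimates both selects and evaluates.
y_dqn = r + gamma * q_online.max(axis=1)

# Double DQN: online estimates select, independent target estimates evaluate.
a_star = q_online.argmax(axis=1)
y_ddqn = r + gamma * q_target[np.arange(n), a_star]

print(y_dqn.mean() - y_true)   # clearly positive: over-estimation
print(y_ddqn.mean() - y_true)  # near zero
```

Because the evaluating noise is independent of the selecting noise, it averages out regardless of which action was picked, so the Double target is unbiased in this setting while the single-network target is not.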
Double DQN Target Formula
The Double DQN target replaces the DQN target with:
y = r + \gamma \; Q\big(s', \; \arg\max_a Q(s',a;\theta) ; \theta^{-}\big)

Both networks share the same architecture but play different roles:
Online network Q(·;θ) selects the greedy action.
Target network Q(·;θ⁻) evaluates the selected action.
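The two targets can be written as small helper functions. This is a sketch, not the article's code; the function names are mine, and the sample values are chosen to make the difference visible:

```python
import numpy as np

def dqn_target(r, gamma, q_next):
    # DQN: one set of estimates both selects and evaluates the action.
    return r + gamma * q_next.max()

def double_dqn_target(r, gamma, q_next_online, q_next_target):
    # Double DQN: the online estimates select the greedy action,
    # the target estimates evaluate it.
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

# Illustrative next-state estimates from the two networks.
q_online = np.array([5.0, 6.0])   # Q(s',·;θ)  -- a2 looks best
q_target = np.array([5.0, 5.5])   # Q(s',·;θ⁻)

y_dqn  = dqn_target(1.0, 0.9, q_online)                    # 1 + 0.9*6.0 = 6.4
y_ddqn = double_dqn_target(1.0, 0.9, q_online, q_target)   # 1 + 0.9*5.5 = 5.95
```

Note that the selected action is the same in both cases; only the network used to score it changes.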
Training Procedure
Initialise replay buffer D, online network Q(s,a;θ) and target network Q(s,a;θ⁻).
For each episode, follow an ε‑greedy policy derived from the online network to generate transitions (s,a,r,s') and store them in D.
Sample a minibatch of transitions from D. For each sample compute the Double DQN target y as defined above and minimise the squared loss (y - Q(s,a;θ))² with stochastic gradient descent.
Every C steps update the target network parameters: θ⁻ ← θ.
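The loop above can be sketched with a tabular Q-function standing in for the deep network and a toy random-transition environment standing in for Atari; all sizes, hyper-parameters, and environment details here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 2
gamma, lr, eps, C = 0.9, 0.1, 0.1, 50

q_online = np.zeros((n_states, n_actions))
q_target = q_online.copy()
buffer = []                                   # replay buffer D

def step_env(s, a):
    # Toy stand-in environment: uniformly random next state,
    # reward 1 whenever the next state is state 0.
    s2 = int(rng.integers(n_states))
    return (1.0 if s2 == 0 else 0.0), s2

s = 0
for t in range(2000):
    # epsilon-greedy action from the ONLINE network
    a = int(rng.integers(n_actions)) if rng.random() < eps \
        else int(q_online[s].argmax())
    r, s2 = step_env(s, a)
    buffer.append((s, a, r, s2))

    # minibatch of transitions; Double DQN target for each sample
    idx = rng.integers(len(buffer), size=32)
    for bs, ba, br, bs2 in (buffer[i] for i in idx):
        a_star = int(q_online[bs2].argmax())        # online net selects
        y = br + gamma * q_target[bs2, a_star]      # target net evaluates
        q_online[bs, ba] += lr * (y - q_online[bs, ba])  # SGD on (y - Q)^2

    if t % C == 0:
        q_target = q_online.copy()                  # periodic hard update
    s = s2
```

Since every transition yields reward 1 with probability 1/5, the true value of every state–action pair is 0.2/(1 − 0.9) = 2.0, and the learned table settles near that value.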
Numerical Example of Over‑estimation
Consider a next state where the online network predicts:
Q(s',a₁) = 5.0, Q(s',a₂) = 6.0

The true optimal values are Q*(s',a₁) = 5.0 and Q*(s',a₂) = 5.5. With reward r = 1 and discount γ = 0.9:

DQN target: y_DQN = 1 + 0.9 × 6.0 = 6.4
True target: y_true = 1 + 0.9 × 5.5 = 5.95

The bias is 6.4 − 5.95 = 0.45, causing the agent to over-value action a₂.

How Double DQN Corrects the Bias
The online network selects the greedy action a₂ (argmax_a Q(s',a;θ)). The target network evaluates this action, e.g., Q(s',a₂;θ⁻) = 5.5. The Double DQN target is therefore y = 1 + 0.9 × 5.5 = 5.95, exactly matching the true target and eliminating the overestimation.

Conclusion
Separating action selection from value evaluation removes the maximization bias inherent in the original DQN update. Double DQN yields more accurate target values, improves training stability, and achieves significantly higher scores on benchmark tasks such as Atari games.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
