Why DQN Overestimates Q‑Values and How Double DQN Fixes It

The article explains how DQN’s use of the max operator introduces a maximization bias that leads to overestimated Q‑values, and shows how Double DQN separates action selection from value evaluation to eliminate this bias, improving stability and performance in Atari benchmarks.


Source of Maximization Bias in DQN

DQN uses a single Q‑network both to select the greedy action (via argmax_a Q(s,a)) and to evaluate that action's value. Because the Q‑estimates contain zero‑mean noise ε_a, the maximization step systematically favors whichever estimate happens to be pushed upward by noise. Formally, if \hat{Q}(s,a)=Q^*(s,a)+\varepsilon_a with E[\varepsilon_a]=0, then E[\max_a \hat{Q}(s,a)] \ge \max_a E[\hat{Q}(s,a)] = \max_a Q^*(s,a), because the expectation of a maximum is at least the maximum of expectations (Jensen's inequality for the convex max operator). This overestimation propagates into the DQN target and makes the agent over‑confident in certain actions.

[Figure: maximization bias illustration]

The bias inflates the target value y = r + \gamma \max_a Q(s',a;\theta), leading to unstable learning.
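To make the effect concrete, here is a small simulation, not from the original article, in which the three-action setup and noise level are assumptions: it estimates E[max_a \hat{Q}(s,a)] by Monte Carlo and compares it with max_a Q^*(s,a).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three actions whose true values are all 1.0,
# so max_a Q*(s, a) = 1.0.
true_q = np.array([1.0, 1.0, 1.0])
noise_std = 0.5          # std of the zero-mean estimation noise eps_a
num_trials = 100_000

# Each trial draws noisy estimates Q_hat = Q* + eps and takes the max,
# mimicking the max operator in the DQN target.
noisy_q = true_q + rng.normal(0.0, noise_std, size=(num_trials, 3))
mean_max = noisy_q.max(axis=1).mean()

print(f"max_a Q*(s,a)            = {true_q.max():.3f}")
print(f"E[max_a Q_hat(s,a)] (MC) ~ {mean_max:.3f}")  # comes out clearly above 1.0
```

Even though every individual estimate is unbiased, the sampled mean of the maximum lands well above the true maximum, which is exactly the quantity the DQN target bootstraps from.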

Core Idea of Double Q‑Learning

Double Q‑learning removes the bias by decoupling action selection from value evaluation. One network selects the greedy action, while a separate network evaluates the value of that action.

Double DQN Target Formula

The Double DQN target replaces the DQN target with:

y = r + \gamma \; Q\big(s', \; \arg\max_a Q(s',a;\theta) ; \theta^{-}\big)

Both networks share the same architecture but play different roles:

Online network Q(·;θ) selects the greedy action.

Target network Q(·;θ⁻) evaluates the selected action.
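As a concrete illustration, here is a minimal PyTorch-style sketch, not from the original article; the function name, tensor shapes, and terminal-state masking are assumptions.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute y = r + gamma * Q(s', argmax_a Q(s', a; theta); theta^-).

    Assumes both networks map a batch of states to a (batch, num_actions)
    tensor of Q-values; rewards and dones are 1-D float tensors of length batch.
    """
    with torch.no_grad():
        # Online network selects the greedy action a* = argmax_a Q(s', a; theta).
        greedy_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Target network evaluates that action: Q(s', a*; theta^-).
        next_q = target_net(next_states).gather(1, greedy_actions).squeeze(1)
        # Terminal transitions bootstrap from zero.
        return rewards + gamma * (1.0 - dones) * next_q
```

The only change relative to the standard DQN target is that the argmax is taken over the online network's estimates, while the value that is bootstrapped comes from the target network.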

Training Procedure

1. Initialize the replay buffer D, the online network Q(s,a;θ), and the target network Q(s,a;θ⁻).

2. For each episode, follow an ε‑greedy policy derived from the online network to generate transitions (s,a,r,s') and store them in D.

3. Sample a minibatch of transitions from D. For each sample, compute the Double DQN target y as defined above and minimize the squared loss (y − Q(s,a;θ))² with stochastic gradient descent.

4. Every C steps, copy the online parameters into the target network: θ⁻ ← θ.

[Figure: training loop]
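As a rough illustration of steps 3 and 4, the sketch below runs the minibatch update and the periodic target-network sync on synthetic transitions; the network sizes, hyperparameters, and random data are placeholders rather than details from the article (a real agent would draw minibatches from the replay buffer filled by the ε‑greedy policy in step 2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical problem size and hyperparameters (not from the article).
obs_dim, num_actions, batch_size = 4, 3, 32
gamma, sync_every, num_updates = 0.99, 100, 1_000

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))

q_net, target_net = make_net(), make_net()
target_net.load_state_dict(q_net.state_dict())       # theta^- starts equal to theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for step in range(1, num_updates + 1):
    # Stand-in for a minibatch (s, a, r, s', done) sampled from the replay buffer D.
    s = torch.randn(batch_size, obs_dim)
    a = torch.randint(num_actions, (batch_size,))
    r = torch.randn(batch_size)
    s2 = torch.randn(batch_size, obs_dim)
    done = torch.randint(2, (batch_size,)).float()

    with torch.no_grad():
        # Double DQN target: online network selects a*, target network evaluates it.
        a_star = q_net(s2).argmax(dim=1, keepdim=True)
        y = r + gamma * (1.0 - done) * target_net(s2).gather(1, a_star).squeeze(1)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    loss = F.mse_loss(q_sa, y)                             # (y - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())     # theta^- <- theta
```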

Numerical Example of Overestimation

Consider a next state where the online network predicts:

Q(s',a₁)=5.0
Q(s',a₂)=6.0

The true optimal values are Q*(s',a₁)=5.0 and Q*(s',a₂)=5.5. With reward r=1 and discount γ=0.9:

DQN target: y_DQN = 1 + 0.9 × 6.0 = 6.4
True target: y_true = 1 + 0.9 × 5.5 = 5.95

The bias is 6.4 − 5.95 = 0.45, causing the agent to over‑value action a₂.

How Double DQN Corrects the Bias

The online network selects the greedy action a₂ (argmax_a Q(s',a;θ)). The target network then evaluates this action, e.g., Q(s',a₂;θ⁻)=5.5, so the Double DQN target is y = 1 + 0.9 × 5.5 = 5.95, exactly matching the true target and eliminating the overestimation in this example.

Conclusion

Separating action selection from value evaluation removes the maximization bias inherent in the original DQN update. Double DQN yields more accurate target values, improves training stability, and achieves significantly higher scores on benchmark tasks such as Atari games.

Tags: algorithm analysis, reinforcement learning, DQN, deep Q‑learning, Double DQN, maximization bias
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
