Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It

A recent study shows that the most common implementation of KL regularization in LLM reinforcement learning (RLVR) produces biased gradients, leading to unstable training and poorer generalization, and that moving the KL term back into the reward with the simple K1 estimator can boost out‑of‑domain performance by up to 20%.

Background

With the rise of DeepSeek‑R1, reinforcement learning with verifiable rewards (RLVR) has become the dominant paradigm for post‑training large models. Both PPO and the newer GRPO share the same core idea: maximize reward while constraining the policy not to drift too far from a reference model.

The Hidden Choice

The crucial question is where to apply the KL penalty: should it be subtracted from the reward (in‑reward) or added as a regularization term in the loss function (in‑loss)? Most open‑source libraries such as VeRL, OpenRLHF, and SkyRL default to placing a low‑variance estimator (often called K3) directly in the loss for implementation convenience.
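
A minimal sketch of the two placements, using PyTorch-style toy tensors (the names and the bare REINFORCE-style surrogate are illustrative, not any library's actual API; real trainers work per token with advantages, clipping, and masking):

```python
import torch

# Toy stand-ins for one rollout batch of 4 sequences x 16 tokens.
logp     = torch.randn(4, 16, requires_grad=True)  # log pi_theta of the sampled tokens
logp_ref = torch.randn(4, 16)                      # log pi_ref of the same tokens (frozen reference)
reward   = torch.randn(4, 16)                      # verifier reward spread over tokens
beta     = 0.01                                    # KL coefficient

kl_est = logp - logp_ref  # per-token log-ratio (the K1 estimator)

# In-reward: fold the KL penalty into the reward signal that feeds the
# policy-gradient term; the penalty is optimized via the score function.
shaped_reward = reward - beta * kl_est.detach()
loss_in_reward = -(logp * shaped_reward).mean()

# In-loss: keep the raw reward and add the KL estimator to the loss,
# backpropagating through it directly (the common library default).
loss_in_loss = -(logp * reward).mean() + beta * kl_est.mean()
```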

New Findings from Mila (Bengio Team)

The paper A Comedy of Estimators: On KL Regularization in RL Training of LLMs (arXiv:2512.21852) demonstrates that the prevailing in‑loss K3 implementation yields a biased gradient estimate. This bias not only destabilizes training but also harms the model’s ability to generalize.

The authors propose a simple fix: move the KL term back into the reward and use the naïve log‑ratio estimator (K1). This unbiased configuration improves out‑of‑domain (OOD) performance by roughly 20%.

Mathematical Analysis

In the standard RLVR objective we aim to maximize reward R while penalizing KL divergence D_{KL}(π‖π_{ref}). The KL term is usually introduced as a reverse KL regularizer with coefficient β. Because the KL cannot be computed analytically in high‑dimensional sequence space, it is estimated by sampling.
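
Written out, the objective takes the form below, where x is a prompt from the training distribution, y a sampled response, and β the strength of the penalty:

\[
\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, R(x, y) \,\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]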

Two design dimensions arise:

Estimator: use the simple log‑ratio (K1) or the low‑variance approximation (K3) introduced by Schulman (both are sketched in code after this list).

Placement: subtract the KL from the reward (in‑reward) or add it to the loss (in‑loss).
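
For concreteness, here is a small numerical sketch of the two estimators on a toy categorical distribution (standalone PyTorch, purely illustrative). Both K1 and K3 average to the true reverse KL; they differ in variance and, as discussed next, in how they behave when differentiated inside the loss:

```python
import torch

torch.manual_seed(0)

# Toy "policy" and "reference" distributions over a small vocabulary.
pi  = torch.softmax(torch.randn(8), dim=-1)
ref = torch.softmax(torch.randn(8), dim=-1)

true_kl = (pi * (pi.log() - ref.log())).sum()   # exact reverse KL D_KL(pi || pi_ref)

# Monte-Carlo estimates from samples x ~ pi.
x = torch.multinomial(pi, num_samples=100_000, replacement=True)
log_ratio = pi.log()[x] - ref.log()[x]           # log pi(x) - log pi_ref(x)

k1 = log_ratio                                   # naive log-ratio estimator
k3 = torch.exp(-log_ratio) - 1 + log_ratio       # Schulman's low-variance estimator

print(true_kl.item(), k1.mean().item(), k3.mean().item())  # both means approximate true_kl
print(k1.var().item(), k3.var().item())                    # K3's spread is typically much smaller
```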

The analysis shows that the in‑loss K3 configuration introduces a systematic bias term, effectively optimizing a forward KL instead of the intended reverse KL, which produces mode‑covering rather than the desired mode‑seeking behavior.
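
To see where the bias comes from, here is a sketch of the standard argument (consistent with the paper's claim, not a reproduction of its full derivation). Write r(y) = π_ref(y)/π_θ(y) for a sampled response y ∼ π_θ. The in‑loss implementation differentiates the estimator K3 = r − 1 − log r only through π_θ(y), treating the sample itself as fixed, so the expected gradient is

\[
\mathbb{E}_{y \sim \pi_\theta}\big[ \nabla_\theta K_3 \big]
= \mathbb{E}_{y \sim \pi_\theta}\big[ \big(1 - r(y)\big)\, \nabla_\theta \log \pi_\theta(y) \big]
= \nabla_\theta\, D_{\mathrm{KL}}\!\big( \pi_{\mathrm{ref}} \,\Vert\, \pi_\theta \big),
\]

where the last equality uses \(\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta] = 0\). In expectation, the update follows the gradient of the forward KL rather than the intended reverse KL D_{KL}(π_θ‖π_{ref}).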

Gradient Bias Demonstrated

Figure 1 (below) visualizes the biased gradient of the K3‑in‑loss estimator compared to the unbiased K1‑in‑reward estimator.

[Figure 1: gradient bias of the K3‑in‑loss estimator compared to the unbiased K1‑in‑reward estimator]

A toy autoregressive model further confirms that K3‑in‑loss exhibits a noticeable bias, while K1‑in‑reward’s bias is near zero.

Large‑Scale Experiments

The authors fine‑tuned Qwen2.5‑7B and Llama‑3.1‑8B on the MATH benchmark and evaluated them on several OOD tasks (Physics, Chemistry, Biology). Results show:

Training stability: K3‑in‑reward causes exploding gradient variance and immediate collapse; K1‑in‑reward remains stable.

Generalization: K1‑in‑reward outperforms or matches K3‑in‑loss on all tasks, with an average ~19% relative gain on OOD tasks. For example, on Physics, K1‑in‑reward achieves 0.508 accuracy versus 0.429 for K3‑in‑loss.

Robustness in Asynchronous RL

In asynchronous architectures (e.g., DeepSeek’s GRPO), off‑policy lag can exacerbate instability. Experiments show that K1‑in‑reward remains robust, while configurations without KL or with K1‑in‑loss quickly diverge.

Why Unbiased Matters

Controlled ablations confirm that unbiased gradients are the primary driver of the performance gains: when the biased in‑loss K3 term is corrected to be unbiased, its performance immediately matches that of K1‑in‑reward.
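
One standard way to remove this kind of gradient bias, shown here as a generic single-sample sketch rather than the paper's exact correction, is to keep the low-variance value estimate but add the missing score-function term through a detached surrogate (sequence-level bookkeeping in a real trainer is more involved):

```python
import torch

def unbiased_kl_loss(logp: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """In-loss KL term whose gradient is an unbiased estimate of
    grad D_KL(pi_theta || pi_ref) when samples come from pi_theta.

    Differentiating K3 alone ignores how the sampling distribution depends
    on theta; the detached term k3 * (logp - logp.detach()) adds that part
    back without changing the reported value.
    """
    log_ratio = logp - logp_ref                               # log pi_theta - log pi_ref
    k3 = torch.exp(-log_ratio) - 1 + log_ratio                # low-variance value estimate
    surrogate = k3 + k3.detach() * (logp - logp.detach())     # same value, unbiased gradient
    return surrogate.mean()
```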

Entropy analysis further shows that K3‑in‑loss behaves like a forward KL regularizer (mode‑covering), whereas K1‑in‑reward retains the reverse KL’s mode‑seeking property, keeping the model confident while exploring high‑reward regions.

Practical Guidance

For practitioners using VeRL, OpenRLHF, or similar frameworks, the recommended configuration is to set the KL estimator to k1 and enable use_kl_in_reward. This simple change can yield a “free” performance boost without additional compute.
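
As an illustration only, such a setup might look like the following, expressed as a plain Python dict. Apart from use_kl_in_reward, which is named above, the key names and values are assumptions and will differ across frameworks and versions, so check your framework's documentation:

```python
# Illustrative sketch, not a real framework config: key names other than
# use_kl_in_reward are assumptions and vary across libraries/versions.
rl_config = {
    "algorithm": {
        "use_kl_in_reward": True,   # apply the KL penalty inside the reward
        "kl_estimator": "k1",       # plain log-ratio instead of the low-variance K3
        "kl_coef": 1e-3,            # KL coefficient beta (tune per setup)
    },
    "actor": {
        "use_kl_loss": False,       # no separate in-loss KL term
    },
}
```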

Conclusion

The study warns that blindly trusting default KL configurations can impair both stability and generalization. Moving the KL penalty back into the reward and using the unbiased K1 estimator is a simple change that yields more stable training and better‑generalizing models.

Tags: reinforcement learning, AI research, RLHF, LLM training, gradient bias, KL regularization, unbiased estimator

Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.