Data Party THU
Jan 7, 2026 · Artificial Intelligence
Why the Common KL Penalty in LLM RL Training Is Biased—and How to Fix It
A recent study reveals that the widely used KL regularization term in reinforcement learning with verifiable rewards (RLVR) for LLMs is mathematically biased, leading to unstable training and weaker generalization. The study shows that moving the KL term back into the reward with the simple K1 estimator can boost out-of-domain performance by up to 20%.
AI research · KL regularization · LLM training
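Concretely, the proposed fix amounts to folding a per-token K1 estimate of KL(π_θ ‖ π_ref) into the scalar reward, rather than adding a separate KL term to the loss. Below is a minimal PyTorch sketch of that reward shaping; the function names, the per-token tensor layout, and the `beta` coefficient are illustrative assumptions, not taken from the study.

```python
import torch


def k1_kl_penalty(logprobs_policy: torch.Tensor,
                  logprobs_ref: torch.Tensor) -> torch.Tensor:
    """Per-token K1 estimator of KL(pi_theta || pi_ref).

    For actions sampled from pi_theta, the quantity
    log pi_theta(a) - log pi_ref(a) is an unbiased
    single-sample estimate of the KL divergence.
    """
    return logprobs_policy - logprobs_ref


def shape_rewards(rewards: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_ref: torch.Tensor,
                  beta: float = 0.05) -> torch.Tensor:
    """Fold the KL penalty into the reward (the "KL in reward"
    formulation) instead of adding a KL term to the loss:

        r'_t = r_t - beta * (log pi_theta(a_t) - log pi_ref(a_t))

    The shaped reward then flows through the usual advantage and
    policy-gradient machinery unchanged, so the penalty is treated
    as a constant with respect to the policy parameters (detached).
    """
    kl = k1_kl_penalty(logprobs_policy, logprobs_ref).detach()
    return rewards - beta * kl
```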
