Why Entropy Collapse Limits LLM Reinforcement Learning and How to Fix It

The article explains how information entropy, cross‑entropy, and KL‑divergence shape reinforcement learning for large language models, describes the phenomenon of entropy collapse, compares token‑level and policy‑level entropy, and reviews recent methods like Clip‑Cov and KL‑Cov that mitigate this issue.


Information Entropy, Cross‑Entropy, and KL‑Divergence

For a discrete random variable X with probability mass function p(x), the Shannon entropy is

H(X) = - \sum_x p(x) \log p(x)

It quantifies the expected "surprise" of the distribution. In language models, the token-level entropy at generation step t measures how confident the model is: low entropy indicates a peaked distribution, while high entropy indicates many plausible continuations.

Cross-entropy keeps the true probability p(x) as the weight but replaces it inside the logarithm with the model's predicted probability q(x):

H(P, Q) = - \sum_x p(x) \log q(x)

The excess cost over the optimal encoding is the Kullback-Leibler (KL) divergence:

D_{KL}(P\|Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Minimizing cross-entropy in Q is equivalent to minimizing D_{KL}(P\|Q), since H(P, Q) = H(P) + D_{KL}(P\|Q) and the true entropy H(P) is a constant.
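As a quick sanity check, this decomposition can be verified numerically. The two toy distributions below are illustrative, not taken from the article:

```python
# Numerical check of the identity H(P, Q) = H(P) + D_KL(P || Q).
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

entropy = -np.sum(p * np.log(p))            # H(P)
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
kl = np.sum(p * np.log(p / q))              # D_KL(P || Q)

print(f"H(P)       = {entropy:.4f}")
print(f"H(P, Q)    = {cross_entropy:.4f}")
print(f"D_KL(P||Q) = {kl:.4f}")
assert np.isclose(cross_entropy, entropy + kl)  # the decomposition holds
```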

Policy Entropy vs. Token‑Level Entropy

Two recent arXiv papers (2505.22617 and 2506.01939) study entropy at different scales:

Policy entropy (macro level) – the overall entropy of the RL policy distribution during training.

Token‑level entropy (micro level) – the entropy of the token probability distribution, used as a diagnostic to locate high‑impact “forking” tokens in reasoning chains.
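As a rough sketch of the micro-level diagnostic, per-step entropies can be computed from a model's logits and thresholded to flag candidate forking tokens. The random logits and the 0.8 quantile cutoff below are stand-in assumptions, not values from the papers:

```python
# Compute token-level entropy at each generation step from logits,
# then flag the highest-entropy steps as candidate "forking" tokens.
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) -> per-step entropy, shape (seq_len,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

logits = torch.randn(128, 32000)      # stand-in for real model outputs
ent = token_entropies(logits)
threshold = torch.quantile(ent, 0.8)  # keep the top ~20% by entropy
forking_mask = ent >= threshold       # candidate "forking" tokens
print(forking_mask.sum().item(), "high-entropy tokens out of", len(ent))
```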

Entropy Collapse in Reinforcement Learning

During early RL fine-tuning, policy entropy can drop sharply, a phenomenon called entropy collapse. The model becomes over-confident in a few high-probability actions, suppressing exploration and causing performance to saturate. Empirically, reward gains are exchanged for entropy consumption; once entropy is exhausted, further reward improvement ceases. The collapse is linked to a positive covariance between action probabilities and their advantage values: high-probability, high-advantage actions are reinforced, which further increases their probability and reduces overall entropy.
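A minimal sketch of this covariance diagnostic, assuming per-token log-probabilities and advantage estimates collected from rollouts (the random tensors below are stand-ins for real rollout statistics):

```python
# Per-token covariance between the policy's log-probability and the
# advantage estimate. A positive mean means confident tokens are being
# reinforced further, which pushes policy entropy down.
import torch

def logprob_advantage_cov(log_probs: torch.Tensor,
                          advantages: torch.Tensor) -> torch.Tensor:
    """Per-token contributions whose mean is Cov(log pi, A)."""
    centered_lp = log_probs - log_probs.mean()
    centered_adv = advantages - advantages.mean()
    return centered_lp * centered_adv

log_probs = torch.randn(4096).abs().neg()  # fake per-token log-probs (<= 0)
advantages = torch.randn(4096)             # fake per-token advantages
cov_terms = logprob_advantage_cov(log_probs, advantages)
print("mean covariance:", cov_terms.mean().item())
```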

Mitigation Techniques

Clip-Cov: randomly selects a small subset of tokens with positive covariance and blocks their gradient updates, preventing them from dominating the distribution.

KL-Cov: adds a KL-regularization term on the top-k tokens with the highest covariance, encouraging the model to retain diversity.
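Below is a minimal sketch of both ideas as described above, not the papers' exact implementations; the clip fraction, top-k size, and KL coefficient are illustrative assumptions:

```python
# Sketches of Clip-Cov and KL-Cov, reusing the per-token covariance
# terms from the diagnostic above. Hyperparameters are illustrative.
import torch

def clip_cov_mask(cov_terms: torch.Tensor, frac: float = 0.002) -> torch.Tensor:
    """Clip-Cov: randomly pick a small fraction of positive-covariance
    tokens and mask them out of the policy-gradient update."""
    positive = (cov_terms > 0).nonzero(as_tuple=True)[0]
    n_clip = max(1, int(frac * cov_terms.numel()))
    clipped = positive[torch.randperm(len(positive))[:n_clip]]
    mask = torch.ones_like(cov_terms)
    mask[clipped] = 0.0  # these tokens receive no gradient
    return mask

def kl_cov_penalty(log_probs, ref_log_probs, cov_terms, k=64, beta=1.0):
    """KL-Cov: penalize KL to a reference policy only on the top-k
    covariance tokens (k assumed <= number of tokens)."""
    topk = cov_terms.topk(k).indices
    kl = log_probs[topk] - ref_log_probs[topk]  # per-token KL estimate
    return beta * kl.mean()
```

In practice the Clip-Cov mask would multiply the per-token policy-gradient loss, while the KL-Cov penalty would be added to the total training loss.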

[Figure: Illustration of entropy collapse during RL training]

Experimental Comparison on Qwen‑Series Models

Three training strategies were evaluated:

Full update: standard RL, applying gradient updates to all tokens.

High-entropy-only update: compute and apply gradients only for the ~20% of tokens with the highest entropy (see the sketch after this list).

Low-entropy-only update: compute and apply gradients only for the remaining ~80% of tokens with the lowest entropy.
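A minimal sketch of the high-entropy-only update, assuming a per-token policy-gradient loss and precomputed per-token entropies; the 20% cutoff follows the article, everything else is illustrative:

```python
# Mask the per-token policy-gradient loss so that only the top ~20%
# of tokens by entropy receive gradients.
import torch

def masked_pg_loss(per_token_loss: torch.Tensor,
                   entropies: torch.Tensor,
                   keep_frac: float = 0.2) -> torch.Tensor:
    """Keep gradients only for the highest-entropy tokens."""
    threshold = torch.quantile(entropies, 1.0 - keep_frac)
    mask = (entropies >= threshold).float()
    # Average only over the kept tokens so the loss scale stays comparable.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```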

Results:

Updating only high-entropy tokens achieves performance comparable to, or better than, full updates.

Updating only low‑entropy tokens leads to a steep performance drop.

[Figure: Performance comparison of the different update strategies]

Key Takeaways

RL for reasoning primarily improves the ~20% high-entropy "critical minority" tokens; the remaining low-entropy tokens are already well-learned during supervised fine-tuning.

Uncontrolled RL leads to entropy collapse, causing early saturation and loss of exploration.

Techniques that limit updates on high‑covariance tokens (Clip‑Cov, KL‑Cov) preserve entropy and maintain performance.

Balancing diversity (entropy) and accuracy is essential: controlling entropy during RL prevents collapse and enables large language models to make better decisions at uncertain, high-impact points.

Relevant papers: https://arxiv.org/abs/2505.22617 and https://arxiv.org/abs/2506.01939

Tags: entropy, cross-entropy, policy entropy, token entropy
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.