Why RL Fine‑Tuning Fails to Extend LLM Reasoning Limits: Entropy Collapse Explained
This article examines how reinforcement learning fine‑tuning influences large language model reasoning, revealing that RL primarily amplifies pre‑trained capabilities, suffers from entropy collapse, and fails to push the model’s reasoning boundary, supported by extensive experiments on scaling laws, entropy analysis, and mitigation techniques.
01 RL Effect vs. Base Model
Recent work suggests that a model’s ability is largely determined by its pre‑training, and that reinforcement learning (RL) acts mainly as an "amplifier" for certain behaviors. This article asks whether a similar guiding principle holds for RL‑based reasoning models, and covers three topics:
Relationship between RL‑based reasoning model performance and the base model.
The "entropy collapse" problem in RL.
Whether the ability boundary of reasoning models exists and how to expand it.
1.1 Observation: Base Model Dominance
Empirical observations from several papers suggest that the model’s capability is set by the base model, and RL only magnifies specific patterns.
DeepSeek R1 argues that both the base model and RL matter for the model’s ability boundary, stating that advancing beyond the current boundary of intelligence may still require more powerful base models and larger‑scale RL.
Echo Chamber trained decoder‑only models (150M and 1B parameters) with PPO, GRPO, and Expert Iteration. Findings include:
RL fine‑tuning quickly converges the output distribution to a specific pattern, suppressing other modes.
Pass@1 accuracy improves on GSM8K, but Pass@64 drops, indicating reduced output diversity.
Model size influences the dominant output style: smaller models favor code‑like outputs, larger models favor natural language.
RL‑tuned models transfer improvements to unseen datasets (MATH‑500, AIME), showing some reasoning abilities generalize.
1.2 Limit of RLVR
To evaluate the reasoning ability boundary of reinforcement learning with verifiable rewards (RLVR), the authors of "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" use the pass@k metric, which samples the model k times and counts a problem as solved if any sample is correct. Experiments across math, code, and visual reasoning benchmarks with models such as Qwen‑2.5 and LLaMA‑3.1, using RL algorithms like GRPO and PPO, reveal:
RLVR models outperform base models at small k (e.g., k=1).
At larger k, base models achieve comparable or higher pass@k, indicating RLVR does not introduce new reasoning patterns.
Conclusion: RLVR in its current form cannot push LLMs beyond the base model’s reasoning ability; it improves performance for few samples but restricts exploration when many samples are allowed.
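The pass@k comparison above can be made concrete with a small sketch. It uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The function names and the per-problem counts below are hypothetical and chosen only to illustrate the crossover between small and large k; they are not results from the cited papers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct answer. math.comb returns 0 when
    # k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts of correct samples out of n = 64 on four problems.
narrow_counts = [64, 64, 0, 0]    # always right on two problems, never on the rest
diverse_counts = [16, 16, 4, 4]   # sometimes right on every problem

narrow_at_1 = benchmark_pass_at_k(narrow_counts, 64, 1)
diverse_at_1 = benchmark_pass_at_k(diverse_counts, 64, 1)
narrow_at_64 = benchmark_pass_at_k(narrow_counts, 64, 64)
diverse_at_64 = benchmark_pass_at_k(diverse_counts, 64, 64)
```

With these made-up counts the narrow model wins at k=1 while the diverse model wins at k=64, which is the qualitative pattern reported for RLVR models versus their base models.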
02 Entropy Collapse: From SFT to RL
The article defines information entropy for a language and extends it to token‑level entropy in LLMs. Token entropy at time step t is the entropy of the next‑token distribution, obtained by applying a temperature‑scaled softmax to the logits. This leads to the notion of policy entropy, which quantifies the uncertainty of the policy given a prompt.
Comparing the supervised fine‑tuning (SFT) loss with the RL loss shows that RL replaces the expectation over a fixed data distribution with an expectation over trajectories sampled from the policy itself. Both objectives concentrate probability mass on preferred outputs, so RL essentially mirrors the entropy reduction observed during SFT training.
2.1 Information Entropy, Policy Entropy, and Cross‑Entropy Loss
Formulas illustrate how token entropy is derived from model logits and how policy entropy relates to the average token entropy over a dataset.
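As a minimal sketch of these definitions, the code below computes token entropy H_t = -sum_i p_i log p_i with p = softmax(logits / T), and averages it over generated positions as a simple stand-in for the policy-level quantity. The function names are hypothetical, and averaging over a single response rather than a whole dataset is a simplifying assumption.

```python
import math

def token_entropy(logits, temperature=1.0):
    """Entropy in nats of softmax(logits / temperature) for one position."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(per_step_logits, temperature=1.0):
    """Average token entropy over the positions of one generated response."""
    entropies = [token_entropy(l, temperature) for l in per_step_logits]
    return sum(entropies) / len(entropies)

# Toy logits over a 4-token vocabulary at two positions (hypothetical values).
confident_step = [5.0, 0.0, 0.0, 0.0]   # one token dominates: low entropy
uncertain_step = [0.0, 0.0, 0.0, 0.0]   # uniform: maximal entropy, log(4)
avg = mean_token_entropy([confident_step, uncertain_step])
```

Raising the temperature flattens the distribution and therefore raises the entropy of any non-uniform position, which is why the temperature appears in the definition.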
2.2 RL Entropy‑Collapse Mechanism
The authors show that the change in policy entropy is governed by the covariance between an action’s log‑probability and its advantage: when high‑probability actions also receive high advantages, their logits rise and policy entropy falls. This explains the rapid entropy drop early in RL training.
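The covariance argument can be checked numerically on a toy softmax policy. To first order, a small logit change delta alters the entropy by -Cov_{a~pi}(log pi(a), delta_a). The sketch below compares that prediction with the exact entropy change. Treating delta as a stand-in for the advantage-driven update is an assumption made for illustration, and all names and numbers are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def predicted_entropy_change(logits, delta):
    """First-order prediction: -Cov under pi of (log pi(a), delta_a)."""
    p = softmax(logits)
    logp = [math.log(x) for x in p]
    mean_logp = sum(pi * lp for pi, lp in zip(p, logp))
    mean_delta = sum(pi * d for pi, d in zip(p, delta))
    cov = sum(pi * (lp - mean_logp) * (d - mean_delta)
              for pi, lp, d in zip(p, logp, delta))
    return -cov

logits = [2.0, 1.0, 0.0, -1.0]
step = 1e-3
# Raise the logit of the most likely action and lower the least likely one.
delta = [step, 0.0, 0.0, -0.5 * step]

before = entropy(softmax(logits))
after = entropy(softmax([z + d for z, d in zip(logits, delta)]))
actual = after - before
predicted = predicted_entropy_change(logits, delta)

# Boosting only the least likely action has the opposite sign.
explore_delta = [0.0, 0.0, 0.0, step]
explore_predicted = predicted_entropy_change(logits, explore_delta)
```

Here the update that favors the already likely action gives a negative predicted change that closely matches the exact one, while boosting the unlikely action gives a positive change.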
2.3 Connecting RL and SFT
Two key differences between RL and SFT are the use of negative samples and sample diversity. By augmenting SFT with abundant negative samples, the authors argue SFT can become equivalent to RL, as demonstrated in the "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning" paper.
03 Entropy Collapse Handling Methods
3.1 Exploration‑Exploitation Dilemma
If entropy collapses unchecked, the model over‑exploits a narrow mode, limiting capability. Excessive entropy preservation leads to over‑exploration and training instability.
3.2 Intervention Techniques
Clip‑Higher (DAPO) raises the upper clipping bound, allowing low‑probability tokens to increase, encouraging exploration.
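A minimal per-token sketch of the Clip‑Higher idea follows: the PPO-style clipped surrogate with separate lower and upper bounds. The function name is hypothetical, and the defaults 0.2 and 0.28 are the values commonly reported for DAPO, used here as an assumption rather than a recommendation.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective for one token with decoupled clip bounds.

    ratio is pi_new(token) / pi_old(token). A larger eps_high lets tokens
    with positive advantage grow further before the clip removes their gradient.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

# A token whose probability has grown by 25% and that has a positive advantage.
symmetric = clipped_surrogate(1.25, 1.0, eps_low=0.2, eps_high=0.2)
clip_higher = clipped_surrogate(1.25, 1.0)
```

With a symmetric bound of 0.2 the ratio 1.25 is already clipped to 1.2, so the token stops contributing gradient. With the raised upper bound it is still inside the trust region and keeps being reinforced.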
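Clip‑Cov and KL‑Cov target the small set of tokens with the highest covariance between token log‑probability and advantage. Clip‑Cov excludes a subset of these tokens from the policy gradient, while KL‑Cov applies a KL penalty to them, in both cases reducing their contribution to the update.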
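On‑Policy Training (Optimal Reward Baseline) removes the need for a rollout buffer and derives an optimal baseline that minimizes gradient variance. In the snippet below, the baseline for each prompt group is the length‑weighted mean of its response scores, which is then subtracted from every score in that group:
import torch
# id2score / id2len map a prompt id to the scores / token lengths of its sampled responses.
for idx in id2score:
    score_tensor = torch.tensor(id2score[idx])
    len_tensor = torch.tensor(id2len[idx])
    # Length-weighted mean score of the group serves as its baseline.
    id2bsl[idx] = (len_tensor * score_tensor).sum() / len_tensor.sum()
# Subtract each sample's group baseline from its score.
for i in range(bsz):
    scores[i] = scores[i] - id2bsl[index[i]]
3.3 Token‑Level Interventions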
High‑entropy "fork" tokens act as decision points in chain‑of‑thought reasoning. Adjusting their entropy via methods like Clip‑Cov or entropy‑shaped advantage improves exploration without destabilizing training.
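One way to locate such fork tokens is to rank positions by token entropy and keep the top fraction. The sketch below does this for a single response. The function name is hypothetical, and the 20% default is an assumption that echoes the title of the "Beyond the 80/20 Rule" reference rather than a tuned value.

```python
def high_entropy_mask(entropies, top_fraction=0.2):
    """Return a boolean mask marking the top_fraction highest-entropy positions."""
    if not entropies:
        return []
    n_keep = max(1, round(top_fraction * len(entropies)))
    # Indices sorted from highest to lowest entropy.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

# Hypothetical per-token entropies for a 10-token response.
entropies = [0.02, 1.35, 0.01, 0.05, 0.90, 0.03, 0.02, 0.01, 0.04, 0.02]
mask = high_entropy_mask(entropies)
fork_positions = [i for i, is_fork in enumerate(mask) if is_fork]
```

A mask like this could then be used to restrict an intervention, such as an entropy bonus or a clipping rule, to the decision points while leaving low-entropy tokens untouched.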
3.4 Shaping Advantage with Entropy
By adding an entropy‑based term to the advantage, the update magnitude can be tuned without altering the gradient flow. Here a is a scaling coefficient and k is an entropy threshold below which no bonus is added:
adv = adv + a * torch.clamp(entropy - k, min=0)
This simple one‑line modification integrates seamlessly into existing RL pipelines and has been shown to maintain higher entropy throughout training while improving performance.
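To see what the one-liner does per token, here is a plain-Python sketch of the same formula as written in this article. The function name and the values of a and k are hypothetical. In a torch pipeline the entropy would presumably be detached so that it only rescales the advantage, which is an assumption here.

```python
def shape_advantages(advantages, entropies, a=0.4, k=0.5):
    """Apply adv + a * max(entropy - k, 0) to each token."""
    return [adv + a * max(h - k, 0.0) for adv, h in zip(advantages, entropies)]

# Three tokens with the same base advantage but different entropies.
advantages = [1.0, 1.0, 1.0]
entropies = [0.1, 0.5, 1.5]
shaped = shape_advantages(advantages, entropies)
```

Only the token whose entropy exceeds the threshold k receives a bonus, so high-entropy decision points get a larger update while confident tokens keep their original advantage.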
Conclusion
The article demonstrates that RL for LLM reasoning is prone to entropy collapse, which limits its ability to surpass the base model. Various mitigation strategies (raised clipping bounds, covariance‑based clipping and KL penalties, on‑policy training with an optimal baseline, and entropy‑shaped advantages) can alleviate collapse, but substantial gains still require a deeper understanding of the underlying dynamics.
References
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? https://arxiv.org/abs/2504.13837
Rethinking Reflection in Pre‑Training https://arxiv.org/abs/2504.04022
Echo Chamber: RL Post‑training Amplifies Behaviors Learned in Pretraining https://arxiv.org/abs/2504.07912
DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models https://arxiv.org/pdf/2505.22617
Beyond the 80/20 Rule: High‑Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning https://arxiv.org/abs/2506.01939
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning https://arxiv.org/abs/2506.01347
Reasoning with Exploration: An Entropy Perspective https://arxiv.org/abs/2506.14758
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
