Why RL Fine‑Tuning Fails to Extend LLM Reasoning Limits: Entropy Collapse Explained

This article examines how reinforcement learning fine‑tuning influences large language model reasoning, revealing that RL primarily amplifies pre‑trained capabilities, suffers from entropy collapse, and fails to push the model’s reasoning boundary, supported by extensive experiments on scaling laws, entropy analysis, and mitigation techniques.

Data Party THU

01 RL Effect vs. Base Model

Recent work suggests that a model's ability is largely determined by its pre-training, with reinforcement learning (RL) acting mainly as an "amplifier" for certain behaviors. This article asks whether a similar guiding principle holds for RL-based reasoning models, and covers three topics:

Relationship between RL‑based reasoning model performance and the base model.

The "entropy collapse" problem in RL.

Whether the ability boundary of reasoning models exists and how to expand it.

1.1 Observation: Base Model Dominance

Empirical observations from several papers suggest that the model’s capability is set by the base model, and RL only magnifies specific patterns.

DeepSeek-R1 treats both the base model and RL as important for the model's ability boundary, noting that advancing beyond the current limits of intelligence may still require more powerful base models and larger-scale RL.

DeepSeek R1 results

Echo Chamber trained decoder‑only models (150M and 1B parameters) with PPO, GRPO, and Expert Iteration. Findings include:

RL fine-tuning quickly drives the output distribution toward a single dominant pattern, suppressing other modes.

Pass@1 accuracy improves on GSM8K, but Pass@64 drops, indicating reduced output diversity.

Model size influences the dominant output style: smaller models favor code‑like outputs, larger models favor natural language.

RL‑tuned models transfer improvements to unseen datasets (MATH‑500, AIME), showing some reasoning abilities generalize.

Limit of RLVR

To evaluate the boundary of reasoning ability, the authors use the pass@k metric, which samples the model k times per problem and counts the problem as solved if any sample is correct. Experiments across math, code, and visual reasoning benchmarks with models such as Qwen-2.5 and LLaMA-3.1, trained with reinforcement learning with verifiable rewards (RLVR) using algorithms such as GRPO and PPO, reveal:

RLVR models outperform base models at small k (e.g., k=1).

At larger k, base models achieve comparable or higher pass@k, indicating RLVR does not introduce new reasoning patterns.

Conclusion: RLVR in its current form cannot push LLMs beyond the base model’s reasoning ability; it improves performance for few samples but restricts exploration when many samples are allowed.
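The pass@k comparison above can be made concrete with a small sketch. It assumes the commonly used unbiased estimator, which draws n >= k samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the sample counts are illustrative assumptions, not figures from the papers discussed here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for a single problem, 64 samples per model.
# A sharpened policy that is right half of the time:
sharp_at_1 = pass_at_k(n=64, c=32, k=1)     # 0.5
# A diverse policy that is rarely right, but not never:
diverse_at_1 = pass_at_k(n=64, c=2, k=1)    # 0.03125
diverse_at_64 = pass_at_k(n=64, c=2, k=64)  # 1.0
```

With these made-up counts, the diverse policy looks far worse at k=1 yet still counts as solving the problem at k=64, which is the kind of gap the pass@k curves are meant to expose.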

02 Entropy Collapse: From SFT to RL

The article defines information entropy for a language and extends it to token-level entropy in LLMs. Token entropy at time step t is the entropy of the next-token distribution obtained by applying a temperature-scaled softmax to the logits. This leads to the notion of policy entropy, which quantifies the uncertainty of the policy given a prompt.

Comparing the supervised fine-tuning (SFT) loss with the RL loss shows that RL replaces the expectation over a fixed data distribution with one over trajectories sampled from the policy itself. In that sense, RL essentially mirrors the entropy reduction observed during SFT training.
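For reference, the two objectives being compared can be written in their standard textbook forms. This is a sketch under assumed notation (a dataset D of prompt-response pairs, a policy π_θ, and a per-token advantage A_t), not necessarily the notation of the original article.

```latex
% SFT: negative log-likelihood of reference responses drawn from the dataset
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}
    \Big[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\Big]

% RL: policy gradient over responses sampled from the current policy
\nabla_\theta J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}
    \Big[\sum_{t=1}^{|y|} A_t \,\nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\Big]
```

Under this reading, RL with only positive, equal advantages looks like SFT on the model's own samples, while negative advantages push probability away from sampled responses.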

2.1 Information Entropy, Policy Entropy, and Cross-Entropy Loss

Token entropy is derived from the model's logits, and policy entropy is the average token entropy over the responses sampled for the prompts in a dataset.
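The formulas themselves are not reproduced in this summary. As a minimal sketch, assuming the usual Shannon entropy of a temperature-scaled softmax, token entropy and its average over a response could be computed as follows. The function names and the toy logits are hypothetical.

```python
import math

def token_entropy(logits, temperature=1.0):
    """Shannon entropy (in nats) of softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(per_step_logits, temperature=1.0):
    """Average token entropy over the generation steps of one response."""
    entropies = [token_entropy(z, temperature) for z in per_step_logits]
    return sum(entropies) / len(entropies)

# Hypothetical logits over a 4-token vocabulary at two generation steps.
confident_step = [8.0, 0.0, 0.0, 0.0]  # one token dominates: entropy near 0
uncertain_step = [1.0, 1.0, 1.0, 1.0]  # uniform: entropy equals log(4)
avg = mean_token_entropy([confident_step, uncertain_step])
```

Averaging this quantity over responses sampled for many prompts gives an estimate of the policy entropy tracked during training.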

2.2 RL Entropy‑Collapse Mechanism

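The authors show that the step-to-step change in policy entropy is approximately the negative covariance between an action's log-probability and the change in its logit, which under policy-gradient updates is proportional to the advantage. When high-probability actions also receive high advantages, this covariance is positive and policy entropy falls. This explains the rapid entropy drop early in RL training.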
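The covariance relation can be checked numerically on a toy softmax policy. The sketch below assumes a single state with four actions and a small logit update proportional to the advantage. All numbers and names are hypothetical, and the relation is only a first-order approximation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def covariance(probs, xs, ys):
    """Covariance of xs and ys under the distribution probs."""
    mean_x = sum(p * x for p, x in zip(probs, xs))
    mean_y = sum(p * y for p, y in zip(probs, ys))
    return sum(p * (x - mean_x) * (y - mean_y) for p, x, y in zip(probs, xs, ys))

logits = [2.0, 1.0, 0.0, -1.0]
advantages = [1.0, 0.5, -0.5, -1.0]  # likelier actions also score higher
step_size = 1e-3                     # assumed small update: dz = step_size * A

probs = softmax(logits)
delta = [step_size * a for a in advantages]
new_probs = softmax([z + d for z, d in zip(logits, delta)])

# Entropy change measured directly versus the covariance prediction.
actual_change = entropy(new_probs) - entropy(probs)
predicted_change = -covariance(probs, [math.log(p) for p in probs], delta)
```

Both quantities come out negative and nearly equal here: raising the logits of already-likely actions sharpens the distribution.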

2.3 Connecting RL and SFT

RL and SFT differ in two key respects: the use of negative samples and sample diversity. The paper "Bridging Supervised Learning and Reinforcement Learning in Math Reasoning" argues that SFT augmented with abundant negative samples can become equivalent to RL.

03 Entropy Collapse Handling Methods

3.1 Exploration‑Exploitation Dilemma

If entropy collapses unchecked, the model over‑exploits a narrow mode, limiting capability. Excessive entropy preservation leads to over‑exploration and training instability.

3.2 Intervention Techniques

Clip-Higher (DAPO) raises the upper clipping bound of the importance ratio, leaving more room for the probabilities of low-probability tokens to increase and thereby encouraging exploration.

Clip‑Higher illustration
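A minimal sketch of the decoupled clipping idea follows, assuming a PPO-style per-token surrogate. The default bounds 0.2 and 0.28 and the function name are assumptions chosen for illustration and should be checked against the DAPO paper before reuse.

```python
def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper clip ranges."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Pessimistic bound, as in PPO: take the smaller of the two surrogates.
    return min(ratio * advantage, clipped_ratio * advantage)

# A token whose probability grew by 25% and that has a positive advantage.
ratio, advantage = 1.25, 1.0
symmetric = clip_higher_objective(ratio, advantage, eps_high=0.2)  # clipped at 1.2
decoupled = clip_higher_objective(ratio, advantage)                # still 1.25
```

With the symmetric bound the token has already hit the ceiling and stops contributing gradient, while the higher upper bound lets it keep growing a little longer.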

Clip-Cov and KL-Cov target tokens with a high covariance between log-probability and advantage, reducing their contribution to the policy gradient: Clip-Cov clips a small subset of these tokens out of the gradient update, while KL-Cov applies a KL penalty to them.

Covariance distribution
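As a rough sketch of how such tokens might be identified, the code below scores each token by the centered product of its log-probability and advantage and flags the highest-scoring fraction. The function names, the 20% fraction, and the toy values are hypothetical, and the selection rule is a simplification rather than the exact procedure of either method.

```python
def token_covariances(log_probs, advantages):
    """Centered product of log-probability and advantage for each token."""
    n = len(log_probs)
    mean_lp = sum(log_probs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(log_probs, advantages)]

def top_covariance_indices(log_probs, advantages, fraction=0.2):
    """Indices of the tokens with the largest covariance contribution."""
    covs = token_covariances(log_probs, advantages)
    k = max(1, int(round(fraction * len(covs))))
    ranked = sorted(range(len(covs)), key=lambda i: covs[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical batch of five tokens.
log_probs = [-0.05, -0.10, -1.50, -1.20, -0.70]
advantages = [1.5, 0.8, -0.2, 0.1, 0.0]
flagged = top_covariance_indices(log_probs, advantages, fraction=0.2)
```

The flagged indices could then be excluded from the gradient (in the spirit of Clip-Cov) or given an extra KL penalty (in the spirit of KL-Cov).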

On‑Policy Training (Optimal Reward Baseline) removes the need for a rollout buffer and derives an optimal baseline that minimizes gradient variance.

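import torch
# id2score / id2len: per-prompt lists of response rewards and response lengths
for idx in id2score:
    score_tensor = torch.tensor(id2score[idx])
    len_tensor = torch.tensor(id2len[idx])
    # Baseline = length-weighted mean reward over the prompt's responses
    id2bsl[idx] = (len_tensor * score_tensor).sum() / len_tensor.sum()
for i in range(bsz):
    # Subtract the baseline of response i's prompt (index[i]) from its score
    scores[i] = scores[i] - id2bsl[index[i]]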

3.3 Token‑Level Interventions

High‑entropy "fork" tokens act as decision points in chain‑of‑thought reasoning. Adjusting their entropy via methods like Clip‑Cov or entropy‑shaped advantage improves exploration without destabilizing training.

High‑entropy token analysis
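To illustrate how fork tokens might be singled out, the sketch below marks the highest-entropy positions in a response. The 20% cutoff is an assumption suggested by the title of the "Beyond the 80/20 Rule" reference. The function name and entropy values are hypothetical.

```python
def high_entropy_mask(entropies, top_fraction=0.2):
    """Boolean mask for the top_fraction highest-entropy positions (ties included)."""
    n = len(entropies)
    k = max(1, int(round(top_fraction * n)))
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [h >= threshold for h in entropies]

# Hypothetical per-token entropies (nats) along one chain-of-thought response.
entropies = [0.02, 0.01, 1.35, 0.05, 0.03, 0.90, 0.04, 0.02, 0.01, 0.06]
mask = high_entropy_mask(entropies)
fork_positions = [i for i, keep in enumerate(mask) if keep]  # [2, 5]
```

Such a mask could restrict an intervention to the few positions where the model is actually choosing between continuations.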

3.4 Shaping Advantage with Entropy

By adding an entropy‑based term to the advantage, the update magnitude can be tuned without altering the gradient flow:

# a: bonus scale, k: entropy threshold; detach() keeps the bonus out of the gradient
adv = adv + a * torch.clamp(entropy.detach() - k, min=0)

This one-line modification integrates into existing RL pipelines and is reported to maintain higher entropy throughout training while improving performance.
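A small worked example of the same shaping rule in plain Python may help. The scale a=0.4, the threshold k=0.5, and the toy values are hypothetical, and entropies are plain floats here, so they act as constants exactly as a detached tensor would.

```python
def shape_advantages(advantages, entropies, a=0.4, k=0.5):
    """Add a * max(entropy - k, 0) to each token's advantage."""
    return [adv + a * max(h - k, 0.0) for adv, h in zip(advantages, entropies)]

advantages = [1.0, 1.0, -1.0]
entropies = [0.1, 1.5, 2.0]  # hypothetical per-token entropies in nats
shaped = shape_advantages(advantages, entropies)
```

Tokens below the threshold keep their original advantage, while high-entropy tokens receive a bonus that grows with their entropy.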

Conclusion

The article argues that RL for LLM reasoning tends to suffer from entropy collapse, which limits its ability to surpass the base model. Mitigation strategies such as a higher clipping bound, covariance-based clipping and KL penalties, on-policy training with an optimal baseline, and entropy-shaped advantages can alleviate the collapse, but substantial gains still require a deeper understanding of the underlying dynamics.

References

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? https://arxiv.org/abs/2504.13837

Rethinking Reflection in Pre‑Training https://arxiv.org/abs/2504.04022

Echo Chamber: RL Post‑training Amplifies Behaviors Learned in Pretraining https://arxiv.org/abs/2504.07912

DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models https://arxiv.org/pdf/2505.22617

Beyond the 80/20 Rule: High‑Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning https://arxiv.org/abs/2506.01939

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning https://arxiv.org/abs/2506.01347

Reasoning with Exploration: An Entropy Perspective https://arxiv.org/abs/2506.14758

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
