Why Post‑Training Makes Large Reasoning Models Overconfident and How LED Restores Exploration

The paper shows that reinforcement-learning post-training collapses the entropy of the final-layer output distribution in large reasoning models, so raising the sampling temperature no longer produces more diverse samples. It introduces Latent Exploration Decoding (LED), which recovers exploration from the distributions of intermediate layers and yields consistent pass@k gains without extra training.

Machine Heart

Problem Discovery: Entropy Collapse in the Final Layer after RL Post‑Training

Recent large reasoning models (LRMs) such as OpenAI o1, DeepSeek-R1, Qwen3, and MiMo achieve strong results, largely thanks to two engines: long-chain reasoning inside <think>...</think> blocks, and reinforcement-learning post-training (e.g., GRPO) that raises pass@1. However, increasing the decoding temperature no longer improves pass@k for these models. They become more "confident" in any single sample but lose the ability to discover new solutions across multiple attempts, which lowers their true capability ceiling.
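To make the gap between pass@1 and pass@k concrete, here is a minimal sketch using the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The source does not say which estimator the authors used, so treat this choice as an assumption. The per-problem counts are hypothetical and only illustrate how two models with the same pass@1 can differ sharply at pass@16.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of 16 samples on two problems (not from the paper).
overconfident = [16, 0]  # always solves problem 1, never solves problem 2
exploratory = [8, 8]     # solves each problem half of the time

for name, counts in (("overconfident", overconfident), ("exploratory", exploratory)):
    print(name, mean_pass_at_k(counts, 16, 1), mean_pass_at_k(counts, 16, 16))
```

Both hypothetical models have the same pass@1, but only the one that spreads its attempts can solve both problems when given 16 tries.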

Key Observation: Entropy Preserved in Intermediate Layers

For early-generation LLMs, accuracy rises with temperature, giving a positive accuracy-temperature slope (alpha). For the latest LRMs, alpha approaches zero or turns negative, which indicates that temperature no longer stimulates exploration. The authors' explanation is that GRPO rewards only the correctness of the whole output, which pushes a few critical token branches toward near-one-hot distributions. The result is entropy collapse in the final layer, while intermediate layers retain higher entropy because of the Transformer's residual connections. This hidden entropy acts as a reservoir of potential exploration.
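The effect of a near-one-hot distribution on temperature sampling can be checked numerically. The sketch below uses hypothetical logits (not taken from the paper) to compare how the Shannon entropy of a peaked and a smoother softmax distribution responds to the same temperatures.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    p = np.asarray(p, dtype=np.float64)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())


# Hypothetical logits: one dominant token versus a smoother spread.
peaked_logits = np.array([12.0, 2.0, 1.0, 0.0])
smooth_logits = np.array([2.0, 1.5, 1.0, 0.5])

for t in (0.6, 1.0, 1.4):
    h_peaked = entropy(softmax(peaked_logits, t))
    h_smooth = entropy(softmax(smooth_logits, t))
    print(f"T={t}: peaked={h_peaked:.4f} nats, smooth={h_smooth:.4f} nats")
```

When one logit dominates by a wide margin, a moderate temperature increase leaves the entropy close to zero, so sampling keeps returning the same token.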

Proposed Method: Latent Exploration Decoding (LED)

LED restores exploration without additional training by drawing on the probability distributions of intermediate layers. It addresses three challenges:

Avoiding noise in the vocabulary: Top-k coverage analysis shows that the final layer's top-1 probability exceeds 90% and its cumulative top-2 probability exceeds 99%, while intermediate layers spread probability more evenly over these same candidates. LED therefore restricts exploration to the top-k tokens approved by the final layer.

Cross-layer information aggregation: Instead of assigning a weight to each layer, LED cumulatively sums the layer distributions starting from the final layer, computes the entropy of each cumulative distribution, and selects the one with the highest entropy as the exploration distribution. This removes the need for per-layer hyper-parameters.

Balancing exploration and exploitation: The final-layer top-1 probability serves as a confidence measure. When confidence is high, LED follows conventional decoding (exploitation); when it is lower, LED samples from the latent exploration distribution (exploration). The switch is driven by the confidence value itself, so no hand-tuned threshold is required.

LED applies this mechanism only during the reasoning (thinking) phase and automatically reverts to standard decoding during answer generation, so the final answer is not disturbed.
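The three steps and the thinking-phase restriction can be sketched as a single decoding step. This is an illustrative reading of the description above, not the authors' implementation. It assumes that a per-layer next-token distribution is already available as one row per layer with the final layer last (how those rows are obtained is not specified in this summary), and that the threshold-free switch is a random draw that exploits with probability equal to the final-layer top-1 probability. The names `led_exploration_distribution`, `led_step`, and the example `layer_probs` values are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())


def led_exploration_distribution(layer_probs, top_k=2):
    """Return (candidate_ids, probs) for the highest-entropy cumulative distribution."""
    layer_probs = np.asarray(layer_probs, dtype=np.float64)
    final = layer_probs[-1]
    # Step 1: keep only the final layer's top-k tokens as candidates.
    candidates = np.argsort(final)[::-1][:top_k]
    restricted = layer_probs[:, candidates] + 1e-12  # guard against all-zero rows
    restricted = restricted / restricted.sum(axis=1, keepdims=True)
    # Step 2: cumulative sums that start at the final layer and add earlier layers.
    running = np.cumsum(restricted[::-1], axis=0)
    running = running / running.sum(axis=1, keepdims=True)
    # Pick the cumulative distribution with the highest entropy.
    best = max(running, key=entropy)
    return candidates, best


def led_step(layer_probs, in_thinking, rng, top_k=2):
    """Sample one token id; explore only while thinking (assumed gating rule)."""
    layer_probs = np.asarray(layer_probs, dtype=np.float64)
    final = layer_probs[-1] / layer_probs[-1].sum()
    confidence = final.max()
    # Step 3 (assumption): exploit with probability equal to the top-1 confidence.
    if not in_thinking or rng.random() < confidence:
        return int(rng.choice(final.size, p=final))
    candidates, explore = led_exploration_distribution(layer_probs, top_k)
    return int(candidates[rng.choice(candidates.size, p=explore)])


# Hypothetical distributions over a 4-token vocabulary, earliest layer first.
layer_probs = [
    [0.40, 0.35, 0.15, 0.10],
    [0.60, 0.30, 0.06, 0.04],
    [0.95, 0.04, 0.007, 0.003],  # final layer: near one-hot
]
candidates, explore = led_exploration_distribution(layer_probs, top_k=2)
print(candidates, explore)
print(led_step(layer_probs, in_thinking=True, rng=np.random.default_rng(0)))
```

In this toy case the exploration distribution keeps the final layer's two candidates but gives the runner-up far more mass than the final layer alone would, which is the behaviour the article attributes to LED. In a real decoder, `in_thinking` would be cleared once the closing `</think>` marker has been generated.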

Experiments: Consistent Pass@k Improvements

The evaluation covers six benchmarks (GSM8K, MATH-500, AIME 2024, AIME 2025, GPQA-Diamond, LiveCodeBench v5) and five models (Qwen3-4B-Thinking, MiMo-7B-RL, Qwen3-30B-A3B-Thinking, QwQ-32B, DeepSeek-R1-Distill-Llama-8B). On average, LED raises pass@1 from 77.4% to 78.0% and pass@16 from 88.8% to 89.7%. Compared with strong baselines (DoLa, SoftThinking, SoftThinking-Gumbel), LED leads or matches them while keeping generation length essentially unchanged (e.g., 12,269 vs 12,277 tokens for Qwen3-4B-Thinking). Throughput stays at roughly 91.8% of standard decoding (8×H100, 16K context, batch size 128).

Temperature curves demonstrate that LED flips the previously negative alpha back to positive for all recent LRMs, restoring the usefulness of temperature scaling.
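The sign of alpha can be read off a straight-line fit of accuracy against temperature. The sketch below assumes alpha is an ordinary least-squares slope, which this summary does not confirm, and the accuracy values are made-up inputs that only exercise the function. They are not results from the paper.

```python
import numpy as np


def temperature_slope(temperatures, accuracies):
    """Least-squares slope of accuracy versus temperature (assumed definition of alpha)."""
    slope, _intercept = np.polyfit(temperatures, accuracies, deg=1)
    return float(slope)


temperatures = [0.6, 0.8, 1.0, 1.2]

# Made-up accuracy curves for illustration only.
rising = [0.80, 0.82, 0.84, 0.86]   # accuracy improves with temperature
falling = [0.86, 0.85, 0.84, 0.83]  # accuracy degrades with temperature

print(temperature_slope(temperatures, rising))
print(temperature_slope(temperatures, falling))
```

A positive slope means higher temperatures help; a slope near zero or below means extra temperature only adds noise.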

Ablation studies confirm that each component matters: dropping the "explore only during thinking" restriction lowers pass@1 by 0.58 points, removing the "exploit" branch causes a 14.7-point collapse and 33% longer generations, and omitting top-k filtering leads to degenerate loops.

Why Exploration Matters for RL Training

Online RL algorithms like GRPO generate multiple trajectories per prompt at each step. If generation explores too little, the trajectories and their rewards are nearly identical, so the advantage signals are tiny. Integrating LED into GRPO rollouts improves downstream accuracy from 41.99% to 43.10% when testing with standard decoding, and to 45.44% when LED is used in both training and testing. LED also reduces average rollout length by 10% and training time from 4.87 h to 4.44 h.
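The link between exploration and advantage size follows from how GRPO scores a group of rollouts: each reward is normalized by the group's mean and standard deviation. The sketch below shows that standard formulation with hypothetical binary rewards. The `eps` stabilizer and its value are assumptions, and the exact variant used in the paper's training runs is not stated in this summary.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward minus group mean, divided by group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical binary correctness rewards for 4 rollouts of one prompt.
no_diversity = [1.0, 1.0, 1.0, 1.0]    # every rollout reaches the same outcome
some_diversity = [1.0, 0.0, 0.0, 0.0]  # one rollout finds a different outcome

print(group_relative_advantages(no_diversity))
print(group_relative_advantages(some_diversity))
```

When every rollout in a group earns the same reward, all advantages are zero and the prompt contributes no gradient, which is why more diverse rollouts can make training more effective.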

Conclusion

The paper identifies that RL post-training compresses entropy in the final layer while intermediate layers retain it. By extracting, filtering, and aggregating this latent entropy, LED restores exploration without extra parameters, architecture changes, or significant compute overhead, and delivers stable pass@k gains across diverse models and benchmarks.
