Jun 21, 2026 · Artificial Intelligence

Why Post‑Training Makes Large Reasoning Models Overconfident and How LED Restores Exploration

The paper shows that reinforcement‑learning post‑training collapses the entropy of the final layer in large reasoning models, so raising the sampling temperature no longer restores exploration. It introduces Latent Exploration Decoding (LED), which recovers exploration from intermediate layers and yields consistent pass@k gains without extra training.

LED method · RL post‑training · entropy collapse

13 min read
