Machine Heart
Jun 21, 2026 · Artificial Intelligence
Why Post‑Training Makes Large Reasoning Models Overconfident and How LED Restores Exploration
The paper shows that reinforcement‑learning post‑training collapses the final‑layer entropy of large reasoning models, which makes higher sampling temperatures ineffective at restoring exploration. It introduces Latent Exploration Decoding (LED), which recovers exploration from intermediate layers and yields consistent pass@k gains without extra training.
LED method · RL post‑training · entropy collapse
13 min read
