Tagged articles

RL post‑training

Machine Heart
Jun 21, 2026 · Artificial Intelligence

Why Post‑Training Makes Large Reasoning Models Overconfident and How LED Restores Exploration

The paper shows that reinforcement‑learning post‑training collapses the entropy of the final layer's output distribution in large reasoning models, so raising the sampling temperature does little to restore diversity. It introduces Latent Exploration Decoding (LED), which recovers exploration from intermediate layers and yields consistent pass@k gains without extra training.

LED method · RL post‑training · entropy collapse
13 min read
Machine Heart
May 13, 2026 · Artificial Intelligence

Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study

A recent study by Tsinghua and collaborators dissects On‑Policy Distillation for large language models and finds that higher‑scoring teachers often fail to improve students unless the two models' thinking patterns align. It details token‑level overlap dynamics and failure cases, and proposes two practical remedies for rescuing ineffective distillation.

Large Language Models · Model Scaling · On‑Policy Distillation
9 min read