Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study
Recent research from Tsinghua and collaborators dissects On‑Policy Distillation (OPD) for large language models and finds that higher‑scoring teachers often fail to improve students unless their thinking patterns align. The study details token‑level overlap dynamics, failure cases, and two practical remedies for rescuing ineffective distillation.
Phenomenon: Larger Teachers Do Not Guarantee Better Students
On‑Policy Distillation (OPD) is widely adopted for post‑training of large language models because it supplies dense token‑level supervision. However, experiments show that replacing the teacher with a stronger one often yields no performance gain and can even cause a regression.
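For context, OPD typically minimizes a per‑token reverse KL between the student and the teacher on rollouts sampled from the student. A minimal sketch of that dense token‑level objective follows; the function and tensor names are illustrative assumptions, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def opd_token_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL(student || teacher), averaged over generated tokens.

    student_logits, teacher_logits: [batch, seq_len, vocab], scored on the
    same student-sampled rollout; mask: [batch, seq_len], 1 on response tokens.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)   # student log-probs
    log_p_t = F.log_softmax(teacher_logits, dim=-1)   # teacher log-probs
    # KL(student || teacher) at each position: sum_v p_s * (log p_s - log p_t)
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```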
Law 1 – Thinking‑Pattern Consistency
A weak base model, Qwen3‑1.7B‑Base, was distilled from two teachers of similar capability: Qwen3‑4B (Non‑thinking) and Qwen3‑4B‑Base‑GRPO. The teacher trained with GRPO shares a thinking pattern with the base student, resulting in a higher initial Overlap Ratio and markedly better final performance. The authors note that an early mismatch in thinking patterns is difficult to compensate for later.
Law 2 – Higher Scores ≠ New Knowledge
Across the DeepSeek and Qwen families, scaling the teacher size while keeping the same pipeline and data provides limited benefit. For example:
In the DeepSeek family, the RL‑enhanced teacher Skywork‑OR1‑Math‑7B recovers 16.9 % of the teacher‑student gap, whereas a larger but otherwise identical teacher, DeepSeek‑R1‑Distill‑7B, recovers only 5.3 %.
In the Qwen family, gap recovery is 58.6 % versus only 15.6 % with a larger teacher.
These results indicate that a larger teacher under the same pipeline often represents merely a higher‑capacity version of the same distribution, offering little new transferable signal.
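The gap‑recovery figures above are most naturally read as the fraction of the teacher‑student performance gap that distillation closes; a tiny sketch under that assumption (the paper's exact metric may differ):

```python
def gap_recovery(student_before: float, student_after: float, teacher: float) -> float:
    """Fraction of the teacher-student gap closed by distillation.

    E.g. a student moving 40 -> 45 against a teacher at 70 recovers
    (45 - 40) / (70 - 40) ≈ 16.7 % of the gap.
    """
    gap = teacher - student_before
    return (student_after - student_before) / gap if gap > 0 else 0.0
```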
Extreme Reverse‑Distillation Experiment
A reverse‑distillation test used a post‑RL student, JustRL‑1.5B, as the learner. The student was distilled from its own pre‑RL checkpoint, R1‑Distill‑1.5B, and, as a control, from a larger R1‑Distill‑7B teacher. In both settings the student's performance regressed to the pre‑RL level, and the regression curves were nearly identical. This demonstrates that parameter count alone does not provide additional useful information for OPD.
Mechanism: Token‑Level Overlap Drives Success
Dynamic monitoring of training revealed a clear pattern:
Successful OPD shows the Overlap Ratio of the top‑k predicted tokens (see the sketch after this list) rising from roughly 72 % to over 91 %, while the Entropy Gap shrinks rapidly.
Failed OPD exhibits flat Overlap Ratio and Entropy Gap throughout training.
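A minimal sketch of how these two diagnostics can be computed from teacher and student logits on the same rollout; the exact definition of the Overlap Ratio used here (intersection of the two top‑k sets) is an assumption, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def topk_overlap_ratio(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Average fraction of the student's top-k tokens that also appear in the
    teacher's top-k set, over all positions. Logits: [batch, seq_len, vocab]."""
    s_top = student_logits.topk(k, dim=-1).indices            # [B, T, k]
    t_top = teacher_logits.topk(k, dim=-1).indices            # [B, T, k]
    in_teacher = (s_top.unsqueeze(-1) == t_top.unsqueeze(-2)).any(dim=-1)  # [B, T, k]
    return in_teacher.float().mean()

@torch.no_grad()
def entropy_gap(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Mean difference between per-token student and teacher entropies."""
    def entropy(logits):
        log_p = F.log_softmax(logits, dim=-1)
        return -(log_p.exp() * log_p).sum(dim=-1)             # [B, T]
    return (entropy(student_logits) - entropy(teacher_logits)).mean()
```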
Ablation experiments that computed loss only on the overlapping tokens retained almost the full performance gain, whereas non‑overlapping tokens contributed virtually nothing to the gradient.
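A hedged sketch of that ablation, reusing opd_token_loss from the earlier sketch and masking the loss to positions whose top‑k sets intersect (the paper's exact masking rule may differ):

```python
import torch

def overlap_only_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      mask: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Restrict the OPD loss to positions whose top-k sets intersect
    (illustrative criterion)."""
    s_top = student_logits.topk(k, dim=-1).indices
    t_top = teacher_logits.topk(k, dim=-1).indices
    overlap = (s_top.unsqueeze(-1) == t_top.unsqueeze(-2)).any(-1).any(-1)  # [B, T]
    # Non-overlapping positions contribute no gradient, mirroring the ablation.
    return opd_token_loss(student_logits, teacher_logits, mask * overlap.float())
```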
Prescriptions for Recovering Failing Distillation
Cold‑Start via Off‑Policy Rollouts – Before OPD, perform lightweight supervised fine‑tuning on teacher‑generated rollouts. This raises the initial Overlap Ratio, enabling smoother OPD convergence and higher final performance.
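A minimal sketch of such a cold start, assuming a Hugging Face‑style causal LM and a dataset of teacher rollouts with a response mask (all field names are illustrative):

```python
import torch.nn.functional as F

def cold_start_sft_step(student, batch, optimizer):
    """One supervised step on teacher-generated (off-policy) rollouts.

    batch["input_ids"]: prompt + teacher response tokens, [B, T]
    batch["loss_mask"]: 1 on teacher response tokens, 0 elsewhere, [B, T]
    """
    input_ids = batch["input_ids"]
    logits = student(input_ids[:, :-1]).logits          # predict the next token
    targets = input_ids[:, 1:]
    mask = batch["loss_mask"][:, 1:].float()
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    loss = (nll * mask).sum() / mask.sum().clamp(min=1)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```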
Teacher‑Aligned Prompts – Use prompts that match the teacher's post‑training distribution in both template and content. This further accelerates Overlap Ratio growth and improves accuracy, but because it also speeds up entropy collapse, mixing in some out‑of‑distribution prompts is advisable to avoid premature collapse.
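One way to implement that mix, with the out‑of‑distribution fraction as a tunable knob (the 20 % default here is purely illustrative, not a value from the paper):

```python
import random

def build_prompt_mix(teacher_aligned, out_of_distribution,
                     ood_fraction: float = 0.2, n: int = 1000, seed: int = 0):
    """Sample a training prompt set that is mostly teacher-aligned but keeps
    some out-of-distribution prompts to slow entropy collapse."""
    rng = random.Random(seed)
    n_ood = int(n * ood_fraction)
    prompts = rng.choices(out_of_distribution, k=n_ood) + \
              rng.choices(teacher_aligned, k=n - n_ood)
    rng.shuffle(prompts)
    return prompts
```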
Scaling Limits and Reflections
Dense token‑level rewards decay sharply with generation depth. In 15 K‑token responses, entropy collapse occurs toward the end, turning teacher rewards into noise and causing training collapse. Moreover, a globally informative reward does not guarantee a locally optimizable gradient: failed teachers still achieve high AUROC in distinguishing correct versus incorrect rollouts, indicating that the reward landscape is flat locally despite containing global information.
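The global‑versus‑local distinction can be checked by scoring each rollout with a sequence‑level teacher reward (for example, the mean teacher log‑probability) and measuring AUROC against correctness. A sketch with scikit‑learn, assuming such a scalar reward is available per rollout:

```python
from sklearn.metrics import roc_auc_score

def teacher_reward_auroc(rollout_rewards, rollout_correct):
    """AUROC of a sequence-level teacher reward for separating correct from
    incorrect rollouts. A high value means the reward is globally informative
    even if its token-level gradients are too flat to optimize."""
    return roc_auc_score(rollout_correct, rollout_rewards)
```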
Conclusion
The decisive factors for effective OPD are alignment of thinking patterns between teacher and student and the provision of high‑overlap token signals, not merely teacher size or score.
Paper: https://arxiv.org/abs/2604.13016
Code: https://github.com/thunlp/OPD