Why Bigger Teachers Don’t Teach Better: Tsinghua’s On‑Policy Distillation Study

Recent research from Tsinghua University and collaborators dissects On‑Policy Distillation (OPD) for large language models. The study finds that higher‑scoring teachers often fail to improve students unless their thinking patterns align, and it details token‑level overlap dynamics, failure cases, and two practical remedies for rescuing ineffective distillation.

Machine Heart

Phenomenon: Larger Teachers Do Not Guarantee Better Students

On‑Policy Distillation (OPD) is widely adopted for post‑training of large language models because it supplies dense token‑level supervision. However, experiments show that replacing a teacher with a stronger one often yields no performance gain or even regression.
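To make "dense token‑level supervision" concrete, here is a minimal sketch of a per‑token OPD objective: a reverse KL divergence between the student's and teacher's next‑token distributions, evaluated at every position of a student‑generated (on‑policy) rollout. The function names and the use of reverse KL are illustrative assumptions, not the paper's exact formulation.

```python
import math

def _softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def opd_token_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL(student || teacher) over one rollout.

    Each argument is a list of per-position logit vectors (one vector
    per generated token). The student is pulled toward the teacher at
    every position, which is the dense signal OPD provides.
    """
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = _softmax(s_row)
        q = _softmax(t_row)
        total += sum(pi * (math.log(pi) - math.log(qi))
                     for pi, qi in zip(p, q))
    return total / len(student_logits)
```

Compared with a sparse sequence‑level reward, this loss supplies a gradient at every token, which is why OPD is attractive for post‑training.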

Law 1 – Thinking‑Pattern Consistency

A weak base model, Qwen3‑1.7B‑Base, was distilled from each of two teachers of similar capability: Qwen3‑4B (Non‑thinking) and Qwen3‑4B‑Base‑GRPO. The teacher trained with GRPO shares a thinking pattern with the base student, which yields a higher initial Overlap Ratio and markedly better final performance. The authors note that early mismatches in thinking patterns are difficult to compensate for later.

Law 2 – Higher Scores ≠ New Knowledge

Across the DeepSeek and Qwen families, scaling the teacher size while keeping the same pipeline and data provides limited benefit. For example:

In the DeepSeek family, the RL‑enhanced teacher Skywork‑OR1‑Math‑7B recovers 16.9% of the teacher‑student gap, whereas the larger but otherwise identical teacher DeepSeek‑R1‑Distill‑7B recovers only 5.3%.

In the Qwen family, the gap recovery is 58.6% for the aligned teacher versus 15.6% for a larger teacher.

These results indicate that a larger teacher under the same pipeline often represents merely a higher‑capacity version of the same distribution, offering little new transferable signal.
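The "gap recovery" figures above can be read as the fraction of the teacher‑student performance gap that distillation closes. A plausible formula (an assumption on my part, consistent with how such percentages are usually reported) is:

```python
def gap_recovery(student_before, student_after, teacher):
    """Percentage of the teacher-student gap closed by distillation.

    student_before: student score before distillation
    student_after:  student score after distillation
    teacher:        teacher score (upper bound for this comparison)
    """
    return 100.0 * (student_after - student_before) / (teacher - student_before)
```

For example, a student moving from 40 to 50 against a teacher at 100 recovers about 16.7% of the gap, close to the DeepSeek‑family figure quoted above.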

Extreme Reverse‑Distillation Experiment

A reverse‑distillation test used a post‑RL student, JustRL‑1.5B, as the learner. The student was asked to learn from its own pre‑RL checkpoint R1‑Distill‑1.5B and, as a control, from the larger R1‑Distill‑7B teacher. Both settings caused the student's performance to regress to the pre‑RL level, and the regression curves were nearly identical. This demonstrates that parameter count alone does not provide additional useful information for OPD.

Mechanism: Token‑Level Overlap Drives Success

Dynamic monitoring of training revealed a clear pattern:

Successful OPD shows the Overlap Ratio of the top‑k predicted tokens rising from ~72% to >91% while the Entropy Gap shrinks rapidly.

Failed OPD exhibits flat Overlap Ratio and Entropy Gap throughout training.
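The two monitoring quantities can be sketched as follows. The exact definitions below (fractional top‑k set overlap, mean absolute entropy difference) are plausible reconstructions, not the paper's verbatim formulas.

```python
import math

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def topk(logits, k):
    """Indices of the k highest-scoring tokens in one logit vector."""
    return set(sorted(range(len(logits)), key=lambda i: -logits[i])[:k])

def overlap_ratio(student_logits, teacher_logits, k=2):
    """Mean fractional overlap of student/teacher top-k sets per position."""
    ratios = [len(topk(s, k) & topk(t, k)) / k
              for s, t in zip(student_logits, teacher_logits)]
    return sum(ratios) / len(ratios)

def entropy_gap(student_logits, teacher_logits):
    """Mean absolute difference of per-token Shannon entropies."""
    def H(logits):
        p = _softmax(logits)
        return -sum(pi * math.log(pi) for pi in p)
    gaps = [abs(H(s) - H(t)) for s, t in zip(student_logits, teacher_logits)]
    return sum(gaps) / len(gaps)
```

Tracking these two curves during training is what distinguishes the successful runs (rising overlap, shrinking gap) from the failed ones (both flat).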

Ablation experiments that computed loss only on the overlapping tokens retained almost the full performance gain, whereas non‑overlapping tokens contributed virtually nothing to the gradient.
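The ablation can be sketched as a masked variant of the per‑token loss: positions where the student's and teacher's top‑k sets do not intersect are simply dropped from the objective. This is an illustrative sketch of the idea, not the paper's exact recipe.

```python
import math

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def masked_opd_loss(student_logits, teacher_logits, k=2):
    """Reverse-KL loss restricted to overlapping positions.

    A position contributes to the loss only if the student's and
    teacher's top-k token sets intersect; the remaining positions are
    masked out, mirroring the ablation described above.
    """
    def topk(row):
        return set(sorted(range(len(row)), key=lambda i: -row[i])[:k])
    total, kept = 0.0, 0
    for s_row, t_row in zip(student_logits, teacher_logits):
        if not (topk(s_row) & topk(t_row)):
            continue  # non-overlapping position: contributes no gradient
        p, q = _softmax(s_row), _softmax(t_row)
        total += sum(pi * (math.log(pi) - math.log(qi))
                     for pi, qi in zip(p, q))
        kept += 1
    return total / kept if kept else 0.0
```

If this masked loss retains nearly the full gain, the non‑overlapping tokens were contributing essentially nothing, which is the paper's finding.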

Prescriptions for Recovering Failing Distillation

Cold‑Start via Off‑Policy Rollouts – Before OPD, perform a lightweight supervised fine‑tuning on teacher‑generated rollouts. This raises the initial Overlap Ratio, enabling smoother OPD convergence and higher final performance.
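As a toy illustration of the cold‑start idea, the sketch below "fine‑tunes" a trivial bigram next‑token table on teacher‑generated rollouts by counting; a real implementation would run supervised fine‑tuning on the actual student model, but the structure (fit on teacher outputs first, then start OPD) is the same. All names here are hypothetical.

```python
from collections import Counter, defaultdict

def cold_start_sft(teacher_rollouts):
    """Toy cold-start: fit a bigram next-token table to teacher rollouts.

    teacher_rollouts: list of token sequences generated by the teacher.
    Returns {prev_token: {next_token: probability}}, a stand-in for one
    epoch of off-policy SFT that nudges the student toward the teacher's
    distribution before on-policy distillation begins.
    """
    counts = defaultdict(Counter)
    for seq in teacher_rollouts:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, c in counts.items():
        total = sum(c.values())
        model[prev] = {tok: n / total for tok, n in c.items()}
    return model
```

The point of the phase is purely to raise the initial Overlap Ratio; the heavy lifting still happens during OPD itself.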

Teacher‑Aligned Prompts – Use prompts that match the teacher’s post‑training distribution in both template and content. This further accelerates Overlap growth and accuracy, but because it speeds entropy collapse, mixing in some out‑of‑distribution prompts is advisable to avoid premature collapse.
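The recommended mixing can be sketched as a simple sampler that draws mostly teacher‑aligned prompts but keeps a small out‑of‑distribution fraction. The `ood_frac` knob and its default are my assumptions, not values from the paper.

```python
import random

def sample_prompt(teacher_aligned, out_of_dist, ood_frac=0.2, rng=random):
    """Draw one training prompt, mixing in OOD prompts at rate ood_frac.

    teacher_aligned: prompts matching the teacher's post-training
                     distribution in template and content.
    out_of_dist:     prompts outside that distribution, included to
                     slow premature entropy collapse.
    """
    pool = out_of_dist if rng.random() < ood_frac else teacher_aligned
    return rng.choice(pool)
```

Setting `ood_frac=0.0` reproduces the pure teacher‑aligned regime that accelerates overlap growth but risks early entropy collapse.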

Scaling Limits and Reflections

Dense token‑level rewards decay sharply with generation depth. In 15K‑token responses, entropy collapse occurs toward the end, turning teacher rewards into noise and causing training collapse. Moreover, a globally informative reward does not guarantee a locally optimizable gradient: even failed teachers achieve high AUROC in distinguishing correct from incorrect rollouts, indicating that the reward landscape contains global information yet remains locally flat.
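The AUROC claim is worth unpacking: it measures only whether the teacher's scores *rank* correct rollouts above incorrect ones, saying nothing about local gradient quality. A rank‑based AUROC can be computed directly:

```python
def auroc(scores_correct, scores_incorrect):
    """Rank-based AUROC: probability that a randomly chosen correct
    rollout outscores a randomly chosen incorrect one (ties count half).
    """
    wins = 0.0
    for c in scores_correct:
        for i in scores_incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(scores_correct) * len(scores_incorrect))
```

A teacher can score near‑perfect AUROC while its per‑token reward surface is flat around the student's current policy, which is exactly the failure mode the authors describe.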

Conclusion

The decisive factors for effective OPD are alignment of thinking patterns between teacher and student and the provision of high‑overlap token signals, not merely teacher size or score.

Paper: https://arxiv.org/abs/2604.13016

Code: https://github.com/thunlp/OPD


Tags: large-language-models · Model Scaling · On-Policy Distillation · RL Post-Training · Teacher-Student Alignment · Token Overlap