Network Intelligence Research Center (NIRC)
May 25, 2026 · Artificial Intelligence
What Does On-Policy Distillation Really Teach Large Language Models?
On-Policy Distillation (OPD) trains a large language model by having the student generate its own reasoning paths while a teacher supplies token-level guidance. This gives a denser training signal than RL, but it can fail when the teacher's and the student's reasoning diverge, as a recent THUNLP study details.
Distillation Metrics · Model Alignment · On-Policy Distillation
