What Does On-Policy Distillation Really Teach Large Language Models?

On-Policy Distillation (OPD) trains a large language model by having the student generate its own reasoning paths while a teacher supplies token-level guidance. This gives a denser training signal than reinforcement learning (RL), but it can fail when the teacher's and the student's reasoning diverge, as detailed in a recent THUNLP study.

Distillation Metrics · Post-Training · Token-level Supervision

