What Is On-Policy Distillation? A Deep Dive into On-Policy and Self-Distillation
This article explains On-Policy Distillation and derives its forward- and reverse-KL gradients. It then introduces Self-Distillation, in which the policy serves as its own teacher, covers practical implementation details such as injecting extra knowledge and stabilizing the teacher with EMA or trust-region updates, and highlights the resulting benefits: reduced catastrophic forgetting, fewer "Aha" moments, and a narrower train-test gap, especially for larger models.
1. Goal and KL Objective of On-Policy Distillation
On-Policy Distillation (OPD) minimizes the KL divergence between a student policy and a teacher policy, evaluated on trajectories sampled from the student's own distribution. Both reverse-KL and forward-KL formulations appear in the literature; references [1] and [2] use reverse KL, while [3] adopts forward KL.
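As a concrete reference point, the reverse-KL form of this objective can be written as follows (the notation here is assumed for illustration and may differ from the papers'):

```latex
% Reverse-KL on-policy distillation objective (illustrative notation).
% Prompts x come from a dataset D, responses y are sampled from the student
% pi_theta, and the teacher pi_T is held fixed.
\mathcal{L}_{\mathrm{OPD}}(\theta) =
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ \sum_{t=1}^{|y|}
    D_{\mathrm{KL}}\!\left(
      \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_T(\cdot \mid x, y_{<t})
    \right)
  \right]
```

The forward-KL variant swaps the two arguments of the per-token KL while keeping the expectation over student-sampled trajectories.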
2. Gradient Forms
For the forward KL, the gradient can be derived explicitly (the article shows the expression), and a similar derivation applies to the reverse KL. The author notes that both gradients resemble standard reinforcement-learning objectives and differ only in the weighting term: RL weights the score function by a reward or advantage, whereas OPD weights it by the KL term.
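To make the analogy concrete, here is a minimal sketch of the reverse-KL gradient under the sequence-level objective above (again with assumed notation):

```latex
% Policy-gradient form of the reverse-KL gradient (illustrative derivation).
% Apply the score-function identity and use E_{y ~ pi_theta}[grad log pi_theta(y|x)] = 0:
\nabla_\theta \,
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[
    \log \frac{\pi_\theta(y \mid x)}{\pi_T(y \mid x)}
  \right]
  =
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[
    \nabla_\theta \log \pi_\theta(y \mid x) \cdot
    \log \frac{\pi_\theta(y \mid x)}{\pi_T(y \mid x)}
  \right]
```

This has the shape of a REINFORCE gradient in which the scalar weight is the log-ratio (the KL term) rather than a reward or advantage; the forward-KL case instead reduces, per token, to a cross-entropy toward the teacher's distribution on the student's own trajectories.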
3. On-Policy Self‑Distillation
Self‑Distillation (SD) treats the policy itself as the teacher. The key components are a stop‑gradient operator and an extra‑knowledge term, both of which can be obtained from the policy’s own outputs.
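A minimal PyTorch-style sketch of one self-distillation step is shown below, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the function and variable names are illustrative, not taken from the papers:

```python
# Minimal sketch of one self-distillation step. The teacher is the same
# network conditioned on an enriched context, with gradients stopped.
import torch
import torch.nn.functional as F

def self_distillation_loss(model, student_ids, teacher_ids, response_len):
    """KL between the policy's predictions with and without extra knowledge.

    student_ids : prompt + sampled response (plain context)
    teacher_ids : prompt + extra knowledge + the same response (enriched context)
    Both sequences end with the same `response_len` response tokens.
    """
    # Logits at position t predict token t+1, so the logits scoring the
    # response tokens sit one position to the left of those tokens.
    sl = slice(-response_len - 1, -1)

    # Teacher view: richer context, gradients stopped (the stop-gradient operator).
    with torch.no_grad():
        teacher_logits = model(teacher_ids).logits[:, sl, :]

    # Student view: plain context; gradients flow through these logits.
    student_logits = model(student_ids).logits[:, sl, :]

    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Reverse KL, i.e. KL(student || teacher), averaged over the batch.
    return F.kl_div(teacher_logp, student_logp, log_target=True, reduction="batchmean")
```

Note the use of `torch.no_grad()` as the stop-gradient: the enriched-context pass shapes the target distribution but receives no parameter updates.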
4. Implementation Details
4.1 Introducing Extra Knowledge
Method 1: Directly feed ground‑truth information to the policy as a reference.
Method 2: Derive the extra knowledge from environment feedback. Both methods are sketched below.
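As a rough illustration of how the teacher's context might be enriched before the forward pass, consider the templates below; they are hypothetical, since the referenced papers do not specify exact formats:

```python
# Hypothetical templates for injecting extra knowledge into the teacher's
# context; the actual formats in the referenced papers may differ.
def build_teacher_prompt(prompt: str, extra: str, source: str) -> str:
    if source == "ground_truth":
        # Method 1: show the reference (ground-truth) answer to the policy
        # when it acts as its own teacher.
        hint = f"Reference answer: {extra}"
    elif source == "environment":
        # Method 2: show feedback returned by the environment, e.g. a
        # verifier message or unit-test result for a previous attempt.
        hint = f"Environment feedback: {extra}"
    else:
        raise ValueError(f"unknown extra-knowledge source: {source}")
    return f"{prompt}\n\n[Extra knowledge]\n{hint}\n"
```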
4.2 Determining Teacher Parameters
Using a frozen copy of the policy as the teacher works in early training but collapses later.
Keeping the teacher identical to the current policy is possible but performs worse than an EMA-smoothed teacher.
Trust-region updates yield a teacher whose stability is comparable to the EMA teacher's, preventing rapid parameter drift.
The article illustrates the trust-region update rules with diagrams; a sketch of both the EMA and trust-region teacher updates follows.
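The code below is a minimal sketch of the two stabilization schemes, keeping a separate teacher copy of the weights; the EMA rule is standard, while the trust-region rule shown is one plausible clipped-step formulation assumed for illustration:

```python
# Illustrative teacher-update rules for a separate teacher copy of the policy.
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # Exponential moving average: the teacher trails the student smoothly.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

@torch.no_grad()
def trust_region_update(teacher, student, max_step=1e-3):
    # Move the teacher toward the student, but cap the per-parameter step
    # so the teacher cannot drift rapidly between updates.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.add_((s_p - t_p).clamp_(-max_step, max_step))
```

Both rules keep the teacher close to, yet smoother than, the current policy.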
5. Advantages of On-Policy Self‑Distillation
5.1 Mitigating Catastrophic Forgetting
Self-Distillation bridges the distribution gap during language-model fine-tuning, as shown in the earlier Self-Distilled Reasoner work [3].
5.2 Reducing “Aha” Moments
5.3 Narrowing Train‑Test Gap
Exposing the student to the test‑time distribution during training reduces exposure bias.
6. Scaling Self‑Distillation
Empirically, larger models gain more from Self-Distillation relative to GRPO, because in-context learning ability improves with scale.
References
[1] Reinforcement Learning via Self‑Distillation
https://arxiv.org/html/2601.20802
[2] Self‑Distillation Enables Continual Learning
https://arxiv.org/html/2601.19897
[3] Self‑Distilled Reasoner: On‑Policy Self‑Distillation for Large Language Models
https://arxiv.org/html/2601.18734
