What Is On-Policy Distillation? A Deep Dive into On-Policy and Self-Distillation

This article explains On-Policy Distillation, derives its forward- and reverse-KL gradients, introduces Self-Distillation, in which the policy serves as its own teacher, covers practical implementation details such as extra-knowledge injection and EMA or trust-region teacher stabilization, and highlights benefits such as reduced catastrophic forgetting, fewer abrupt "Aha" moments, and a narrower train-test gap, especially for larger models.


1. Goal and KL Objective of On-Policy Distillation

On-Policy Distillation (OPD) minimizes the KL divergence between a student policy and a teacher policy evaluated on trajectories sampled from the student’s own distribution. Both reverse KL and forward KL formulations appear in the literature; references [1‑2] use reverse KL, while [3] adopts forward KL.
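In symbols (the notation here is assumed, not taken from the article): with student policy π_θ, teacher π_T, prompt x, and rollouts y sampled from the student, the two objectives are

```latex
% Reverse KL (used in [1-2]): student-to-teacher KL on student rollouts.
\mathcal{L}_{\text{reverse}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_t D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_T(\cdot \mid x, y_{<t}) \big) \right]

% Forward KL (used in [3]): teacher-to-student KL on the same rollouts.
\mathcal{L}_{\text{forward}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_t D_{\mathrm{KL}}\!\big( \pi_T(\cdot \mid x, y_{<t}) \,\big\|\, \pi_\theta(\cdot \mid x, y_{<t}) \big) \right]
```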

2. Gradient Forms

For the forward KL, the article derives the gradient explicitly; a similar derivation yields the reverse-KL gradient. The author notes that both gradients resemble standard reinforcement-learning objectives, differing only in the weighting term: RL weights samples by a reward or advantage, whereas OPD weights them by the KL term.
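As a sketch of what these expressions typically look like (a standard derivation under the notation above; the article's exact expressions may differ): for the reverse KL, write the sequence-level objective, apply the log-derivative trick, and use the fact that E_{π_θ}[∇_θ log π_θ] = 0; for the forward KL, differentiate the per-token KL at a fixed (stop-gradient) prefix.

```latex
% Reverse KL on student samples: a policy-gradient form whose "reward"
% is the negative log-ratio, i.e. the KL weighting mentioned above.
\nabla_\theta \mathcal{L}_{\text{reverse}}
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \nabla_\theta \log \pi_\theta(y \mid x)
           \cdot \log \frac{\pi_\theta(y \mid x)}{\pi_T(y \mid x)} \right]

% Forward KL at a stop-gradient prefix: a cross-entropy against the
% teacher's per-token distribution; the gradient through the
% prefix-sampling distribution is typically dropped.
\nabla_\theta \mathcal{L}_{\text{forward}}
  \approx - \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_t \mathbb{E}_{a \sim \pi_T(\cdot \mid x, y_{<t})}
           \big[ \nabla_\theta \log \pi_\theta(a \mid x, y_{<t}) \big] \right]
```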

3. On-Policy Self‑Distillation

Self‑Distillation (SD) treats the policy itself as the teacher. The key components are a stop‑gradient operator and an extra‑knowledge term, both of which can be obtained from the policy’s own outputs.
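To make the stop-gradient and extra-knowledge pieces concrete, here is a minimal sketch assuming a HuggingFace-style causal LM interface (policy(input_ids).logits); the function and tensor names are illustrative, not taken from the papers.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(policy, prompt_ids, knowledge_prompt_ids, rollout_ids):
    """Per-token forward KL from a stop-gradient teacher to the student.

    The same policy is run twice over the student's own rollout:
      * student pass: plain prompt, gradients enabled;
      * teacher pass: prompt augmented with extra knowledge
        (e.g. a reference answer), wrapped in stop-gradient.
    """
    r = rollout_ids.size(-1)

    # Student pass: next-token logits at the rollout positions.
    p = prompt_ids.size(-1)
    student_logits = policy(torch.cat([prompt_ids, rollout_ids], dim=-1)).logits
    student_logits = student_logits[:, p - 1 : p - 1 + r, :]

    # Teacher pass: same policy, extra knowledge in context, no gradients.
    with torch.no_grad():
        k = knowledge_prompt_ids.size(-1)
        teacher_logits = policy(
            torch.cat([knowledge_prompt_ids, rollout_ids], dim=-1)
        ).logits[:, k - 1 : k - 1 + r, :]

    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # Forward KL per token: sum_a p_T(a) * (log p_T(a) - log p_S(a)).
    kl = (teacher_probs * (teacher_probs.clamp_min(1e-8).log() - student_logp)).sum(-1)
    return kl.mean()
```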

4. Implementation Details

4.1 Introducing Extra Knowledge

Method 1: Directly feed ground‑truth information to the policy as a reference.

Method 2: Derive the extra knowledge from environment feedback.
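A minimal sketch of how the teacher's context might be assembled under either method; the wording and the program-execution example are illustrative assumptions, not the article's exact recipe.

```python
from typing import Optional

def build_teacher_context(question: str,
                          ground_truth: Optional[str] = None,
                          env_feedback: Optional[str] = None) -> str:
    """Assemble the teacher's augmented context (illustrative wording).

    Method 1: append ground-truth information as a reference.
    Method 2: append feedback obtained from the environment
              (e.g. the result of executing a generated program).
    """
    parts = [question]
    if ground_truth is not None:
        parts.append(f"Reference answer: {ground_truth}")
    if env_feedback is not None:
        parts.append(f"Environment feedback: {env_feedback}")
    return "\n".join(parts)
```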

4.2 Determining Teacher Parameters

Using a frozen policy works in early training but collapses later.

Keeping the teacher identical to the current policy is possible but performs worse than an EMA‑smoothed teacher.

Trust-region updates constrain how far the teacher's parameters can move at each step, yielding stability comparable to EMA and preventing rapid parameter drift.
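As one way to picture the EMA and trust-region options, here is a short sketch; the decay, step cap, and the specific clipping rule are assumptions, and the trust-region variant is only one possible interpretation of constraining teacher drift.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, policy, mode="ema", ema_decay=0.99, max_step=1e-2):
    """Illustrative teacher-parameter updates (constants are assumptions)."""
    for t_param, p_param in zip(teacher.parameters(), policy.parameters()):
        if mode == "ema":
            # Exponential moving average of the policy weights.
            t_param.mul_(ema_decay).add_(p_param, alpha=1.0 - ema_decay)
        elif mode == "trust_region":
            # Move toward the policy, but cap each parameter's per-step
            # change so the teacher cannot drift too quickly.
            step = (p_param - t_param).clamp(-max_step, max_step)
            t_param.add_(step)
```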

5. Advantages of On-Policy Self‑Distillation

5.1 Mitigating Catastrophic Forgetting

Self-Distillation bridges the distribution gap during language-model fine-tuning, as shown in earlier work (Self-Distilled Reasoner [3]).

5.2 Reducing “Aha” Moments

5.3 Narrowing Train‑Test Gap

Exposing the student to the test‑time distribution during training reduces exposure bias.

6. Scaling Self‑Distillation

Empirically, larger models gain more from Self-Distillation relative to GRPO, because in-context learning ability improves with scale.

References

[1] Reinforcement Learning via Self-Distillation. https://arxiv.org/html/2601.20802
[2] Self-Distillation Enables Continual Learning. https://arxiv.org/html/2601.19897
[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. https://arxiv.org/html/2601.18734