What Does On-Policy Distillation Really Teach Large Language Models?

On-Policy Distillation (OPD) trains a large language model by letting the student generate its own reasoning trajectories while the teacher supplies token-level guidance on them. This gives denser training signals than reinforcement learning, but OPD can fail when teacher and student reasoning diverge, as a recent THUNLP study details.

Network Intelligence Research Center (NIRC)

May 25, 2026

On-Policy Distillation Overview

On-Policy Distillation (OPD) has become a popular post‑training technique for large language models. Unlike traditional offline distillation, OPD does not ask the student to imitate pre‑generated teacher answers; instead, the student generates its own inference trajectories and the teacher provides token‑level supervision on those trajectories.

Intuition and Comparison

The intuition is straightforward: wherever the student makes a mistake, the teacher corrects it, and training follows the student‑generated trajectories. Compared with offline distillation, OPD aligns more closely with the model’s real inference process; compared with reinforcement learning, it can provide denser and more stable token‑level training signals.

When OPD Fails

OPD is not universally effective. If the student generates states that lie outside the teacher's familiar reasoning paths, the teacher's guidance on those states may be ineffective. In the paper's experiments, a stronger teacher does not always improve the student, and in some cases the student's performance degrades.

Study by THUNLP

The THUNLP paper "Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe" investigates three questions: when is OPD effective, what does it actually learn, and how can its failures be fixed?

Basic Mechanism

OPD’s key distinction from offline distillation is that the training distribution is determined by the student itself. For each prefix the student generates, the teacher supplies the probability distribution of the next token. This reduces the mismatch between training and inference distributions because the student always sees states it will actually encounter.
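The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the article does not say which divergence is used, so the per-token reverse KL, KL(student || teacher), is assumed here, and `toy_student`, `toy_teacher`, and `opd_rollout_loss` are hypothetical names standing in for real models and a real training step.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution (assumed loss)."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def opd_rollout_loss(student_logits, teacher_logits, prompt, num_tokens, rng):
    """Roll out the student and score every visited prefix with the teacher."""
    prefix = list(prompt)
    losses = []
    for _ in range(num_tokens):
        s = softmax(student_logits(prefix))
        t = softmax(teacher_logits(prefix))
        losses.append(reverse_kl(s, t))
        # On-policy step: the next token is sampled from the student, not the teacher.
        next_token = rng.choices(range(len(s)), weights=s, k=1)[0]
        prefix.append(next_token)
    return sum(losses) / len(losses), prefix


# Hypothetical toy models over a 4-token vocabulary; logits depend on the last token.
def toy_student(prefix):
    return [1.0 if i == (prefix[-1] + 1) % 4 else 0.0 for i in range(4)]


def toy_teacher(prefix):
    return [2.0 if i == (prefix[-1] + 1) % 4 else 0.0 for i in range(4)]


loss, trajectory = opd_rollout_loss(toy_student, toy_teacher, [0], 5, random.Random(0))
same_loss, _ = opd_rollout_loss(toy_student, toy_student, [0], 5, random.Random(0))
```

In a real setup the averaged loss would be backpropagated through the student only. The point of the sketch is the data flow: the prefixes come from the student's own sampling, and the teacher is queried only on those prefixes.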

Root Cause of Failure

The root cause of failure is that when the student's generated states fall outside the teacher's area of expertise, the teacher's token-level guidance on those states may no longer be reliable.

Experimental Findings

The experiments distilled the same student from two teachers: JustRL-1.5B (smaller) and R1-Distill-7B (larger). Although the 7B teacher is stronger, the student improved consistently only with the 1.5B teacher. The authors analyze this result with three token-level dynamic metrics:

Overlap Ratio – overlap of high‑probability tokens between teacher and student.

Overlap‑Token Advantage – probability gap on overlapping tokens.

Absolute Entropy Gap – local confidence difference.

In successful OPD runs, all three metrics improve steadily; in failed runs they stagnate, indicating incompatibility between teacher reasoning and student‑visited states.
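To make the three metrics concrete, here is a rough sketch of how they could be computed for a single next-token position. The article gives only one-line descriptions, so the exact formulas are assumptions: the overlap ratio is taken as the shared fraction of the two top-k sets, the overlap-token advantage as the mean teacher-minus-student log-probability on shared tokens, and the entropy gap as the absolute difference of the two entropies. The paper's definitions may differ.

```python
import math


def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def overlap_ratio(student, teacher, k):
    """Fraction of top-k tokens shared by student and teacher (assumed form)."""
    return len(top_k_ids(student, k) & top_k_ids(teacher, k)) / k


def overlap_token_advantage(student, teacher, k, eps=1e-12):
    """Mean log-probability gap (teacher minus student) on shared top-k tokens."""
    shared = top_k_ids(student, k) & top_k_ids(teacher, k)
    if not shared:
        return 0.0
    gaps = [math.log(teacher[i] + eps) - math.log(student[i] + eps) for i in shared]
    return sum(gaps) / len(gaps)


def absolute_entropy_gap(student, teacher):
    """Absolute difference in local confidence, measured by entropy."""
    return abs(entropy(teacher) - entropy(student))


# Illustrative distributions over a 4-token vocabulary (made-up numbers).
student = [0.5, 0.3, 0.1, 0.1]
teacher = [0.6, 0.05, 0.3, 0.05]

ratio = overlap_ratio(student, teacher, k=2)
advantage = overlap_token_advantage(student, teacher, k=2)
gap = absolute_entropy_gap(student, teacher)
```

In practice these per-position values would be averaged over the tokens of student-generated trajectories and tracked across training steps, which is what lets a stagnating run be distinguished from an improving one.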

Where Effective Gradients Come From

The authors split tokens into two groups: Overlap Top-k tokens (both teacher and student assign high probability) and Non-overlap Top-k tokens (the teacher ranks them highly but the student does not). Optimizing only the overlap tokens yields performance close to full OPD, while optimizing only the non-overlap tokens performs markedly worse. This suggests that OPD mainly calibrates token probabilities in regions where the student already aligns with the teacher, rather than learning uniformly over the whole vocabulary.
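The split can be illustrated with a small sketch. It assumes the two groups are built from top-k sets at each position and that "optimizing only" a group means restricting the per-token loss terms to that group; the reverse-KL form and the function names are illustrative assumptions, not details confirmed by the article.

```python
import math


def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def split_top_k(student, teacher, k):
    """Return (overlap, non_overlap) token sets for one position."""
    student_top = top_k_ids(student, k)
    teacher_top = top_k_ids(teacher, k)
    overlap = student_top & teacher_top        # both models rank these highly
    non_overlap = teacher_top - student_top    # only the teacher ranks these highly
    return overlap, non_overlap


def restricted_reverse_kl(student, teacher, token_ids, eps=1e-12):
    """Reverse-KL terms summed over a chosen token subset (assumed restriction)."""
    return sum(
        student[i] * (math.log(student[i] + eps) - math.log(teacher[i] + eps))
        for i in token_ids
    )


# Illustrative distributions over a 4-token vocabulary (made-up numbers).
student = [0.5, 0.3, 0.1, 0.1]
teacher = [0.6, 0.05, 0.3, 0.05]

overlap, non_overlap = split_top_k(student, teacher, k=2)
overlap_loss = restricted_reverse_kl(student, teacher, overlap)
non_overlap_loss = restricted_reverse_kl(student, teacher, non_overlap)
full_loss = restricted_reverse_kl(student, teacher, range(len(student)))
```

Note that a restricted sum is not itself a divergence and can be negative; only the full sum over the vocabulary is guaranteed to be non-negative.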

Conditions for Success

Two conditions are required for OPD to succeed:

Teacher and student must share compatible reasoning patterns; otherwise the teacher’s token distribution cannot effectively guide the student.

The teacher must provide genuine information gain. A higher‑scoring teacher that produces a token distribution very similar to the student’s adds little new knowledge.
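One way to turn these two conditions into a quick pre-check is sketched below. Everything specific here is an assumption made for illustration: top-k overlap is used as a stand-in for "compatible reasoning patterns", the reverse KL between student and teacher as a rough proxy for "information gain", and the thresholds `min_overlap` and `min_divergence` are arbitrary hypothetical values. The article does not prescribe such a test.

```python
import math


def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher); used here only as a proxy for information gain."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student, teacher)
    )


def screen_teacher(pairs, k=2, min_overlap=0.5, min_divergence=0.01):
    """Summarize (student, teacher) distributions gathered on student prefixes."""
    overlaps = [len(top_k_ids(s, k) & top_k_ids(t, k)) / k for s, t in pairs]
    divergences = [reverse_kl(s, t) for s, t in pairs]
    mean_overlap = sum(overlaps) / len(overlaps)
    mean_divergence = sum(divergences) / len(divergences)
    return {
        "mean_overlap": mean_overlap,
        "mean_divergence": mean_divergence,
        "compatible": mean_overlap >= min_overlap,        # hypothetical threshold
        "informative": mean_divergence >= min_divergence,  # hypothetical threshold
    }


# Made-up distributions over a 4-token vocabulary.
student = [0.5, 0.3, 0.1, 0.1]
near_copy = [0.5, 0.3, 0.1, 0.1]     # agrees with the student, adds nothing new
useful = [0.7, 0.2, 0.05, 0.05]      # same top tokens, sharper preferences
mismatched = [0.05, 0.05, 0.3, 0.6]  # prefers entirely different tokens

report_copy = screen_teacher([(student, near_copy)])
report_useful = screen_teacher([(student, useful)])
report_mismatch = screen_teacher([(student, mismatched)])
```

Here the near-copy teacher passes the compatibility check but fails the information check, the mismatched teacher fails compatibility, and only the third teacher passes both.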

Practical Recommendations

A strong teacher is not automatically a good teacher; assess whether the teacher can deliver useful signals on the student’s trajectories.

To mitigate OPD failures, the paper suggests two practical directions:

Off-policy cold start: first train the student on offline or teacher-generated data until it acquires a basic reasoning pattern, then switch to OPD.

Align teacher prompts: ensure the teacher and student use the same prompt format and reasoning template, as OPD is sensitive to prompt mismatches.
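Both recommendations are simple to wire into a training script. The sketch below assumes a fixed step count for the cold-start phase and a single shared prompt template; `cold_start_steps`, `PROMPT_TEMPLATE`, and the function names are hypothetical, and the article does not specify how or when the switch to OPD should be triggered.

```python
# Hypothetical shared template; the key point is that both models receive the same one.
PROMPT_TEMPLATE = "Question: {question}\nThink step by step, then give the final answer.\n"


def training_phase(step, cold_start_steps):
    """Off-policy warm-up first, then on-policy distillation."""
    return "off_policy" if step < cold_start_steps else "on_policy"


def build_prompt(question):
    """Format a question with the shared reasoning template."""
    return PROMPT_TEMPLATE.format(question=question)


def make_model_inputs(question):
    """Build the student and teacher prompts from one source to avoid mismatches."""
    prompt = build_prompt(question)
    return {"student_prompt": prompt, "teacher_prompt": prompt}


phases = [training_phase(step, cold_start_steps=3) for step in range(5)]
inputs = make_model_inputs("What is 12 * 7?")
```

Building both prompts from the same function is a cheap guard against the teacher silently being scored under a different format than the one the student was sampled with.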

Token‑level rewards are not a panacea; for long‑chain reasoning, code generation, or complex tool‑use tasks, success often hinges on a few critical steps that token‑level alignment alone may not capture.

In summary, OPD’s effectiveness depends less on the teacher’s overall strength and more on the overlap of high‑probability tokens and the teacher’s ability to provide true information gain on the student’s own trajectories.

Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
