Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On‑Policy Distillation (OPD) is widely used for post‑training large language models, but the sampled‑token variant often becomes unstable due to token‑level reward imbalance, teacher‑student signal mismatch on student‑generated prefixes, and tokenizer mismatches. This article analyses the bias‑variance trade‑off, identifies three root failure modes, and proposes a teacher‑top‑K local‑support‑set objective with top‑p rollout and special‑token masking that yields more stable training and better performance on both math and agentic benchmarks.


Background and Motivation

On‑Policy Distillation (OPD) trains a student model on its own generated trajectories while receiving feedback from a stronger teacher model. In long‑horizon settings, the commonly used sampled‑token OPD often diverges: the gradient variance grows with the coupling strength of future rewards, and the learning signal becomes structurally unbalanced.

Bias–Variance Analysis

From a bias‑variance perspective, token‑level OPD drops the future‑reward coupling present in the sequence‑level reverse‑KL estimator. This reduces the worst‑case variance growth from quartic (sequence‑level) to quadratic (token‑level) but introduces bias. A simplified experiment shows that as the discount factor increases, the gradient variance can spike by 2–3 orders of magnitude, making optimization unstable.
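To make the trade‑off concrete, one standard way to write the two estimators is the following (the notation is ours, not taken verbatim from the paper): the sequence‑level reverse‑KL gradient couples each token's score function with the sum of all future log‑ratios, whereas the sampled‑token variant keeps only the current token's term.

```latex
% Sequence-level reverse KL and its REINFORCE-style (reward-to-go) gradient:
\[
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right)
  = \mathbb{E}_{y \sim \pi_\theta}\Bigl[\sum_{t} r_t\Bigr],
\qquad
r_t \;=\; \log \frac{\pi_\theta(y_t \mid y_{<t})}{\pi_T(y_t \mid y_{<t})},
\]
\[
\nabla_\theta \,\mathrm{KL}
  = \mathbb{E}_{y \sim \pi_\theta}\Bigl[\sum_{t}
      \nabla_\theta \log \pi_\theta(y_t \mid y_{<t})
      \sum_{t' \ge t} r_{t'}\Bigr].
\]
% The sampled-token (token-level) variant drops the future-reward coupling
% and keeps only the t' = t term, trading variance for bias:
\[
\hat{g}_{\text{token}}
  = \sum_{t} \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \; r_t .
\]
```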

Three Typical Failure Modes

Highly imbalanced positive/negative sampled‑token rewards: most sampled tokens receive negative log‑ratio rewards, causing the optimizer to overfit to a tiny set of locally “good” tokens (a one‑line derivation of why the rewards skew negative appears after this list).

Unreliable teacher guidance on student‑generated prefixes: when the student drifts into regions rarely visited by the teacher, the teacher’s distribution remains sharp but no longer aligns with the true task objective.

Tokenizer and special‑token mismatches: different tokenizers split the same text differently, so a token that is high‑probability for the teacher may be low‑probability for the student, distorting the single‑token comparison.
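As referenced under the first failure mode, here is the one‑line reason the sampled‑token rewards skew negative, in our notation and under the assumption that the per‑token reward is the teacher/student log‑ratio at the sampled token (i.e., the negative of the KL summand $r_t$ above): its expectation under the student's own sampling distribution is minus a per‑step KL, so it is nonpositive on average.

```latex
\[
\mathbb{E}_{y_t \sim \pi_\theta(\cdot \mid y_{<t})}
  \Bigl[\log \tfrac{\pi_T(y_t \mid y_{<t})}{\pi_\theta(y_t \mid y_{<t})}\Bigr]
  \;=\; -\,\mathrm{KL}\!\bigl(\pi_\theta(\cdot \mid y_{<t}) \,\|\, \pi_T(\cdot \mid y_{<t})\bigr)
  \;\le\; 0 .
\]
% Positive rewards occur only where the teacher assigns strictly higher
% probability than the student, so the average signal is nonpositive.
```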

Proposed Remedy: Teacher Top‑K Local Support‑Set Matching

Instead of comparing a single sampled token, we truncate the reverse‑KL to the teacher's top‑K token set (the local support set) and compute a distribution‑level KL on this subset. The steps are as follows (a minimal code sketch follows the list):

For each decoding prefix, collect the teacher’s top‑K tokens.

Apply support‑set re‑normalisation so that both teacher and student probabilities are renormalised within the subset.

Sample trajectories with top‑p rollout to keep prefixes in high‑probability regions of the student.

Mask special tokens (e.g., EOS variants) to avoid spurious penalties caused by tokenizer differences.
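A minimal PyTorch sketch of steps 1, 2, and 4, assuming per‑position logits from both models on the same student‑generated prefixes; function names, argument names, and the default K are illustrative, not the released implementation:

```python
import torch
import torch.nn.functional as F

def teacher_topk_support_kl(student_logits, teacher_logits, k=20, keep_mask=None):
    """Reverse KL restricted to the teacher's top-K local support set.

    student_logits, teacher_logits: [batch, seq, vocab] logits evaluated on the
    same student-generated prefixes; keep_mask: optional [batch, seq] 0/1 mask
    (e.g. from special-token masking). k = 20 is a placeholder value.
    """
    # Step 1: local support set = the teacher's top-K token ids per position.
    teacher_topk, topk_ids = teacher_logits.topk(k, dim=-1)            # [B, T, K]

    # Gather the student's logits on the same K tokens.
    student_topk = torch.gather(student_logits, -1, topk_ids)          # [B, T, K]

    # Step 2: support-set re-normalisation -- softmax over only the K selected
    # logits, so both distributions sum to one within the subset.
    log_q = F.log_softmax(teacher_topk, dim=-1)    # teacher, renormalised
    log_p = F.log_softmax(student_topk, dim=-1)    # student, renormalised

    # Distribution-level reverse KL on the subset: KL(student || teacher).
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)                   # [B, T]

    # Step 4: special-token masking zeroes the contribution of masked positions.
    if keep_mask is not None:
        kl = kl * keep_mask

    return kl.mean()
```

Top‑p rollout (step 3) only changes how trajectories are sampled; it is sketched under Implementation Details below.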

This objective retains a balanced positive/negative signal, reduces variance, and is less sensitive to tokenizer mismatches.

Implementation Details

Support‑set re‑normalisation is performed on the logits of the selected tokens before the KL computation.

Top‑p rollout uses a probability threshold (default 0.9) during training sampling (see the sketch after this list).

Special‑token masking simply zeroes out the KL contribution of tokens that belong to a predefined special‑token list.
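A combined sketch of the two rollout‑side pieces, assuming PyTorch logits and a Hugging Face‑style tokenizer; `all_special_ids` is the generic tokenizer attribute, and the exact special‑token list is whatever the setup predefines:

```python
import torch

def top_p_filter(logits, top_p=0.9):
    """Nucleus (top-p) filtering for rollout sampling: keep the smallest set of
    tokens whose cumulative probability reaches top_p, mask the rest to -inf."""
    sorted_logits, sorted_ids = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    # Drop a token once the cumulative mass *before* it already exceeds top_p,
    # so the single most likely token is always kept.
    drop = probs.cumsum(dim=-1) - probs > top_p
    sorted_logits = sorted_logits.masked_fill(drop, float("-inf"))
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_ids, sorted_logits)

def special_token_mask(rollout_ids, tokenizer):
    """1 = keep, 0 = masked: zero the KL at positions whose rollout token is
    in the predefined special-token list (e.g. EOS variants)."""
    special_ids = torch.tensor(sorted(set(tokenizer.all_special_ids)),
                               device=rollout_ids.device)
    return ~torch.isin(rollout_ids, special_ids)
```

The returned mask plugs into the `keep_mask` argument of the support‑set KL sketch above, and the filtered logits are what the rollout sampler draws from (e.g., via `torch.multinomial` over their softmax).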

Experimental Evaluation

We implemented the method in the verl‑agent framework and evaluated it in two settings:

Single‑task math (MATH500, AIME24/25, Minerva, OlympiadBench) using a 7B Qwen2.5‑Instruct student and an OpenThinker3‑7B teacher.

Multi‑task (agentic + math) where the student alternates between ALFWorld agentic episodes and the same math dataset.

Key results (pass@1 / average scores) show that the baseline sampled‑token OPD improves over the student but remains far from the teacher. Adding special‑token masking yields modest gains. Our teacher‑top‑K method with top‑p rollout and masking achieves the best scores on both tasks (e.g., a 41.5 average on math and a 97.7 success rate on ALFWorld), surpassing all baselines.

Ablation Studies

We examined the impact of each component:

Removing support‑set re‑normalisation causes training loss to collapse early.

Varying the top‑K size shows that too small a set (<5) destabilises training, while larger sets (>50) yield similar performance.

Disabling top‑p rollout dramatically increases variance and degrades final performance.

Different KL expectation definitions (teacher top‑K, student top‑K, teacher top‑K + sampled token) were compared; teacher top‑K performed best in multi‑task settings.

Discussion and Limitations

The proposed objective is still a truncated KL, so it introduces bias relative to the full‑vocabulary reverse‑KL. Moreover, a gap remains between the student and teacher when their architectures differ significantly. Training‑inference mismatch (student rollout policy vs. training distribution) may also introduce importance‑weighting issues.

Future work includes exploring top‑p truncation as an alternative to top‑K, investigating OPD’s role in continual learning, and combining our approach with other stabilization techniques such as off‑policy correction and reward‑hacking mitigation.

Conclusion

OPD’s appeal lies in aligning the student with its own trajectories under teacher supervision, but the sampled‑token variant suffers from three intertwined failure modes. Replacing the single‑token reward with a teacher‑top‑K local‑support‑set KL, together with top‑p rollout and special‑token masking, yields a more stable and effective training signal, delivering state‑of‑the‑art results on both math reasoning and agentic tasks.

English blog: http://yuqianfu.notion.site/revisiting-opd
Paper: https://huggingface.co/papers/2603.25562
Code: https://github.com/hhh675597/revisiting_opd
Figure: Method overview
Figure: Gradient variance comparison in the simplified experiment
Tags: large language models, On-Policy Distillation, training stability, OPD, teacher-top-K, token-level reward