Machine Learning Algorithms & Natural Language Processing
Apr 14, 2026 · Artificial Intelligence

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On-Policy Distillation (OPD) is widely used for post-training large language models, but its sampled-token variant often becomes unstable due to token-level reward imbalance, teacher-student signal mismatch on student-generated prefixes, and tokenizer mismatch. This article analyses the bias-variance trade-off, identifies three root failure modes, and proposes a teacher-top-K local-support-set objective, combined with top-p rollout and special-token masking, that yields more stable training and better performance on both math and agentic benchmarks.
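The local-support-set idea in the summary above can be sketched in a few lines: restrict the distillation KL to the teacher's top-K token ids and renormalize both distributions over that support. This is a minimal illustrative sketch, not the article's implementation; the function name, the choice of K, and the renormalization are assumptions.

```python
import math

def topk_support_kl(teacher_logits, student_logits, k=4):
    """KL(teacher || student) restricted to the teacher's top-k support.

    Both inputs are plain lists of logits over the same vocabulary.
    Probabilities are renormalized over the top-k ids so each side
    sums to 1 on the restricted support (an illustrative choice).
    """
    # The teacher's top-k token ids define the local support set.
    support = sorted(range(len(teacher_logits)),
                     key=lambda i: teacher_logits[i], reverse=True)[:k]

    def renorm(logits):
        exps = [math.exp(logits[i]) for i in support]
        z = sum(exps)
        return [e / z for e in exps]

    p = renorm(teacher_logits)  # teacher probs on the support set
    q = renorm(student_logits)  # student probs on the support set
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Because tokens outside the teacher's top-K are ignored, noisy low-probability mass from student-generated prefixes cannot dominate the per-token signal, which is one intuition behind the stability claim.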

OPD · On-Policy Distillation · Large Language Models
32 min read
Machine Learning Algorithms & Natural Language Processing
Feb 22, 2026 · Artificial Intelligence

What Is On-Policy Distillation? A Deep Dive into On-Policy and Self-Distillation

The article explains On-Policy Distillation and derives its forward and reverse KL gradients, then introduces Self-Distillation, in which the policy serves as its own teacher. It discusses practical implementation tricks such as extra-knowledge injection and EMA or trust-region teacher stabilization, and highlights benefits including reduced catastrophic forgetting, fewer "Aha" moments, and a narrower train-test gap, especially for larger models.
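The forward/reverse KL distinction mentioned above can be made concrete with a small sketch. This is an illustrative pure-Python version, not code from the article; the helper names and toy logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a plain list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_logits, teacher_logits):
    # KL(q_student || p_teacher): mode-seeking; the on-policy choice,
    # since the expectation is taken under the student's own samples.
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def forward_kl(student_logits, teacher_logits):
    # KL(p_teacher || q_student): mass-covering; the usual choice in
    # off-policy distillation on teacher-generated data.
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Both divergences vanish when student and teacher agree and are nonnegative otherwise; they differ in which side supplies the expectation, which is exactly the on-policy versus off-policy distinction the article develops.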

Catastrophic Forgetting · EMA · KL Divergence
6 min read
HyperAI Super Neural
Jan 9, 2026 · Artificial Intelligence

How HY-MT1.5 Achieves 1 GB Mobile Translation with a 1.8B Model

The article explains how Tencent's open-source HY-MT1.5 tackles the high-cost, large-parameter barrier of neural machine translation. The 1.8B-parameter model runs in roughly 1 GB of RAM, processes 50 tokens in 0.18 s, supports 33 languages, and uses on-policy distillation to retain top-tier accuracy; the article also walks through an online demo step by step and points to free compute credits for new users.

HY-MT1.5 · On-Policy Distillation · Tencent
5 min read
DataFunTalk
Oct 30, 2025 · Artificial Intelligence

How On-Policy Distillation Cuts LLM Training Cost by 90%

Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to tenfold, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.

Model Efficiency · On-Policy Distillation · Knowledge Distillation
15 min read