Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
Apr 14, 2026 · Artificial Intelligence

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On‑Policy Distillation (OPD) is widely used for post‑training large language models, but the sampled‑token variant often becomes unstable due to token‑level reward imbalance, teacher‑student signal mismatch on student‑generated prefixes, and tokenizer mismatches; this article analyses the bias‑variance trade‑off, identifies three root failure modes, and proposes a teacher‑top‑K local‑support‑set objective with top‑p rollout and special‑token masking that yields more stable training and better performance on both math and agentic benchmarks.

OPDOn-Policy Distillationlarge language models
0 likes · 32 min read
Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix
Machine Heart
Machine Heart
Apr 14, 2026 · Artificial Intelligence

Why Binary Success Rate Is Obsolete: Introducing PRM-as-a-Judge for Dense Evaluation of Embodied Tasks

The article critiques binary success rate for long‑horizon robotic tasks, proposes the PRM-as-a-Judge framework with a potential‑based progress signal and the three‑layer OPD metric suite, validates it on the RoboPulse benchmark, and shows how it yields fine‑grained, diagnostic insights into policy performance.

OPDRoboPulsedense metrics
0 likes · 20 min read
Why Binary Success Rate Is Obsolete: Introducing PRM-as-a-Judge for Dense Evaluation of Embodied Tasks