Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix
On-policy distillation (OPD) is widely used for post-training large language models, but its sampled-token variant often becomes unstable. This article analyses the bias-variance trade-off behind that instability, identifies three root failure modes (token-level reward imbalance, teacher-student signal mismatch on student-generated prefixes, and tokenizer mismatches), and proposes a teacher-top-K local-support-set objective with top-p rollout and special-token masking that trains more stably and performs better on both math and agentic benchmarks.
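
To make the proposed objective concrete, here is a minimal sketch (not the authors' implementation) of a teacher-top-K local-support-set distillation loss on student-sampled rollouts, with special-token masking; the function and parameter names (`top_k`, `special_token_ids`, etc.) are illustrative assumptions, and the exact normalisation and KL direction may differ from the article's method.

```python
# Sketch of a top-K local-support distillation loss; assumes teacher and student
# share a tokenizer and that logits are already aligned to the same rollout.
import torch
import torch.nn.functional as F


def topk_local_support_kl(student_logits, teacher_logits, sampled_ids,
                          top_k=32, special_token_ids=()):
    """student_logits, teacher_logits: [batch, seq, vocab];
    sampled_ids: [batch, seq] student-sampled token ids (used only for masking)."""
    # Restrict the support set to the teacher's top-K tokens at each position.
    topk_vals, topk_idx = teacher_logits.topk(top_k, dim=-1)        # [B, T, K]
    student_sel = torch.gather(student_logits, -1, topk_idx)        # [B, T, K]

    # Renormalise both distributions over the shared top-K support.
    p_teacher = F.softmax(topk_vals, dim=-1)
    log_p_student = F.log_softmax(student_sel, dim=-1)

    # Forward KL(teacher || student) per position, over the local support set.
    kl = (p_teacher * (torch.log(p_teacher + 1e-9) - log_p_student)).sum(-1)

    # Mask positions whose sampled token is a special token (e.g. EOS or tool markers).
    mask = torch.ones_like(kl)
    for tok in special_token_ids:
        mask = mask * (sampled_ids != tok).float()
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)
```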
