Why the Log‑Ratio Reward in OPD Is Fundamentally Flawed and Should Be Replaced

The paper reveals that the unbounded log‑ratio reward used in vanilla On‑Policy Distillation causes extreme gradient variance, early‑stage instability, and poor final performance, and demonstrates that replacing the log with a bounded Box‑Cox power transform (PowerOPD) resolves these issues while improving accuracy, efficiency, and memory usage.

Box-CoxLarge Language ModelsOPD

0 likes · 16 min read