Why the Log‑Ratio Reward in OPD Is Fundamentally Flawed and Should Be Replaced

The paper reveals that the unbounded log‑ratio reward used in vanilla On‑Policy Distillation causes extreme gradient variance, early‑stage instability, and poor final performance, and demonstrates that replacing the log with a bounded Box‑Cox power transform (PowerOPD) resolves these issues while improving accuracy, efficiency, and memory usage.

Machine Learning Algorithms & Natural Language Processing

Background

On‑Policy Distillation (OPD) has become a standard post‑training component for large language models. The vanilla formulation computes a token‑level reward, the teacher‑student log‑probability ratio, only for the tokens the student actually samples. This avoids computing the full‑vocabulary KL divergence and instead yields a dense, per‑token reward that is optimised with policy gradients.
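The sampled‑token reward described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `log_ratio_rewards` and the probabilities are hypothetical, and the sign convention (teacher minus student) is taken from the sign‑consistency property stated later in this article.

```python
import math


def log_ratio_rewards(teacher_logprobs, student_logprobs):
    """Per-token reward log p_teacher(y_t) - log p_student(y_t).

    Both inputs hold log-probabilities of the tokens the student sampled,
    one entry per rollout position. (Hypothetical helper, for illustration.)
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Made-up probabilities for a 4-token rollout sampled from the student.
student_lp = [math.log(0.60), math.log(0.30), math.log(0.50), math.log(0.20)]
teacher_lp = [math.log(0.70), math.log(0.30), math.log(1e-9), math.log(0.10)]

rewards = log_ratio_rewards(teacher_lp, student_lp)
for position, reward in enumerate(rewards):
    print(f"token {position}: reward = {reward:+.3f}")
```

The third token, to which the teacher assigns almost no probability, receives a reward far larger in magnitude than the other three combined. This is the kind of leverage the diagnosis below is concerned with.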

Diagnosis of Vanilla OPD

Empirical analysis on Qwen3‑4B→Qwen3‑1.7B (MATH‑500) training curves shows three concrete pathologies of the log‑ratio reward:

Unbounded variance: reward tails reach around –50, giving rare tokens a leverage dozens of times larger than normal tokens.

Early‑rollout concentration: extreme values cluster in the first few rollout tokens, destabilising the entire generation trajectory.

Persistence: extreme positive and negative rewards do not decay over training; they appear throughout the process.

Post‑hoc fixes such as clipping, tanh, or z‑score normalisation fail because they act after the log has already amplified differences between low‑probability tokens. Clipping and tanh compress the reward too late, and z‑score normalisation can even flip the reward sign, reversing the learning direction.
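A small numeric sketch makes these failure modes concrete. The reward values and helper names below are invented for illustration, and the three functions are generic textbook versions of clipping, tanh squashing, and z‑score normalisation, not the exact baselines evaluated in the paper.

```python
import math
import statistics


def clip_rewards(rewards, bound):
    """Hard-clip every reward into [-bound, bound]."""
    return [max(-bound, min(bound, r)) for r in rewards]


def tanh_rewards(rewards):
    """Squash every reward into (-1, 1) with tanh."""
    return [math.tanh(r) for r in rewards]


def z_score_rewards(rewards):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Made-up log-ratio rewards: two ordinary tokens and two heavy-tail tokens.
raw = [-0.2, 0.3, -8.0, -30.0]

clipped = clip_rewards(raw, bound=5.0)
squashed = tanh_rewards(raw)
normalised = z_score_rewards(raw)

print("raw       ", raw)
print("clipped   ", clipped)
print("tanh      ", squashed)
print("z-score   ", normalised)
```

In this toy batch, clipping and tanh make the −8 and −30 tokens almost indistinguishable, while z‑scoring shifts the mean so far negative that the mildly negative first token ends up with a positive reward.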

Desired Reward Properties

The authors formalise two necessary properties for a sound OPD reward:

P1 – Boundedness: the reward must have finite limits to prevent catastrophic updates from rare tokens.

P2 – Sign consistency: higher teacher probability should yield positive reward, higher student probability should yield negative reward, and equality should give zero.

While the log‑ratio satisfies P2 (the log is monotonic), it violates P1 because it diverges as either probability approaches zero.
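Written out, the two properties and the failure of the log‑ratio look roughly as follows. The notation is an assumption made for this sketch rather than the paper's own: p_T and p_S denote the teacher and student probabilities of the sampled token, and r is a generic reward function.

```latex
% Notation assumed for illustration: p_T, p_S \in (0, 1] are the teacher and
% student probabilities of the sampled token; r is a candidate reward.
\begin{align*}
r_{\log}(p_T, p_S) &= \log p_T - \log p_S \\
\text{P1 (boundedness):}\quad & \exists\, B < \infty \ \text{such that}\ |r(p_T, p_S)| \le B \ \text{for all } p_T, p_S \\
\text{P2 (sign consistency):}\quad & \operatorname{sign} r(p_T, p_S) = \operatorname{sign}(p_T - p_S) \\
\text{P1 fails for } r_{\log}:\quad & \lim_{p_T \to 0^+} r_{\log}(p_T, p_S) = -\infty \ \text{for any fixed } p_S > 0
\end{align*}
```

For scale, a log‑ratio reward of −50 corresponds to a probability ratio p_T / p_S = e^{−50} ≈ 1.9 × 10^{−22}.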

PowerOPD: Replacing Log with Box‑Cox

PowerOPD keeps the “transform‑then‑subtract” structure but substitutes the log with a Box‑Cox power transformation, which is monotonic and bounded. The resulting reward is naturally limited and sign‑consistent. The authors provide a single‑line formula (omitted here for brevity) that can be dropped into existing OPD pipelines without changing rollouts, teacher scoring, or the policy‑gradient framework.
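Since the article omits the formula, the following is only a sketch of what a "transform‑then‑subtract" reward could look like with the textbook Box‑Cox transform, (p^α − 1)/α. The function names and the exact parameterisation are assumptions and are not confirmed by the source. For α > 0 and probabilities in [0, 1], this form is bounded in [−1/α, 1/α] and has the sign of p_T − p_S.

```python
import math


def box_cox(p, alpha):
    """Textbook Box-Cox transform; tends to log(p) as alpha -> 0.

    Assumed form, for illustration only. Requires p > 0 when alpha == 0.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative for a bounded reward")
    if alpha == 0:
        return math.log(p)
    return (p ** alpha - 1.0) / alpha


def power_reward(p_teacher, p_student, alpha):
    """Transform each probability, then subtract (teacher minus student)."""
    return box_cox(p_teacher, alpha) - box_cox(p_student, alpha)


alpha = 0.5  # hypothetical value, not taken from the paper
for p_t, p_s in [(0.7, 0.6), (0.3, 0.3), (1e-9, 0.5), (0.1, 0.2)]:
    log_ratio = math.log(p_t) - math.log(p_s)
    bounded = power_reward(p_t, p_s, alpha)
    print(f"p_T={p_t:<8g} p_S={p_s:<4g} log-ratio={log_ratio:+8.3f} power={bounded:+.3f}")
```

Under these assumptions the near‑zero teacher probability that drives the log‑ratio below −20 produces a power reward whose magnitude cannot exceed 1/α = 2, while the sign of every reward is unchanged.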

Experimental Validation

Four Qwen3 teacher‑student configurations (0.6B/1.7B students × 4B/8B teachers) and six math‑reasoning benchmarks were evaluated.

PowerOPD outperforms vanilla OPD by an average of +4.47 Avg@8 / +4.06 Pass@8 across settings.

Against the strongest post‑hoc methods, PowerOPD gains +3.01 Avg@8 / +3.54 Pass@8.

Compared with full‑vocab OPD, PowerOPD adds +2.59 Avg@8 / +8.90 Pass@8 while saving 59.2 % wall‑clock time and 23.1 % GPU memory.

Peak improvements on the AMC23 benchmark reach +16.75 Avg@8 and +15.00 Pass@8.

Training dynamics show that PowerOPD eliminates the early accuracy dip, stabilises response length, and reduces gradient‑norm spikes from ~1000 to a steady 0.25‑0.35 range, a reduction of roughly 3000×.

Interpretability

Heatmaps of the reward surface illustrate that PowerOPD acts as a “probability‑region selector”: larger α values suppress low‑probability noise and focus learning on high‑probability tokens, explaining the observed shorter, more stable responses.
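The selector effect can be illustrated numerically, again under the assumption that the reward has the Box‑Cox difference form (p_T^α − p_S^α)/α, which the source does not confirm. The two probability pairs below are made up: one sits in the low‑probability region, the other in the high‑probability region.

```python
def power_reward(p_teacher, p_student, alpha):
    """Assumed Box-Cox difference reward for alpha > 0 (illustrative only)."""
    return (p_teacher ** alpha - p_student ** alpha) / alpha


low_pair = (0.01, 0.05)   # teacher and student both assign low probability
high_pair = (0.90, 0.50)  # teacher and student both assign high probability
alphas = [0.5, 1.0, 2.0]

ratios = []
for alpha in alphas:
    low = abs(power_reward(*low_pair, alpha))
    high = abs(power_reward(*high_pair, alpha))
    ratios.append(low / high)
    print(f"alpha={alpha}: |low|={low:.4f} |high|={high:.4f} low/high={low / high:.4f}")
```

With these numbers the weight of the low‑probability pair relative to the high‑probability pair shrinks as α grows, which is consistent with the heatmap reading above.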

Efficiency Gains

Because PowerOPD only needs sampled‑token probabilities, it avoids full‑vocab KL computation. Measured resource usage (full‑vocab OPD → PowerOPD):

TFLOPs per update: 402.7 → 346.6 (13.9 % reduction)

Wall‑clock time per step: 22.14 s → 9.03 s (59.2 % reduction)

Peak GPU memory: 78.99 GiB → 60.72 GiB (23.1 % reduction)
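The reported percentages follow directly from the before and after figures, as this short check shows. The helper name `pct_reduction` is just for illustration.

```python
def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


measurements = {
    "TFLOPs per update": (402.7, 346.6),
    "Wall-clock time per step (s)": (22.14, 9.03),
    "Peak GPU memory (GiB)": (78.99, 60.72),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {pct_reduction(before, after):.1f} % reduction")

# The wall-clock figures also imply a per-step speed-up factor.
speedup = 22.14 / 9.03
print(f"Speed-up per step: {speedup:.2f}x")
```

The wall‑clock saving is much larger than the FLOP saving, so the raw FLOP count alone does not account for the faster steps.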

Takeaway

The instability of OPD is not an inherent on‑policy issue nor a tuning problem; it stems from using an unbounded log‑ratio as the reward. PowerOPD provides a clean, bounded replacement that yields more stable training, higher final accuracy, and substantial computational savings.

Training curves: PowerOPD vs vanilla OPD vs full‑vocab OPD
Comparison of training pathology and post‑hoc methods
Reward pathology: heavy negative tail, early concentration, persistence
Heatmap of log‑ratio vs PowerOPD rewards for different α
Training dynamics for different α values
Gradient norm comparison: PowerOPD vs vanilla OPD
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
