Uncovering the Privilege Illusion in OPD Distillation and How DOPD Solves It

The article identifies the hidden “privilege illusion” that degrades on‑policy distillation when privileged information is injected, and introduces Dual On‑policy Distillation (DOPD), a dynamic two‑stream approach that separates true ability gaps from information gaps, achieving superior performance and stability across LLM and VLM benchmarks.

PaperAgent

In large‑model research, reproducing the capabilities of a large model with a smaller one remains a hot topic. On‑policy Distillation (OPD) lets a student model learn from a teacher’s feedback on trajectories the student itself generates, which mitigates distribution shift and provides dense learning signals. It has become a mainstream post‑training method.
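As a rough illustration of the on‑policy signal described above, the sketch below scores a student‑sampled trajectory with a per‑token reverse‑KL estimate, log p_student minus log p_teacher, evaluated on the tokens the student itself generated. This is one common way to set up OPD and not necessarily the exact objective used in the paper; the function name and the toy probabilities are hypothetical.

```python
import math


def per_token_reverse_kl(student_logprobs, teacher_logprobs):
    """Single-sample reverse-KL estimate for each student-sampled token.

    Both lists hold log-probabilities of the *same* tokens, which were
    sampled from the student (that is what makes the signal on-policy).
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("sequences must have the same length")
    return [s - t for s, t in zip(student_logprobs, teacher_logprobs)]


# Toy trajectory of three student-sampled tokens (made-up probabilities).
student_lp = [math.log(0.5), math.log(0.4), math.log(0.9)]
teacher_lp = [math.log(0.5), math.log(0.1), math.log(0.9)]

signal = per_token_reverse_kl(student_lp, teacher_lp)
# Only the second token carries a penalty: the teacher finds it unlikely.
```

Because every position gets its own value, the student receives a dense signal rather than a single sequence‑level reward.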

A hidden bottleneck emerges when researchers inject "privileged information"—such as step‑by‑step reasoning hints or structured annotations—into the teacher. Instead of improving transferable ability, the student often learns shortcuts based on this privileged data, leading to poorer generalization.

Researchers from the National University of Singapore, The Chinese University of Hong Kong, Peking University and JD Exploration Institute propose Dual On‑policy Distillation (DOPD), a privilege‑aware dual‑stream framework. DOPD dynamically assigns different supervision sources, intensities, and forms to each token, directly mitigating the "privilege illusion" and substantially improving effectiveness, stability, and generalization. In experiments on both LLM and VLM benchmarks, DOPD outperforms existing methods.

1. The Overlooked Pitfall: Privilege Illusion in Distillation

Traditional OPD assumes that a stronger teacher always provides more learnable knowledge, encouraging the injection of privileged information (e.g., step‑wise hints for reasoning tasks or bounding boxes for vision tasks) to raise the theoretical ceiling of distillation.

However, this introduces a fatal flaw: performance gains from privileged data do not equate to transferable ability gains. The research team discovers that when the teacher possesses information unavailable to the student, the performance gap mixes two distinct components:

True ability gap: the teacher’s inherent advantages in reasoning, decision‑making, and knowledge, which can be distilled.

Information‑asymmetry gap: advantages that come solely from privileged inputs. The student cannot access those inputs, so it can only fit the teacher’s surface outputs.

The team names this conflation "Privilege Illusion". Distilling without distinguishing these gaps causes the student to over‑fit privileged shortcuts, leading to entropy collapse, reduced exploration, poorer generalization, and sometimes worse performance than a baseline without privileged data.
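One way to make the two components concrete, under an assumed notation that is not taken from the paper, is to write the per‑token log‑probability gap between a teacher that sees privileged input $p$ and a student that does not as a telescoping sum. Here $x$ is the prompt, $y_t$ the token at position $t$, and $\pi_T$, $\pi_S$ the teacher and student policies.

```latex
\underbrace{\log \pi_T(y_t \mid x, p, y_{<t}) - \log \pi_S(y_t \mid x, y_{<t})}_{\text{observed gap}}
=
\underbrace{\log \pi_T(y_t \mid x, p, y_{<t}) - \log \pi_S(y_t \mid x, p, y_{<t})}_{\text{ability gap}}
+
\underbrace{\log \pi_S(y_t \mid x, p, y_{<t}) - \log \pi_S(y_t \mid x, y_{<t})}_{\text{information-asymmetry gap}}
```

The identity holds by construction. Reading the first term as ability and the second as information asymmetry is an interpretation offered here for intuition, not a formula quoted from the authors.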

Moreover, token‑level supervision signals are inherently uneven: only a few tokens carry high‑value decision or reasoning information, while most tokens (connectors, filler words) provide little supervisory value and are easily dominated by privileged cues. Existing OPD methods treat all tokens uniformly, amplifying the negative impact of the privilege illusion.

2. DOPD: Privilege‑Aware Dual‑Stream Dynamic Distillation

Since a single supervision source and uniform intensity cannot solve the problem, DOPD introduces a dual‑stream approach: both a "privileged teacher" and a "privileged student" provide signals, and each token’s "privilege advantage gap" determines how supervision is routed.

Step 1: Distinguish Ability vs. Information via Privilege Advantage Gap

To achieve precise distillation, the method first measures, for each token, whether the teacher’s advantage stems from true ability or from privileged information. Both teacher and student receive the same privileged input, and the log‑probability difference on that token is defined as the Privilege Advantage Gap.

Large gap: even with identical privileged input, the teacher outperforms the student, indicating a genuine ability gap—high‑value learning points.

Small gap: performances are close, suggesting the teacher’s edge mainly comes from information asymmetry—limited transferable ability.
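A minimal sketch of the measurement in Step 1, assuming the gap is taken as teacher log‑probability minus student log‑probability with both models conditioned on the same privileged input. The sign convention, the function names, and the threshold of 1.0 are assumptions made for illustration; the article does not specify them.

```python
def privilege_advantage_gap(teacher_logprobs, student_logprobs):
    """Per-token gap between teacher and student log-probabilities.

    Assumption: both lists were computed WITH the same privileged input
    in context, so information asymmetry is removed from the comparison.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("sequences must have the same length")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def high_gap_positions(gaps, threshold):
    """Indices of tokens whose gap meets a (hypothetical) threshold."""
    return [i for i, g in enumerate(gaps) if g >= threshold]


# Toy log-probabilities for four tokens (made-up values).
teacher_lp = [-0.1, -0.2, -0.3, -0.1]
student_lp = [-0.2, -2.5, -0.4, -0.1]

gaps = privilege_advantage_gap(teacher_lp, student_lp)
ability_tokens = high_gap_positions(gaps, threshold=1.0)
```

In this toy case only the second token stands out: the student still assigns it low probability even with the privileged input, which is the pattern the article associates with a genuine ability gap.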

Ablation experiments show that removing high‑gap tokens causes a sharp drop in distillation performance, while removing low‑gap tokens has a negligible effect, confirming that the gap locates the tokens that carry true ability.

Step 2: Four Token Categories, Four Customized Distillation Strategies

Building on the privilege advantage gap, DOPD also incorporates teacher and student confidence to partition tokens into four categories, each matched with a distinct supervision source, loss function, and granularity:

High gap + teacher high confidence: core ability tokens. Apply full‑vocab Jensen‑Shannon divergence for strong teacher distillation, maximizing ability transfer.

Low gap + both high confidence: consensus tokens dominated by information gap. Use lightweight teacher distillation with top‑K reverse KL to conservatively absorb useful privileged knowledge while avoiding over‑fitting shortcuts.

High gap + student high confidence: student‑exploratory tokens. Apply top‑K reverse KL for lightweight self‑distillation, preserving student exploration while maintaining policy consistency.

Low gap + both low confidence: uncertain edge tokens. Apply a small‑weight student self‑regularization, avoiding forced imitation of uncertain teacher signals and using the privileged student as a stable anchor.
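The four categories above can be sketched as a small routing function. Everything numeric here is hypothetical: the thresholds, the loss weights, the use of a single confidence score per model, the precedence when a token matches more than one rule, and the fallback for combinations the article does not list are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    source: str    # which stream supplies the target distribution
    loss: str      # divergence named in the article
    weight: float  # hypothetical relative loss weight


ROUTES = {
    "core_ability": Route("privileged_teacher", "full_vocab_jsd", 1.0),
    "consensus": Route("privileged_teacher", "topk_reverse_kl", 0.5),
    "student_exploratory": Route("privileged_student", "topk_reverse_kl", 0.5),
    "uncertain_edge": Route("privileged_student", "self_regularization", 0.1),
}


def route_token(gap, teacher_conf, student_conf, gap_thr=1.0, conf_thr=0.7):
    """Map one token to a category name (thresholds are assumptions)."""
    high_gap = gap >= gap_thr
    teacher_high = teacher_conf >= conf_thr
    student_high = student_conf >= conf_thr
    if high_gap and teacher_high:
        return "core_ability"
    if high_gap and student_high:
        return "student_exploratory"
    if not high_gap and teacher_high and student_high:
        return "consensus"
    # Low gap + both low confidence, plus (by assumption) any
    # combination the article does not describe.
    return "uncertain_edge"


category = route_token(gap=2.0, teacher_conf=0.9, student_conf=0.2)
route = ROUTES[category]
```

Keeping the category lookup separate from the routing rule makes it easy to change a loss or weight without touching the decision logic.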

This "dual source + dynamic routing" mechanism adapts supervision along three dimensions: intensity, target, and granularity. High‑value tokens are learned deeply while low‑value tokens receive minimal guidance, so DOPD keeps the ability boost that privileged information provides while mitigating the privilege illusion.
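To illustrate the difference in granularity between the two divergences named above, here is a plain‑Python sketch of a full‑vocabulary Jensen‑Shannon divergence and a top‑K reverse KL. How the paper selects and normalizes the top‑K set is not stated in the article; this sketch assumes the K tokens with the highest student probability, renormalized on both sides.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Symmetric divergence over the full vocabulary, bounded by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topk_reverse_kl(student, target, k):
    """KL(student || target) restricted to the student's top-k tokens.

    Assumption: both distributions are renormalized over that subset.
    """
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    s_mass = sum(student[i] for i in top)
    t_mass = max(sum(target[i] for i in top), 1e-12)  # avoid division by zero
    s = [student[i] / s_mass for i in top]
    t = [target[i] / t_mass for i in top]
    return kl(s, t)


student = [0.6, 0.3, 0.1]
teacher = [0.2, 0.3, 0.5]

strong_signal = jensen_shannon(student, teacher)         # uses all 3 tokens
light_signal = topk_reverse_kl(student, teacher, k=2)    # uses only 2 tokens
```

The top‑K variant ignores the tail of the vocabulary entirely, which is one plausible reason it acts as the "lightweight" option in the routing scheme.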

3. Comprehensive Multi‑Scenario Validation

The team conducts extensive experiments in both LLM and VLM domains, covering general, reasoning, coding, and visual understanding tasks.

1. Primary Task Performance: Significant Gains Over All Baselines

LLM (Qwen3‑8B → Qwen3‑1.7B): Across eight benchmarks, DOPD achieves an average score of 51.4, a 7.5‑point improvement over vanilla OPD and 4.4 to 5.3 points over strong baselines such as ExOPD, Uni‑OPD, and EOPD. The teacher‑student performance gap shrinks by 89.8%, and on several hard tasks DOPD even surpasses the original teacher.

VLM (Qwen3‑VL‑8B → Qwen3‑VL‑2B): Across eight visual benchmarks, DOPD averages 58.4, a 6.0‑point gain over vanilla OPD and 2.1 to 2.8 points ahead of vision‑specific baselines. The gap reduction reaches 69.2%.

2. Scalability: Larger Teacher‑Student Gaps Amplify Benefits

Tests on five teacher‑student size pairs show DOPD’s gains are robust regardless of scale. In extreme size gaps (e.g., 8B → 0.6B), vanilla OPD yields only a 3.5‑point lift, while DOPD still adds 14.1 points, shrinking the gap by 53% and addressing the "strong teacher can’t teach tiny student" issue.
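For readers who want to check gap‑reduction figures like the ones above, the sketch below shows one plausible way to compute them: the distilled student's gain divided by the initial teacher‑student gap. The article does not give the exact formula, so this definition is an assumption, and the scores in the example are invented round numbers rather than results from the paper.

```python
def gap_closed(base_student, distilled_student, teacher):
    """Fraction of the initial teacher-student gap closed by distillation.

    Assumed definition: (distilled - base) / (teacher - base).
    """
    initial_gap = teacher - base_student
    if initial_gap <= 0:
        raise ValueError("teacher must score higher than the base student")
    return (distilled_student - base_student) / initial_gap


# Invented scores for illustration only.
fraction = gap_closed(base_student=40.0, distilled_student=50.0, teacher=60.0)
percent = 100 * fraction
```

Under this definition a value above 100% would mean the distilled student scores higher than the teacher.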

3. Comprehensive Ability: Stability, Continual Learning, Out‑of‑Distribution Generalization

Training stability: Entropy remains healthy throughout training, and DOPD avoids the entropy collapse seen in self‑distillation methods. It also reaches in only 80 steps the performance that other methods need 200 steps to achieve.

Continual learning: Across three learning stages, new abilities grow steadily while forgetting of old abilities is minimal, so capabilities accumulate incrementally.

OOD generalization: On unseen tasks, DOPD retains a clear advantage over baselines, demonstrating stronger cross‑task robustness.
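The training‑stability point above refers to the entropy of the student's token distributions. A minimal sketch of how such a quantity could be tracked is shown below; the collapse floor of 0.05 nats and the helper names are hypothetical and are not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions):
    """Average entropy over the token positions of a batch."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def has_collapsed(entropy_history, floor=0.05):
    """Flag collapse when the latest mean entropy falls below a floor."""
    return bool(entropy_history) and entropy_history[-1] < floor


# Toy distributions: a spread-out policy versus a near-deterministic one.
healthy = mean_entropy([[0.5, 0.5], [0.4, 0.3, 0.3]])
peaked = mean_entropy([[0.999, 0.001], [1.0, 0.0]])

history = [healthy, peaked]
collapsed = has_collapsed(history)
```

A steadily falling curve toward zero indicates the policy is becoming deterministic and has stopped exploring, which is the failure mode the article describes.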

DOPD: Dual On-policy Distillation
https://arxiv.org/abs/2606.30626