SFT, DAgger, Offline RL, and OPD: Four Methods Mapped onto a Single 2×2 Grid

The paper shows that SFT, DAgger, offline RL, and OPD are the four combinations of two independent choices: prefix source (teacher vs. student) and KL direction (forward vs. reverse). This framing exposes three trade‑offs (KL direction, prefix source, and training length) and motivates two remedies, KL mixing and an entropy‑gated length curriculum. Compared with fixed 4096‑token training, the curriculum improves Avg@k by 3.6 points, raises Pass@k by up to 5.8 points, and cuts average response length by roughly a factor of three.

Machine Learning Algorithms & Natural Language Processing

Jun 16, 2026

Design space of LLM distillation

The paper identifies two independent design dimensions that determine the behavior of distillation + RL pipelines:

Prefix source: teacher‑generated trajectories (off‑policy) or student‑generated rollouts (on‑policy).

KL direction: forward KL (expectation over the teacher distribution) or reverse KL (expectation over the student distribution).

Decoupling these dimensions yields a 2×2 matrix whose four cells correspond exactly to four classic learning paradigms:

Teacher prefix + forward KL → off‑policy supervised fine‑tuning (SFT).

Student prefix + forward KL → DAgger‑style on‑policy SFT.

Teacher prefix + reverse KL → offline RL‑style distillation.

Student prefix + reverse KL → on‑policy distillation (OPD).
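The grid above can be written down as a small lookup table, together with the two KL directions evaluated at a single token position. This is a minimal sketch: the names `Prefix`, `KL`, `GRID`, and `token_kl`, and the toy three-token distributions, are illustrative assumptions and do not come from the paper.

```python
import math
from enum import Enum


class Prefix(Enum):
    TEACHER = "teacher"  # off-policy: prefixes come from teacher trajectories
    STUDENT = "student"  # on-policy: prefixes come from student rollouts


class KL(Enum):
    FORWARD = "forward"  # expectation over the teacher distribution
    REVERSE = "reverse"  # expectation over the student distribution


# The four cells of the 2x2 grid, as listed in the article.
GRID = {
    (Prefix.TEACHER, KL.FORWARD): "SFT",
    (Prefix.STUDENT, KL.FORWARD): "DAgger-style on-policy SFT",
    (Prefix.TEACHER, KL.REVERSE): "offline RL-style distillation",
    (Prefix.STUDENT, KL.REVERSE): "OPD",
}


def token_kl(teacher_probs, student_probs, direction):
    """KL between next-token distributions at one position (hypothetical helper)."""
    if direction is KL.FORWARD:
        p, q = teacher_probs, student_probs  # KL(teacher || student)
    else:
        p, q = student_probs, teacher_probs  # KL(student || teacher)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy distributions over a three-token vocabulary (made up for illustration).
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
forward = token_kl(teacher, student, KL.FORWARD)
reverse = token_kl(teacher, student, KL.REVERSE)
```

The two directions generally give different values for the same pair of distributions, which is why the choice of direction is a genuine design axis rather than a notational detail.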

Gradient‑level interpretation

Using the token‑level KL objective, the gradient of forward KL reduces to the standard cross‑entropy used in SFT, while the gradient of reverse KL becomes the REINFORCE policy‑gradient with a dense token‑level reward equal to the teacher‑student log‑ratio. Thus the KL direction alone decides whether training follows a supervised‑learning or a reinforcement‑learning update.
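The claim can be checked in a few lines for a single token position. The notation here is an assumption made for this sketch, not the paper's: $p_T(\cdot \mid s)$ is the teacher's next-token distribution given prefix $s$, $\pi_\theta(\cdot \mid s)$ is the student's, and the prefix is treated as fixed.

```latex
\begin{align*}
\nabla_\theta \,\mathrm{KL}\big(p_T \,\|\, \pi_\theta\big)
  &= \nabla_\theta \sum_{v} p_T(v \mid s)\,\log\frac{p_T(v \mid s)}{\pi_\theta(v \mid s)}
   = -\sum_{v} p_T(v \mid s)\,\nabla_\theta \log \pi_\theta(v \mid s), \\[4pt]
\nabla_\theta \,\mathrm{KL}\big(\pi_\theta \,\|\, p_T\big)
  &= \sum_{v} \nabla_\theta \pi_\theta(v \mid s)\,\log\frac{\pi_\theta(v \mid s)}{p_T(v \mid s)}
   + \underbrace{\sum_{v} \pi_\theta(v \mid s)\,\nabla_\theta \log \pi_\theta(v \mid s)}_{=\,\nabla_\theta \sum_v \pi_\theta(v \mid s)\,=\,0} \\
  &= -\,\mathbb{E}_{v \sim \pi_\theta(\cdot \mid s)}\!\left[\, r(v)\,\nabla_\theta \log \pi_\theta(v \mid s) \right],
  \qquad r(v) = \log\frac{p_T(v \mid s)}{\pi_\theta(v \mid s)}.
\end{align*}
```

The first line is the gradient of the cross‑entropy between teacher and student, since the teacher's entropy does not depend on $\theta$. The last line uses $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$; descending the reverse KL is therefore the same as ascending a REINFORCE objective with per‑token reward $r(v)$.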

Experimental setup

All four configurations were evaluated on a controlled platform:

Teachers: Qwen‑3‑4B and Qwen‑3‑8B.

Student: Qwen‑3‑0.6B (same tokenizer family).

Training data: DeepScaleR.

Evaluation benchmarks: AIME24, AMC23, MATH500, GSM8K.

Training lengths: 128‑token windows and 4096‑token windows.

Evaluation modes: (i) standalone distillation, (ii) distillation followed by GRPO RL.

Key empirical findings

KL direction trade‑off (accuracy vs. entropy): Reverse KL consistently yields higher Avg@k (by roughly 2.45 to 3.68 points) but collapses predictive entropy, inflates generated length, and reduces Pass@k. Example: on MATH500 with a 128‑token window, student‑prefix + reverse KL raises Avg@k from 34.31% to 42.65% (4B teacher) and 43.23% (8B teacher). With 4096‑token windows, however, the same setting pushes entropy to near zero, drives length to the 4k limit, and drops Pass@k below the forward‑KL baseline.

Prefix source trade‑off (quality vs. compute): Given equal training steps, student prefixes provide better supervision because the student is trained on states it will actually visit (DAgger effect). Under equal FLOP budgets, teacher prefixes are cheaper: trajectories and logits can be pre‑computed and cached, avoiding online rollout and scoring.

Training length trade‑off (accuracy vs. stability): Longer sequences improve final accuracy but amplify entropy collapse and length explosion under reverse KL. Short sequences are stable but cap performance.
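The tension between Avg@k and Pass@k is easier to see with the metrics written out. The article does not define them, so the definitions below are assumptions: Avg@k is taken as mean accuracy over all k samples per problem, and Pass@k as the fraction of problems solved by at least one of the k samples (the simple empirical form, not an unbiased estimator). The grades are made up for illustration.

```python
import math


def avg_at_k(correct):
    """Mean accuracy over the k samples of each problem, averaged over problems."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def pass_at_k(correct):
    """Fraction of problems where at least one of the k samples is correct."""
    return sum(1 for row in correct if any(row)) / len(correct)


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical grades: 3 problems, k = 4 samples each.
# A diverse sampler solves every problem once; a peaked sampler repeats
# one answer per problem and never solves the third.
diverse = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
]
peaked = [
    [True, True, True, True],
    [True, True, True, True],
    [False, False, False, False],
]

diverse_scores = (avg_at_k(diverse), pass_at_k(diverse))
peaked_scores = (avg_at_k(peaked), pass_at_k(peaked))
```

In this toy case the peaked sampler has the higher Avg@k but the lower Pass@k, which mirrors the direction of the reverse‑KL result reported above: concentrating probability mass helps the average sample and hurts coverage.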

Impact on downstream RL

When the four checkpoints are handed to GRPO RL, forward‑KL checkpoints retain entropy and improve steadily, whereas reverse‑KL checkpoints start from a higher Avg@k but quickly lose entropy, limiting exploration and causing accuracy to regress. Consequently, the strongest standalone distillation target is not necessarily the best starting point for a “distillation → RL” pipeline.
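One way to see why low entropy limits exploration under GRPO is to look at its group‑relative advantage, which normalizes each sampled response's reward against the other responses for the same prompt. The sketch below follows the commonly described form (reward minus group mean, divided by group standard deviation). The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration, and the paper's exact GRPO setup is not given in this summary.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# A group with varied outcomes produces a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A collapsed policy that emits near-identical responses gets identical
# rewards, so every advantage is zero and the update carries no signal.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Under these assumptions, a checkpoint whose samples all agree contributes nothing to the policy gradient for that prompt, whether the shared answer is right or wrong.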

Proposed remedies

KL mixing: Apply a token‑level weighted mixture of forward and reverse KL. Experiments show that a forward‑KL weight of at least 0.8 (the “sufficient forward” regime) preserves entropy and limits length while retaining most of the reverse‑KL accuracy gain. Reverse‑KL weights of 0.5 or higher still cause entropy collapse and length saturation.

Entropy‑gated length curriculum: Start with a short window (128 tokens) and monitor predictive entropy. Increase the length limit only while entropy stays above a stability threshold; freeze the limit once entropy drops. Compared with fixed 4096‑token training, this schedule yields +3.6 Avg@k and up to +5.8 Pass@k, and reduces average response length by roughly a factor of three.
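Both remedies can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the two KL weights are assumed to sum to 1, and the doubling growth factor, the threshold of 0.5, and the names `mixed_kl` and `next_length_limit` are hypothetical. Only the 128‑token start, the 4096‑token ceiling, and the 0.8 forward weight come from the article.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mixed_kl(teacher_probs, student_probs, forward_weight=0.8):
    """Token-level mix: w * KL(teacher || student) + (1 - w) * KL(student || teacher)."""
    forward = kl(teacher_probs, student_probs)
    reverse = kl(student_probs, teacher_probs)
    return forward_weight * forward + (1.0 - forward_weight) * reverse


def next_length_limit(current, entropy, threshold, frozen=False,
                      growth=2, max_len=4096):
    """Grow the length limit while entropy is healthy; freeze it for good once it drops.

    Returns (new_limit, frozen).
    """
    if frozen or entropy < threshold:
        return current, True
    return min(current * growth, max_len), False


# Hypothetical entropy readings taken at successive checkpoints.
limit, frozen = 128, False
history = []
for entropy in [0.9, 0.8, 0.7, 0.3, 0.8]:
    limit, frozen = next_length_limit(limit, entropy, threshold=0.5, frozen=frozen)
    history.append(limit)
# history == [256, 512, 1024, 1024, 1024]
```

The limit stays at 1024 after the fourth reading even though entropy later recovers, which matches the "freeze once entropy drops" rule; a variant that unfreezes on recovery would be a different schedule.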

Implementation detail

The authors implemented a fused kernel that streams the full‑vocab KL computation, reducing per‑token intermediate memory from O(|V|) to O(1). This enables exact KL evaluation (no sampling approximation) for all experiments.
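The summary gives no detail on the kernel itself, but the memory claim can be illustrated with a standard streaming trick: an online log‑sum‑exp that walks the vocabulary once and keeps only a handful of running scalars. The pure‑Python sketch below is an assumption about how such a computation can be organized, not the authors' GPU kernel; `streaming_forward_kl` and `naive_forward_kl` are hypothetical names.

```python
import math


def streaming_forward_kl(teacher_logits, student_logits):
    """Exact KL(teacher || student) from raw logits in one pass with O(1) state."""
    m_t = m_s = -math.inf  # running maxima of teacher / student logits
    z_t = z_s = 0.0        # running sums of exp(logit - running max)
    acc = 0.0              # running sum of exp(t - m_t) * (t - s)
    for t, s in zip(teacher_logits, student_logits):
        if t > m_t:
            scale = math.exp(m_t - t)  # rescale old sums to the new maximum
            z_t *= scale
            acc *= scale
            m_t = t
        w = math.exp(t - m_t)
        z_t += w
        acc += w * (t - s)
        if s > m_s:
            z_s *= math.exp(m_s - s)
            m_s = s
        z_s += math.exp(s - m_s)
    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s)
    return acc / z_t - (m_t + math.log(z_t)) + (m_s + math.log(z_s))


def naive_forward_kl(teacher_logits, student_logits):
    """Reference version that materializes full log-probability vectors."""
    def log_softmax(x):
        m = max(x)
        lse = m + math.log(sum(math.exp(v - m) for v in x))
        return [v - lse for v in x]

    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
```

The streaming version never stores a probability vector: each logit pair updates five scalars and is then discarded, which is the sense in which per‑token intermediate memory is constant in the vocabulary size. The result is exact up to floating‑point rounding, with no sampling over the vocabulary.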

Practical decision table

Pure distillation targeting Avg@k: use student prefix + reverse KL (OPD).

Distillation followed by RL: prioritize entropy preservation with forward KL or a forward‑heavy KL mix.

Compute‑constrained or when teacher trajectories are already available: use teacher prefix with cached logits.

Long‑sequence distillation: combine sufficient forward KL weight with the entropy‑gated length curriculum.

Reference

Paper link: https://arxiv.org/abs/2605.16826
