SFT, DAgger, Offline RL, and OPD: Four Methods Mapped onto a Single 2×2 Grid

The paper shows that SFT, DAgger, offline RL, and OPD are the four combinations of two orthogonal choices: prefix source (teacher vs. student) and KL direction (forward vs. reverse). This view exposes three hidden trade-offs: KL direction, prefix source, and training length. The paper also proposes KL-mixing and entropy-gated length curricula, which boost Avg@k by 3.6 points, raise Pass@k by up to 5.8 points, and cut response length threefold.
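To make the "KL direction" axis concrete, here is a minimal sketch of forward and reverse KL between a teacher and a student next-token distribution, along with a simple convex blend of the two. The blend and its `alpha` weight are illustrative assumptions and not necessarily the KL-mixing formulation used in the paper. The toy distributions are hypothetical, and the naming follows the usual distillation convention in which forward KL is KL(teacher || student).

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q has no zero entries wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixed_kl(teacher, student, alpha):
    """Hypothetical convex blend: alpha * forward KL + (1 - alpha) * reverse KL."""
    forward = kl(teacher, student)  # KL(teacher || student), mass-covering
    reverse = kl(student, teacher)  # KL(student || teacher), mode-seeking
    return alpha * forward + (1 - alpha) * reverse

# Toy next-token distributions over a three-token vocabulary (made-up numbers).
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]

forward = kl(teacher, student)
reverse = kl(student, teacher)
blend = mixed_kl(teacher, student, alpha=0.5)

print(f"forward={forward:.4f} reverse={reverse:.4f} blend={blend:.4f}")
```

The two directions generally give different values for the same pair of distributions, which is why the choice of direction is a real design decision. The other axis of the grid, prefix source, determines which model generated the context on which these distributions are evaluated.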

DAgger · KL divergence · LLM distillation


training trade-offs