What DeepSeek V4’s Multi‑Expert On‑Policy Distillation Reveals About Human Learning

The article analyzes DeepSeek V4’s post‑training pipeline, explains how multi‑expert on‑policy distillation (OPD) differs from traditional teacher‑forcing, compares reverse‑KL and forward‑KL objectives, and uses analogies to human learning to illustrate the benefits and limits of OPD.

Machine Learning Algorithms & Natural Language Processing

May 1, 2026

DeepSeek V4’s pipeline starts from a pre‑trained base model. Post‑training then produces dozens of domain‑specific expert models (math, code, agent, instruction following, etc.), each trained with the full SFT + GRPO reinforcement‑learning loop.

The key integration step is not classic teacher‑forcing, where the student copies the teacher’s output distribution on teacher‑written text. Instead, the student generates its own rollout and multiple teachers provide per‑token feedback on that trajectory. This is multi‑expert on‑policy distillation (OPD).
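To make the per‑token feedback concrete, here is a minimal toy sketch in Python. It is not DeepSeek's implementation. The function names, the three‑token vocabulary, the routing of a prompt to one teacher by domain label, and the fixed per‑position distributions (which ignore the prefix) are all assumptions made for illustration.

```python
import math
import random

def sample_token(probs, rng):
    """Draw one token id from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def opd_feedback(student_steps, teachers, domain, rng):
    """Student samples its own rollout; the routed teacher scores each sampled token.

    Feedback per position is log p_student(tok) - log p_teacher(tok), a
    single-sample estimate of the reverse-KL term at that position.
    Positive values mark tokens the student favors more than the teacher does.
    """
    teacher_steps = teachers[domain]  # hypothetical routing: one expert per domain
    tokens, feedback = [], []
    for s_probs, t_probs in zip(student_steps, teacher_steps):
        tok = sample_token(s_probs, rng)  # on-policy: the token comes from the student
        tokens.append(tok)
        feedback.append(math.log(s_probs[tok]) - math.log(t_probs[tok]))
    return tokens, feedback

# Toy data: two positions, three-token vocabulary, all probabilities positive.
student = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
teachers = {
    "math": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "code": [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4]],
}
tokens, feedback = opd_feedback(student, teachers, "math", random.Random(0))
```

In a real system the feedback would drive a gradient update on the student. The sketch shows only where the samples come from and who scores them.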

The previous version, V3.2, used an off‑policy “expert‑generated data + SFT” approach: expert outputs were treated as static training data, and a mixed RL stage followed. OPD replaces that recipe with on‑policy feedback, which removes the need for a separate mixed RL phase.

OPD and RL share a reverse‑KL objective, whereas pre‑training, SFT, and traditional distillation all use forward KL. The difference lies in who supplies the samples. Forward KL takes its expectation over data or teacher outputs. Reverse KL samples trajectories from the student and aligns them with teacher feedback, which reduces catastrophic forgetting when merging specialized experts into a unified model.
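The two objectives can be written down for a single next‑token distribution. The sketch below uses made‑up probabilities over a three‑token vocabulary, and the variable names are illustrative only. It shows that the same pair of distributions yields different values depending on which side the expectation is taken over.

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative numbers only.
teacher = [0.49, 0.02, 0.49]
student = [0.90, 0.05, 0.05]

# Forward KL: expectation under the teacher (SFT / classic distillation view).
forward_kl = kl(teacher, student)

# Reverse KL: expectation under the student (OPD / RL view).
reverse_kl = kl(student, teacher)
```

Forward KL heavily penalizes the student for giving little mass to a token the teacher likes. Reverse KL penalizes the student only where the student itself puts mass.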

An analogy: a teacher who only corrects what you actually say (reverse KL) lets you keep your own pronunciation and settle on one good way of speaking (mode‑seeking). Copying everything the teacher says (forward KL) forces you to cover the teacher’s full range, accent included (mode‑covering).
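The mode‑seeking versus mode‑covering contrast can be reproduced with a small grid search. In this sketch a student restricted to a single‑peaked shape is fit to a two‑peaked teacher. The five‑token teacher, the discretized‑Gaussian student family, and the grid of centers and widths are all assumptions chosen for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bump(center, width, n=5):
    """Single-peaked student: a discretized Gaussian over n tokens."""
    w = [math.exp(-((i - center) ** 2) / (2 * width ** 2)) for i in range(n)]
    z = sum(w)
    return [x / z for x in w]

# Teacher with two modes (tokens 0 and 4) and little mass in between.
teacher = [0.47, 0.02, 0.02, 0.02, 0.47]

# Candidate students: every (center, width) pair on a small grid.
grid = [(c, s) for c in range(5) for s in (0.5, 1.0, 2.0, 4.0)]

# Forward KL picks the student that covers both teacher modes.
fwd = min(grid, key=lambda cs: kl(teacher, bump(*cs)))
# Reverse KL picks the student that commits to one teacher mode.
rev = min(grid, key=lambda cs: kl(bump(*cs), teacher))
```

On this grid the forward‑KL winner is the broad bump centered between the two modes, while the reverse‑KL winner is a narrow bump sitting on one of them.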

Empirical evidence: Qwen‑3’s report shows that distilling from a strong teacher using reverse KL outperforms pure RL in both performance and training efficiency, while using only about one‑tenth of the GPU compute.

OPD is not universally superior. As noted in “Rethinking On‑Policy Distillation of Large Language Models,” it fails when the student’s and teacher’s top‑k token distributions have low overlap. Experiments with Qwen‑3‑4B‑GRPO as teacher and Qwen‑3‑1.7B‑Base as student confirm that shared “thinking patterns” (high overlap) are crucial.
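One simple way to quantify overlap at a single position is the fraction of top‑k token ids the two models share. The sketch below is an assumed, simplified metric and may differ from the exact definition used in the cited paper. The probabilities are made up.

```python
def top_k_ids(probs, k):
    """Return the ids of the k highest-probability tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:k])

def top_k_overlap(student_probs, teacher_probs, k):
    """Fraction of the top-k token ids shared by student and teacher."""
    shared = top_k_ids(student_probs, k) & top_k_ids(teacher_probs, k)
    return len(shared) / k

# Illustrative next-token distributions over a five-token vocabulary.
student = [0.50, 0.30, 0.10, 0.05, 0.05]
teacher = [0.05, 0.40, 0.45, 0.05, 0.05]
overlap = top_k_overlap(student, teacher, k=2)  # student top-2 {0, 1}, teacher top-2 {1, 2}
```

Averaging such a score over the positions of student rollouts gives a rough signal of whether the teacher's feedback falls on tokens the student would plausibly produce.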

Two practical tricks help make OPD work:

Off‑policy cold start: first perform teacher‑forced SFT on teacher‑generated rollouts to bring the student close to the teacher distribution, then apply OPD. This speeds up early convergence and raises the performance ceiling.

Teacher‑aligned prompts: use prompts seen during the teacher’s post‑training, and mix in some out‑of‑distribution prompts to avoid entropy collapse.
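Both tricks come down to small scheduling decisions, sketched below. The function names, the phase labels, the step threshold, and the 25% out‑of‑distribution fraction are hypothetical. The source gives no concrete values for either.

```python
import random

def training_phase(step, cold_start_steps):
    """Off-policy cold start: SFT on teacher rollouts first, then switch to OPD."""
    return "sft_on_teacher_rollouts" if step < cold_start_steps else "opd"

def build_prompt_batch(teacher_prompts, ood_prompts, batch_size, ood_fraction, rng):
    """Mostly teacher-aligned prompts, plus a fixed share of out-of-distribution ones."""
    n_ood = round(batch_size * ood_fraction)
    batch = rng.choices(teacher_prompts, k=batch_size - n_ood)  # sampled with replacement
    batch += rng.choices(ood_prompts, k=n_ood)
    rng.shuffle(batch)
    return batch

# Placeholder prompt pools; real pools would hold actual prompt strings.
teacher_prompts = ["t1", "t2", "t3"]
ood_prompts = ["o1", "o2"]
batch = build_prompt_batch(teacher_prompts, ood_prompts, 8, 0.25, random.Random(0))
```

The out‑of‑distribution share and the cold‑start length are tuning knobs here, not values reported by the article.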

The article also critiques the popular “recursive learning” hype, arguing that LLMs merely accelerate the established reverse‑KL learning loop. Without a solid forward‑KL pre‑training stage, the learner cannot interpret the high‑quality feedback that OPD depends on.

A collection of recommender‑system technical reports: https://blog.recsys-frontier.com/category/推荐技术报告
A collection of AI technical reports: https://blog.recsys-frontier.com/category/AI技术报告
