Why Standard SFT Fails to Generalize and How One‑Line Dynamic Fine‑Tuning Fixes It

The article analyzes the poor generalization of supervised fine‑tuning (SFT) for large language models, reveals its gradient as a high‑variance inverse‑probability policy gradient, proposes a one‑line Dynamic Fine‑Tuning correction, and shows substantial gains on challenging math and offline RL benchmarks.


Background: The SFT Generalization Dilemma

In large‑language‑model alignment, Supervised Fine‑Tuning (SFT) is the quickest way to teach a model to imitate expert answers, but practitioners observe that SFT tends to memorize the training examples by rote and struggles to generalize, unlike reinforcement learning (RL), which can extrapolate to unseen problems.

Key Insight: SFT as a Pathological RL Special Case

By unifying SFT and RL under a common mathematical framework, the authors show that the standard SFT gradient is equivalent to a policy‑gradient estimator carrying an inverse‑probability weight 1/π_θ(y*_t | y*_{<t}, x) on each expert token. When the model assigns low probability to an expert token, this weight explodes, causing huge gradient variance. Consequently, SFT implicitly optimizes a sparse, high‑variance reward signal, leading to over‑fitting on low‑probability tokens and poor generalization.
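Written out in our own notation (reconstructed from the description above, not copied from the paper), the identity behind this claim is roughly:

\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\sum_t \nabla_\theta \log \pi_\theta(y^*_t \mid y^*_{<t}, x)
  = -\sum_t \mathbb{E}_{y_t \sim \pi_\theta(\cdot \mid y^*_{<t}, x)}
      \left[ \frac{\mathbf{1}[y_t = y^*_t]}{\pi_\theta(y^*_t \mid y^*_{<t}, x)}
             \, \nabla_\theta \log \pi_\theta(y_t \mid y^*_{<t}, x) \right]

Read as a policy gradient, the indicator 1[y_t = y*_t] plays the role of a sparse reward, and 1/π_θ(y*_t | y*_{<t}, x) is the importance weight that blows up on low‑probability expert tokens.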

SFT’s implicit reward is “sparse + high variance”, making it structurally prone to over‑fitting.

Solution: Dynamic Fine‑Tuning (DFT) – Cancel the Inverse‑Probability Weight

The remedy is to multiply each token's cross‑entropy loss by the stop‑gradient of the probability the model assigns to that token, which cancels the inverse‑probability factor. At the token level, the loss becomes:

probs = torch.softmax(logits, dim=-1).gather(-1, target.unsqueeze(-1)).squeeze(-1)  # p(expert token)
loss = (probs.detach() * F.cross_entropy(logits, target, reduction='none')).mean()

Here .detach() implements the stop‑gradient operator sg, so the weight itself receives no gradient. This one‑line correction removes the variance blow‑up caused by the inverse probability, effectively assigning a uniform reward of 1 to all expert tokens.
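For a self‑contained picture, here is a minimal PyTorch sketch of the whole token‑level loss; the function name dft_loss, the (batch, seq_len, vocab) shapes, and the padding convention are our assumptions, not the paper's reference implementation.

import torch
import torch.nn.functional as F

def dft_loss(logits, targets, pad_id=-100):
    # logits:  (batch, seq_len, vocab_size) pre-softmax scores
    # targets: (batch, seq_len) token ids; positions equal to pad_id are ignored
    vocab = logits.size(-1)
    flat_logits = logits.view(-1, vocab)
    flat_targets = targets.view(-1)
    # Per-token cross-entropy, i.e. -log p_theta(y*_t | prefix); zero at ignored positions.
    ce = F.cross_entropy(flat_logits, flat_targets, ignore_index=pad_id, reduction="none")
    # p_theta(y*_t) = exp(-ce), taken under stop-gradient: the DFT weight that cancels 1/p.
    weight = torch.exp(-ce).detach()
    mask = (flat_targets != pad_id).float()
    # Mean over non-pad tokens of sg[p_theta(y*_t)] * (-log p_theta(y*_t)).
    return (weight * ce * mask).sum() / mask.sum().clamp(min=1.0)

Compared with plain SFT, tokens the model already finds likely keep roughly their usual gradient, while low‑probability expert tokens are down‑weighted instead of dominating the update.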

Experimental Results: One Line, Across‑The‑Board Wins

The authors evaluated five state‑of‑the‑art models (e.g., Qwen2.5‑Math, LLaMA‑3.x, DeepSeekMath) on five difficult math benchmarks (MATH‑500, OlympiadBench, AIME 2024, etc.).

Larger gains on harder tasks: On AIME 2024, SFT degrades Qwen2.5‑Math‑7B from 6.68 to 2.48, while DFT lifts it to 8.56.

Faster convergence: DFT reaches the accuracy of 300‑step SFT in only ~100 steps.

Offline RL scenario: when transplanted to a rejection‑sampling + reward offline RL setting, DFT still outperforms DPO, RFT, PPO, and GRPO by 3–11 points on average; a sketch of such a setup follows.
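The post does not spell out that pipeline, so the following is only a plausible sketch of the rejection‑sampling data construction; sample_responses, reward_fn, and the threshold are hypothetical stand‑ins rather than the paper's code.

from typing import Callable, List, Tuple

def build_offline_dataset(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],  # e.g. N samples from a frozen policy
    reward_fn: Callable[[str, str], float],             # e.g. an answer checker returning 0 or 1
    n_samples: int = 8,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    # Keep only (prompt, response) pairs whose reward clears the threshold;
    # the kept pairs are then trained on with the DFT loss instead of plain SFT.
    dataset = []
    for prompt in prompts:
        for response in sample_responses(prompt, n_samples):
            if reward_fn(prompt, response) >= threshold:
                dataset.append((prompt, response))
    return dataset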

Experiment results chart

Deeper Phenomenon: Polarized Token Distributions

Visualization of token probability distributions shows that SFT pushes all tokens toward probability 1, including many meaningless function words. DFT, by contrast, sharpens important tokens while suppressing filler tokens, creating a “polarized” distribution that concentrates probability mass on mathematically meaningful tokens.
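A rough way to reproduce this comparison (our sketch, assuming a Hugging Face‑style causal LM whose forward pass returns .logits and labels already shifted to align with those logits) is to histogram the probability each checkpoint assigns to the reference tokens:

import torch
import torch.nn.functional as F

@torch.no_grad()
def target_token_probs(model, input_ids, labels, pad_id=-100):
    # Probability the model assigns to every reference token; histogram these
    # values for the SFT and DFT checkpoints to compare the two distributions.
    logits = model(input_ids).logits                      # (batch, seq_len, vocab_size)
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        ignore_index=pad_id, reduction="none",
    )
    keep = labels.view(-1) != pad_id
    return torch.exp(-ce[keep])                           # p(y*_t) for each kept token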

Token distribution comparison

Conclusion & Outlook

By expressing the SFT gradient as a policy gradient, identifying the inverse‑probability pathology, and removing it with a single line of code, DFT brings SFT's generalization close to, or beyond, that of RL.

Theoretical contribution: First rigorous proof that SFT is a special case of RL with a pathological reward structure.

Engineering contribution: DFT requires no extra models, data, or hyper‑parameters, making it community‑friendly.

Limitations: Experiments are limited to math tasks and models ≤7B; further validation on code, commonsense reasoning, larger models, and multimodal settings is needed.

If you are working on instruction tuning, mathematical reasoning, or offline RL, replace the standard cross‑entropy line loss = F.cross_entropy(...) with the probability‑weighted version loss = (probs.detach() * F.cross_entropy(..., reduction='none')).mean(), where probs is the model's probability of each target token, and you may gain several points of performance from essentially one line of code.

machine learning, reinforcement learning, Supervised Fine‑Tuning, Generalization, LLM alignment, Dynamic Fine‑Tuning
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
