Why SFT and RL Are Two Sides of the Same Coin: A Unified Gradient Theory for LLM Post‑Training
This article analyzes a recent paper that unifies supervised fine‑tuning (SFT) and reinforcement learning (RL) for large language models under a single gradient estimator. The paper introduces the Unified Policy Gradient Estimator (UPGE) and the Hybrid Post‑Training (HPT) algorithm, and reports that HPT outperforms existing baselines on math reasoning benchmarks.
Background and Motivation
Large language models (LLMs) have achieved impressive capabilities, but improving logical reasoning and problem‑solving remains a key challenge. Two dominant post‑training paradigms are supervised fine‑tuning (SFT), which learns directly from human‑written answers, and reinforcement learning (RL), which lets the model explore solutions and adjust based on reward signals. SFT is stable but prone to over‑fitting, while RL offers strong exploration but can diverge early without a solid prior.
Unified Policy Gradient Estimator (UPGE)
The authors prove that SFT and RL optimize the same underlying objective when viewed at the gradient level. They derive a compact unified gradient form:
Gradient = StabilityMask × ImportanceWeight × Advantage × PolicyDirection
All existing methods (SFT, PPO, GRPO, SRFT, etc.) can be expressed with this formula; they differ only in how each component is instantiated.
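Written symbolically, the product reads as follows. This is a sketch of the notation rather than a quotation from the paper: the symbols \mathbb{1}_{\text{stable}}, \pi_{\text{ref}}, and \hat{A} are assumed names for the stability mask, the reference policy denominator, and the advantage estimate. The second line checks the SFT case, where the mask and advantage are both 1 and the denominator is the current policy, so the estimator collapses to the familiar log‑likelihood gradient.

```latex
\nabla_\theta \mathcal{J}(\theta) \;\approx\;
\underbrace{\mathbb{1}_{\text{stable}}}_{\text{stability mask}}
\cdot \underbrace{\frac{1}{\pi_{\text{ref}}}}_{\text{importance weight}}
\cdot \underbrace{\hat{A}}_{\text{advantage}}
\cdot \underbrace{\nabla_\theta \pi_\theta}_{\text{policy direction}}

% SFT as a special case: mask = 1, \hat{A} = 1, \pi_{\text{ref}} = \pi_\theta
\frac{1}{\pi_\theta}\,\nabla_\theta \pi_\theta \;=\; \nabla_\theta \log \pi_\theta
```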
Core Components
Stability Mask : limits excessively large updates (e.g., PPO clipping) to keep training stable.
Reference Policy Denominator (Importance Weight) : adjusts token‑wise importance. SFT uses the current policy as the denominator, which up‑weights low‑probability tokens; PPO uses the old policy to bound how far each update can move; offline RL often holds the denominator at a constant value, which reduces variance at the cost of bias.
Advantage Estimator : scores answer sequences. In SFT all samples are treated as positive; RL uses normalized rewards (e.g., GRPO intra‑batch normalization).
Likelihood Gradient (Policy Direction) : the actual parameter‑update term shared by all methods.
These components embody a bias‑variance trade‑off: SFT has low variance but high bias, RL has low bias but high variance, and offline RL reduces variance while introducing additional bias.
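To make the component view concrete, here is a minimal sketch of the scalar that multiplies the policy gradient for a single token. It assumes the importance weight is one over a reference probability and that the stability mask is the standard PPO clipping rule; the function names and the default clip range of 0.2 are illustrative choices, not taken from the paper.

```python
def stability_mask(ratio, advantage, clip_eps=0.2):
    """PPO-style mask: 0.0 where the clipped surrogate has zero gradient."""
    if advantage > 0 and ratio > 1.0 + clip_eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - clip_eps:
        return 0.0
    return 1.0


def unified_coefficient(pi_theta, pi_ref, advantage, use_clip=False, clip_eps=0.2):
    """Scalar multiplying grad(pi_theta) for one token: mask * advantage / pi_ref."""
    mask = 1.0
    if use_clip:
        mask = stability_mask(pi_theta / pi_ref, advantage, clip_eps)
    return mask * advantage / pi_ref


# SFT instantiation (assumed): no mask, advantage 1, denominator = current policy.
sft_rare = unified_coefficient(pi_theta=0.1, pi_ref=0.1, advantage=1.0)
sft_common = unified_coefficient(pi_theta=0.9, pi_ref=0.9, advantage=1.0)

# PPO instantiation (assumed): denominator = old policy, clipping mask enabled.
ppo_inside = unified_coefficient(0.55, 0.5, 1.0, use_clip=True)   # ratio 1.1, kept
ppo_clipped = unified_coefficient(0.7, 0.5, 1.0, use_clip=True)   # ratio 1.4, masked

print(sft_rare, sft_common, ppo_inside, ppo_clipped)
```

In the SFT setting the coefficient is larger for the rare token than for the common one, which is the up‑weighting of low‑probability tokens described above. In the PPO setting the token whose probability ratio has already left the clip range contributes nothing.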
Hybrid Post‑Training (HPT) Algorithm
Building on UPGE, HPT lets the model decide dynamically whether to follow SFT (imitation) or RL (exploration) for each training instance.
Dynamic Switching Mechanism
For a given question, generate multiple answers and compute the correctness rate.
If the rate is low, apply SFT to learn from demonstrations.
If the rate is high, switch to RL to explore alternative solutions.
The switch itself is binary: each question receives either the SFT loss or the RL loss (mixing weights of 0 or 1), so only the correctness threshold has to be chosen and no mixing coefficients need manual tuning.
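The switching steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the names `correctness_rate`, `switch_coefficients`, and `gate` are hypothetical, and the default `gate=0.0` corresponds to triggering SFT only when every sampled answer is wrong.

```python
def correctness_rate(rollout_is_correct):
    """Fraction of sampled answers judged correct for one question."""
    return sum(rollout_is_correct) / len(rollout_is_correct)


def switch_coefficients(rate, gate=0.0):
    """Return (alpha, beta): the 0/1 weights on the RL loss and the SFT loss."""
    if rate > gate:
        return 1, 0   # the model can already solve this question: explore with RL
    return 0, 1       # the model is failing: imitate the demonstration with SFT


def choose_mode(rollout_is_correct, gate=0.0):
    """Name of the training signal selected for this question."""
    alpha, _beta = switch_coefficients(correctness_rate(rollout_is_correct), gate)
    return "rl" if alpha == 1 else "sft"


all_wrong = [False] * 8
one_right = [True] + [False] * 7

print(choose_mode(all_wrong))            # selects SFT: no rollout succeeded
print(choose_mode(one_right))            # selects RL: rate 0.125 exceeds gate 0.0
print(choose_mode(one_right, gate=0.5))  # selects SFT under a stricter gate
```

Raising `gate` sends more questions to SFT, so the threshold acts as the single knob between imitation and exploration.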
Loss Composition
RL component uses the DR‑GRPO loss with clipping and normalized advantage.
SFT component uses standard cross‑entropy.
The total loss is the sum of the two, weighted implicitly by the dynamic switch.
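A rough sketch of how the two losses could be combined is shown below. Several simplifications are assumptions made for illustration: probability ratios are treated per sampled answer rather than per token, the advantage is the reward minus the group mean, and `alpha` and `beta` are hypothetical names for the 0/1 switch weights. It does not reproduce the exact loss used in the paper.

```python
import math


def sft_loss(demo_token_probs):
    """Cross-entropy on a demonstration: mean negative log-likelihood of its tokens."""
    return -sum(math.log(p) for p in demo_token_probs) / len(demo_token_probs)


def centered_advantages(rewards):
    """Group-relative advantage, assumed here to be reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_surrogate_loss(ratios, advantages, clip_eps=0.2):
    """PPO-style clipped objective, negated into a loss and averaged over samples."""
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(ratios)


def hybrid_loss(alpha, beta, rl_loss_value, sft_loss_value):
    """Total loss: the switch weights select which term is active."""
    return alpha * rl_loss_value + beta * sft_loss_value


rewards = [1.0, 0.0, 0.0, 1.0]   # correctness of four sampled answers
ratios = [1.5, 1.0, 1.0, 0.5]    # new-policy / old-policy probability ratios
rl = clipped_surrogate_loss(ratios, centered_advantages(rewards))
sft = sft_loss([0.5, 0.5])       # demonstration tokens each assigned probability 0.5

print(hybrid_loss(1, 0, rl, sft))  # RL branch active
print(hybrid_loss(0, 1, rl, sft))  # SFT branch active
```

Because the weights are 0 or 1, only one of the two terms contributes for a given question, so the sum never blends the two signals.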
Experimental Setup
Models : Qwen2.5‑Math‑1.5B, Qwen2.5‑Math‑7B, and LLaMA3.1‑8B.
Benchmarks : Six math‑reasoning datasets (AIME, MATH, etc.) plus two out‑of‑distribution sets (ARC‑c, GPQA).
Baselines : SFT, GRPO, LUFFY, SRFT, and sequential SFT→GRPO pipelines.
Results
HPT consistently outperforms all baselines across models and datasets, especially on out‑of‑distribution generalization. For example, on Qwen2.5‑Math‑7B, HPT improves the AIME2024 score by 7 percentage points over the strongest baseline. The method achieves higher Pass@1024 (better exploration) while solving more hard problems without forgetting previously learned ones.
Ablation Studies
Offline RL on demonstration data (e.g., LUFFY) is less effective than SFT on the same high‑quality demonstrations, highlighting the importance of good SFT data.
The best trade‑off is obtained when SFT is triggered only on complete failure; overly aggressive SFT suppresses exploration.
Discussion and Future Work
UPGE provides a unified lens to view SFT, online RL, and offline RL as gradient estimators of the same objective under different data‑distribution assumptions. HPT demonstrates that a simple, self‑adaptive switching strategy can harness the strengths of both paradigms.
Future directions include finer‑grained component composition strategies, extending the framework to multimodal and cross‑task scenarios, and deeper theoretical analysis of bias‑variance trade‑offs in complex LLM training.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.