Why SFT and RL Are Two Sides of the Same Coin: A Unified Gradient Theory for LLM Post‑Training

This article analyzes a recent paper that unifies supervised fine-tuning (SFT) and reinforcement learning (RL) for large language models under a single gradient estimator. The paper introduces the Unified Policy Gradient Estimator (UPGE) as the theoretical framework and the Hybrid Post-Training (HPT) algorithm built on it, and reports that HPT outperforms existing baselines on math reasoning benchmarks.

Data Party THU

Background and Motivation

Large language models (LLMs) have achieved impressive capabilities, but improving logical reasoning and problem‑solving remains a key challenge. Two dominant post‑training paradigms are supervised fine‑tuning (SFT), which learns directly from human‑written answers, and reinforcement learning (RL), which lets the model explore solutions and adjust based on reward signals. SFT is stable but prone to over‑fitting, while RL offers strong exploration but can diverge early without a solid prior.

Unified Policy Gradient Estimator (UPGE)

The authors prove that SFT and RL optimize the same underlying objective when viewed at the gradient level. They derive a compact unified gradient form:

Gradient = StabilityMask × ImportanceWeight × Advantage × PolicyDirection

All existing methods (SFT, PPO, GRPO, SRFT, etc.) can be expressed with this formula; they differ only in how each component is instantiated.
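To make the four-factor product concrete, here is a minimal sketch on a toy softmax policy over three tokens. It assumes the importance weight is the reciprocal of a reference probability and that the policy direction is the gradient of the token's probability with respect to the logits (rather than full model parameters). The function names and numbers are hypothetical and not taken from the paper. Under these assumptions, choosing the current policy as the denominator with an advantage of 1 recovers the ordinary log-likelihood gradient that SFT follows.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_prob_wrt_logits(probs, token):
    # d pi(token) / d logit_j for a softmax policy: pi_t * (onehot_j - pi_j).
    p_t = probs[token]
    return [p_t * ((1.0 if j == token else 0.0) - probs[j])
            for j in range(len(probs))]

def unified_gradient(probs, token, ref_prob, advantage, stable=True):
    # Assumed form: mask * (1 / ref_prob) * advantage * d pi(token)/d logits.
    # This is an ascent direction on the objective, not a loss gradient.
    mask = 1.0 if stable else 0.0
    scale = mask * advantage / ref_prob
    return [scale * g for g in grad_prob_wrt_logits(probs, token)]

logits = [2.0, 0.5, -1.0]   # toy values, chosen only for illustration
probs = softmax(logits)
token = 1

# SFT-style instantiation: denominator = current policy, advantage = 1, no masking.
sft_grad = unified_gradient(probs, token, ref_prob=probs[token], advantage=1.0)

# The familiar softmax log-likelihood gradient: onehot - probs.
loglik_grad = [(1.0 if j == token else 0.0) - probs[j] for j in range(len(probs))]

# PPO-style instantiation: denominator = old policy probability (hypothetical value).
old_prob = 0.25
ppo_like_grad = unified_gradient(probs, token, ref_prob=old_prob, advantage=-0.5)
```

In this toy setting `sft_grad` matches `loglik_grad` up to floating-point error, while `ppo_like_grad` is the same direction rescaled by the probability ratio and the advantage.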

Core Components

Stability Mask: limits excessively large updates (e.g., PPO clipping) to keep training stable.

Reference Policy Denominator (Importance Weight): adjusts token-wise importance. SFT divides by the current policy, which up-weights low-probability tokens; PPO divides by the old policy to bound how far the update moves from it; offline RL often fixes the denominator to reduce variance at the cost of bias.

Advantage Estimator: scores answer sequences. In SFT all samples are treated as positive; RL uses normalized rewards (e.g., GRPO normalizes rewards within the group of answers sampled for the same question).

Likelihood Gradient: the term that carries the update into the model parameters; it is shared by all methods.
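The stability mask and the advantage estimator are the easiest components to illustrate in isolation. The sketch below assumes a GRPO-style advantage that standardizes rewards within one group of sampled answers, and a mask that is zero exactly where a PPO-style clipped objective passes no gradient. The function names, the `eps` constant, the clip range of 0.2, and the use of the population standard deviation are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-6):
    # Standardize rewards within one group of answers to the same question.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]

def ppo_clip_mask(ratio, advantage, clip_eps=0.2):
    # 1.0 where min(r * A, clip(r) * A) still depends on r, else 0.0.
    if advantage > 0 and ratio > 1 + clip_eps:
        return 0.0
    if advantage < 0 and ratio < 1 - clip_eps:
        return 0.0
    return 1.0

# RL-style advantages for four sampled answers with binary rewards.
rl_advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# SFT-style advantages: every demonstration counts as a positive sample.
sft_advantages = [1.0] * 4

# A group where every answer gets the same reward has zero advantage everywhere.
flat_advantages = group_normalized_advantages([1.0, 1.0, 1.0])
```

Note that a group with identical rewards yields all-zero advantages under this normalization, so such a group contributes no RL gradient in this sketch.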

These components embody a bias‑variance trade‑off: SFT has low variance but high bias, RL has low bias but high variance, and offline RL reduces variance while introducing additional bias.

Hybrid Post‑Training (HPT) Algorithm

Building on UPGE, HPT lets the model decide dynamically whether to follow SFT (imitation) or RL (exploration) for each training instance.

Dynamic Switching Mechanism

For a given question, generate multiple answers and compute the correctness rate.

If the rate is low, apply SFT to learn from demonstrations.

If the rate is high, switch to RL to explore alternative solutions.

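The switch is binary: depending on whether a question's correctness rate clears a threshold, the SFT and RL losses for that question receive weights of 0 or 1, which eliminates the need for manually tuned mixing coefficients.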
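The switching rule described above can be sketched as a small gate function. It assumes binary rewards (1 for a correct answer, 0 otherwise) and a default threshold of 0, meaning SFT is chosen only when every sampled answer fails. The names `correctness_rate` and `hpt_gate` and the default threshold are hypothetical choices for illustration.

```python
def correctness_rate(rewards):
    # Fraction of sampled answers judged correct (reward > 0).
    return sum(1 for r in rewards if r > 0) / len(rewards)

def hpt_gate(rewards, threshold=0.0):
    # Return (sft_weight, rl_weight) as a hard 0/1 switch.
    # Assumption: a rate at or below the threshold selects SFT, otherwise RL.
    rate = correctness_rate(rewards)
    if rate <= threshold:
        return 1.0, 0.0
    return 0.0, 1.0

# All four sampled answers are wrong: imitate the demonstration.
weights_all_wrong = hpt_gate([0, 0, 0, 0])

# One of four answers is right: let RL explore.
weights_some_right = hpt_gate([0, 1, 0, 0])

# A stricter (hypothetical) threshold sends the same question to SFT instead.
weights_strict = hpt_gate([0, 1, 0, 0], threshold=0.5)
```

Raising the threshold makes the gate fall back to SFT more often, which is the knob the ablation section refers to when it discusses how aggressively SFT is triggered.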

Loss Composition

The RL component uses the Dr. GRPO loss with clipping and normalized advantage.

The SFT component uses standard cross-entropy.

The total loss is the sum of the two, weighted implicitly by the dynamic switch.
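As a rough sketch of how the two terms combine, the code below adds a cross-entropy term and a clipped surrogate term under 0/1 weights. The clipped surrogate here is a generic PPO-style stand-in averaged over tokens, not the exact Dr. GRPO formulation, and all function names and inputs (per-token probabilities, ratios, advantages) are hypothetical.

```python
import math

def sft_loss(demo_token_probs):
    # Cross-entropy: mean negative log-probability of the demonstration tokens.
    return -sum(math.log(p) for p in demo_token_probs) / len(demo_token_probs)

def clipped_rl_loss(ratios, advantages, clip_eps=0.2):
    # Generic PPO-style clipped surrogate, negated so that lower is better.
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1 - clip_eps), 1 + clip_eps)
        terms.append(min(r * a, clipped * a))
    return -sum(terms) / len(terms)

def hybrid_loss(sft_weight, rl_weight, demo_token_probs, ratios, advantages):
    # Sum of the two losses; with 0/1 weights only one term is active.
    return (sft_weight * sft_loss(demo_token_probs)
            + rl_weight * clipped_rl_loss(ratios, advantages))

# SFT branch active: the loss is the cross-entropy of the demonstration.
loss_sft_only = hybrid_loss(1.0, 0.0, [0.5, 0.5], [1.0], [1.0])

# RL branch active: a ratio of 1.5 with positive advantage is clipped at 1.2.
loss_rl_only = hybrid_loss(0.0, 1.0, [0.5], [1.5], [1.0])
```

Because the weights are binary, each question contributes through exactly one branch per step in this sketch, even though the total is written as a sum.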

Experimental Setup

Models: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, and LLaMA3.1-8B.

Benchmarks: Six math-reasoning datasets (AIME, MATH, etc.) plus two out-of-distribution sets (ARC-c, GPQA).

Baselines: SFT, GRPO, LUFFY, SRFT, and sequential SFT→GRPO pipelines.

Results

HPT consistently outperforms all baselines across models and datasets, especially on out-of-distribution generalization. For example, on Qwen2.5-Math-7B, HPT improves the AIME 2024 score by 7 percentage points over the strongest baseline. It also achieves higher Pass@1024, an indicator of how well exploration capacity is preserved, and it solves more hard problems without forgetting those the model could already solve.
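For readers unfamiliar with Pass@k, the sketch below shows the widely used unbiased estimator: given n sampled answers of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. The article does not say how Pass@1024 was computed in the paper, so this is offered only as the common definition, with a hypothetical function name.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimate of P(at least one of k samples is correct),
    # given c correct answers among n samples: 1 - C(n - c, k) / C(n, k).
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# One correct answer out of four samples.
p1 = pass_at_k(4, 1, 1)   # 1 - 3/4 = 0.25
p2 = pass_at_k(4, 1, 2)   # 1 - 3/6 = 0.5

# No correct answers: the estimate is 0 regardless of k.
p_none = pass_at_k(10, 0, 5)
```

With a large k such as 1024, the metric rewards a model for which a correct answer remains reachable somewhere in its output distribution, which is why it is read as a proxy for exploration capacity.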

Ablation Studies

Offline RL methods (e.g., LUFFY) are less effective than SFT on high-quality demonstration data, highlighting the importance of good SFT data.

The best trade‑off is obtained when SFT is triggered only on complete failure; overly aggressive SFT suppresses exploration.

Discussion and Future Work

UPGE provides a unified lens to view SFT, online RL, and offline RL as gradient estimators of the same objective under different data‑distribution assumptions. HPT demonstrates that a simple, self‑adaptive switching strategy can harness the strengths of both paradigms.

Future directions include finer‑grained component composition strategies, extending the framework to multimodal and cross‑task scenarios, and deeper theoretical analysis of bias‑variance trade‑offs in complex LLM training.

Illustrations

[Figure: Diagram of SFT vs RL]
[Figure: Gradient forms of SFT, PPO, GRPO, SRFT]
[Figure: Performance comparison chart]
[Figure: Pass@1024 vs baseline]
[Figure: Ablation: offline RL vs SFT]