Data Party THU
Oct 1, 2025 · Artificial Intelligence
Why SFT and RL Are Two Sides of the Same Coin: A Unified Gradient Theory for LLM Post‑Training
This article analyzes a recent paper that unifies supervised fine‑tuning (SFT) and reinforcement learning (RL) for large language models under a single gradient estimator, introduces the Unified Policy Gradient Estimator (UPGE) and the Hybrid Post‑Training (HPT) algorithm, and demonstrates their superior performance on math reasoning benchmarks.
AI research · Hybrid Training · LLM
