Tagged articles
Data Party THU
Oct 1, 2025 · Artificial Intelligence

Why SFT and RL Are Two Sides of the Same Coin: A Unified Gradient Theory for LLM Post‑Training

This article analyzes a recent paper that unifies supervised fine‑tuning (SFT) and reinforcement learning (RL) for large language models under a single gradient estimator, introduces the Unified Policy Gradient Estimator (UPGE) and the Hybrid Post‑Training (HPT) algorithm, and demonstrates their superior performance on math reasoning benchmarks.

AI research · Hybrid Training · LLM
11 min read