Oct 1, 2025 · Artificial Intelligence

Why SFT and RL Are Two Sides of the Same Coin: A Unified Gradient Theory for LLM Post‑Training

This article analyzes a recent paper that unifies supervised fine‑tuning (SFT) and reinforcement learning (RL) for large language models under a single gradient estimator, introduces the Unified Policy Gradient Estimator (UPGE) and the Hybrid Post‑Training (HPT) algorithm, and demonstrates their superior performance on math reasoning benchmarks.

AI researchHybrid TrainingLLM

0 likes · 11 min read

Why SFT and RL Are Two Sides of the Same Coin: A Unified Gradient Theory for LLM Post‑Training