From PPO to SAPO: Evolution of Large‑Model Reinforcement Learning Algorithms

This article systematically reviews the main reinforcement‑learning algorithms—PPO, GRPO, DAPO, GSPO, and SAPO—used for fine‑tuning large language models, explaining why supervised fine‑tuning precedes RL, how each method improves training efficiency and stability, and what trade‑offs they entail.


PPO (Proximal Policy Optimization)

PPO is a classic on‑policy RL algorithm for large‑model fine‑tuning. It samples trajectories with the current (old) policy, computes token‑level advantages, and updates the new policy under a hard clipping constraint that keeps the importance‑sampling ratio close to 1. The clipped surrogate objective is

L^{CLIP}(\theta)=\mathbb{E}_t\big[\min\big(r_t(\theta)A_t,\;\text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)A_t\big)\big]
\text{where } r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}
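As a concrete reference, here is a minimal PyTorch‑style sketch of the clipped surrogate loss; the tensor names (per‑token log‑probabilities and advantages) are placeholders rather than any particular library's API.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss; all inputs are per-token 1-D tensors of equal length."""
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The objective is maximized, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
```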

Generalized Advantage Estimation (GAE) is used to compute token‑level advantage:

A_t = \sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l},\quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
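A minimal sketch of GAE, assuming per‑token rewards and value predictions are already available as 1‑D tensors:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Token-level GAE.
    rewards: [T] per-token rewards; values: [T+1] value predictions,
    where the last entry is the bootstrap value after the final token."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae                           # discounted sum of deltas
        advantages[t] = gae
    return advantages
```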

PPO training loop:

Rollout: generate responses for a batch of prompts.

Reward: compute a sequence‑level reward (via a reward model or rule‑based scoring).

Value estimation: a value head predicts the expected future reward for each token.

Advantage: compute token‑level advantage with GAE.

Update value head.

Update policy using the clipped surrogate loss.
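To make the loop concrete, here is a schematic sketch of one PPO iteration; `generate`, `reward_fn`, `value_head`, `sequence_reward_to_tokens`, `update_value_head`, and `update_policy` are hypothetical placeholders for whatever generation, reward‑scoring, and optimizer code a given framework provides, and `gae_advantages` refers to the GAE sketch above.

```python
def ppo_iteration(policy, policy_old, value_head, reward_fn, prompts):
    # 1) Rollout: sample responses with the current (old) policy.
    responses, logp_old = policy_old.generate(prompts)            # hypothetical API

    # 2) Reward: one scalar per sequence (reward model or rule-based scoring).
    seq_rewards = reward_fn(prompts, responses)                   # hypothetical API

    # 3) Value estimation: per-token expected future reward.
    values = value_head(prompts, responses)                       # hypothetical API

    # 4) Advantage: token-level GAE, with the sequence reward credited to the
    #    final token (a common convention; other per-token rewards are zero).
    advantages = [
        gae_advantages(sequence_reward_to_tokens(r, len(v) - 1), v)  # hypothetical helper
        for r, v in zip(seq_rewards, values)
    ]

    # 5) Update the value head toward the observed returns.
    update_value_head(value_head, values, advantages)             # hypothetical

    # 6) Update the policy with the clipped surrogate loss (see ppo_clip_loss above).
    update_policy(policy, prompts, responses, logp_old, advantages)  # hypothetical
```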

GRPO (Group Relative Policy Optimization)

GRPO removes the value model entirely. For each prompt it samples G responses, computes their rewards, and uses the group mean as a baseline: each response's advantage is its reward minus that mean (optionally normalized by the group's standard deviation), and the advantage is shared by all tokens of the response. A KL‑divergence penalty against the reference policy is retained to prevent reward hacking.

GRPO objective (simplified):

A_i = R(\tau_i)-\bar{R},\quad \bar{R}=\frac{1}{G}\sum_{j=1}^{G}R(\tau_j)
L^{GRPO}(\theta)=\mathbb{E}\Big[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\min\big(r_{i,t}(\theta)A_i,\;\text{clip}(r_{i,t}(\theta),1-\epsilon,1+\epsilon)A_i\big)\Big]-\beta\,\text{KL}\big(\pi_\theta\,\|\,\pi_{ref}\big)
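As a concrete illustration of the group baseline, a minimal sketch that turns the G scalar rewards of one prompt into group‑relative advantages (normalizing by the group standard deviation is optional but common):

```python
import torch

def grpo_advantages(group_rewards, normalize=True, eps=1e-6):
    """group_rewards: [G] scalar rewards for G responses to the same prompt.
    Returns one advantage per response; every token of response i shares A[i]."""
    baseline = group_rewards.mean()
    advantages = group_rewards - baseline
    if normalize:
        advantages = advantages / (group_rewards.std() + eps)
    return advantages
```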

Key benefits:

No value head → lower memory and compute cost.

Advantage estimation no longer depends on a potentially unstable value model.

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)

DAPO builds on GRPO and introduces four engineering tricks to improve stability and efficiency for long‑sequence generation.

Clip Higher : the lower and upper clipping bounds are decoupled and the upper bound is raised (from 1+ε to 1+ε_high with ε_high > ε_low), which prevents entropy collapse and keeps exploration alive.

Dynamic Sampling : prompts whose sampled group yields zero advantage (all responses correct or all wrong, hence identical rewards) are filtered out and replaced before the gradient step, focusing learning on informative samples.
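A minimal sketch of this filter, assuming rule‑based rewards so that groups with identical rewards carry no learning signal; the data layout is illustrative:

```python
def filter_uninformative_groups(prompt_groups):
    """prompt_groups: list of (prompt, responses, rewards) triples, where
    rewards is a list of G scalar rewards for one prompt's sampled group.
    Groups with identical rewards (all correct or all wrong) produce zero
    group-relative advantage, so they are dropped before the gradient step."""
    return [
        (prompt, responses, rewards)
        for prompt, responses, rewards in prompt_groups
        if max(rewards) > min(rewards)   # keep only groups with a reward contrast
    ]
```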

Token‑Level Policy Gradient Loss : instead of averaging the loss within each response and then across responses (which dilutes the contribution of every token in a long response), DAPO pools all tokens of all sampled responses and averages once, so that every token carries equal weight in the loss, i.e.,

L^{token}= -\frac{1}{\sum_{i=1}^{G}T_i}\sum_{i=1}^{G}\sum_{t=1}^{T_i} A_i\,\log \pi_\theta(a_{i,t}|s_{i,t})
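To make the difference concrete, a small sketch contrasting per‑sample averaging with token‑level pooling, assuming the per‑token losses for each response in the group have already been computed:

```python
import torch

def sample_level_loss(per_token_losses):
    """GRPO-style normalization: average within each response, then across
    responses, so a long response's tokens are individually down-weighted."""
    return torch.stack([losses.mean() for losses in per_token_losses]).mean()

def token_level_loss(per_token_losses):
    """DAPO-style normalization: pool all tokens and average once, so every
    token contributes equally regardless of the length of its response."""
    return torch.cat(per_token_losses).mean()
```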

Overlong Reward Shaping : responses longer than a predefined limit are masked out of the loss, and a length‑aware penalty is added to the reward:

R' = R - \lambda\,\max(0,\;|response|-L_{max})
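A minimal sketch of the length penalty above; the threshold and penalty weight are illustrative values, not the paper's settings:

```python
def shape_overlong_reward(reward, response_len, max_len=4096, penalty=0.001):
    """Length-aware reward shaping: subtract a penalty proportional to how far
    the response exceeds max_len; responses within the limit are untouched."""
    overflow = max(0, response_len - max_len)
    return reward - penalty * overflow
```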

GSPO (Group Sequence Policy Optimization)

GSPO lifts the importance‑sampling ratio and advantage from token level to the whole sequence, which greatly stabilizes training for mixture‑of‑experts (MoE) models where token‑level ratios can fluctuate dramatically.

Sequence‑level importance ratio (length‑normalized, i.e., the geometric mean of the token ratios, so that its scale does not blow up or vanish with sequence length):

r_{seq}=\left(\frac{\prod_{t=1}^{T}\pi_\theta(a_t|s_t)}{\prod_{t=1}^{T}\pi_{\theta_{old}}(a_t|s_t)}\right)^{1/T}

The sequence‑level advantage is computed as in GRPO: the response's reward minus the group mean (optionally normalized by the group's standard deviation). The GSPO surrogate loss keeps the same clipping form but applies it to r_seq:

L^{GSPO}=\mathbb{E}\big[\min(r_{seq}A_{seq},\;\text{clip}(r_{seq},1-\epsilon,1+\epsilon)A_{seq})\big]
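A minimal sketch computing the length‑normalized sequence ratio from summed per‑token log‑probabilities and applying the clipped loss; the inputs are assumed to belong to a single sampled response:

```python
import torch

def gspo_loss(logp_new, logp_old, seq_advantage, eps=0.2):
    """logp_new, logp_old: [T] per-token log-probs of one sampled response
    under the new and old policies; seq_advantage: scalar advantage shared
    by the whole sequence."""
    T = logp_new.shape[0]
    # Length-normalized sequence ratio = geometric mean of the token ratios.
    r_seq = torch.exp((logp_new.sum() - logp_old.sum()) / T)
    unclipped = r_seq * seq_advantage
    clipped = torch.clamp(r_seq, 1.0 - eps, 1.0 + eps) * seq_advantage
    return -torch.min(unclipped, clipped)
```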

Advantages:

Gradient direction is no longer distorted by token‑level ratio spikes; the ratio only scales gradient magnitude.

Empirically, GSPO converges faster on MoE models such as Qwen3‑30B‑A3B‑Base.

SAPO (Soft Adaptive Policy Optimization)

SAPO refines GSPO by replacing the hard clipping with a smooth soft‑gate and by applying asymmetric temperature scaling to positive and negative advantages.

Soft gate : a sigmoid‑shaped gate replaces the hard clip, e.g.,

g(r)=\frac{1}{1+\exp(-\alpha (r-1))}

The surrogate loss becomes

L^{SAPO}=\mathbb{E}\big[g(r_{seq})\,A_{seq}\big]

Asymmetric temperature scaling : positive advantages are scaled by a temperature τ⁺ and negative advantages by τ⁻ (τ⁻ < τ⁺), which shrinks large negative advantages and dampens their destabilizing effect:

\tilde{A}=\begin{cases} \tau^+ A & A>0 \\ \tau^- A & A<0 \end{cases}

When training satisfies (A1) small‑step / near on‑policy updates (i.e., r_seq ≈ 1) and (A2) low intra‑sequence dispersion (token‑level ratios vary little within a sequence), SAPO behaves like GSPO; otherwise it gracefully degrades to the more robust GRPO‑like behavior.
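A minimal sketch of the soft gate and the asymmetric advantage scaling described above; inputs are assumed to be tensors, and α, τ⁺, and τ⁻ are illustrative values whose exact parameterization in the SAPO formulation may differ:

```python
import torch

def soft_gate(r_seq, alpha=10.0):
    """g(r) = 1 / (1 + exp(-alpha * (r - 1))): a smooth, sigmoid-shaped
    replacement for the hard clip, centered at r = 1."""
    return torch.sigmoid(alpha * (r_seq - 1.0))

def scale_advantage(adv, tau_pos=1.0, tau_neg=0.5):
    """Asymmetric temperature scaling: negative advantages are shrunk more
    (tau_neg < tau_pos) to soften their destabilizing effect."""
    return torch.where(adv > 0, adv * tau_pos, adv * tau_neg)

def sapo_loss(r_seq, seq_advantage):
    """Surrogate L = E[g(r_seq) * A~]; negated so it can be minimized."""
    return -(soft_gate(r_seq) * scale_advantage(seq_advantage))
```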

Overall Evolution

The five algorithms form a progressive chain for RL‑based fine‑tuning of large language models:

PPO : baseline with hard clipping and a value head.

GRPO : eliminates the value head, using empirical advantage from multiple sampled trajectories.

DAPO : adds higher clipping, dynamic sampling, token‑level loss, and length‑aware reward shaping to improve efficiency on long sequences.

GSPO : moves importance ratio and advantage to the sequence level, stabilizing training especially for MoE architectures.

SAPO : introduces a soft gate and asymmetric temperature scaling, providing a smooth transition between GSPO‑style on‑policy updates and the more robust GRPO behavior.

Tags: large language models, model fine-tuning, GRPO, PPO, RL, SAPO
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
