What Advances Do GRPO, DAPO, GSPO, and SAPO Bring Over PPO?

After DPO, the typical research trajectory moves through GRPO, DAPO, GSPO, and SAPO, each introducing new optimization objectives, sampling strategies, and reward‑shaping techniques that aim to reduce memory usage, improve gradient stability, and enhance the efficiency of large‑model reinforcement learning.


Background and Progression

Following Direct Preference Optimization (DPO), the common development path for large-model reinforcement learning proceeds through GRPO → DAPO → GSPO → SAPO. With the theoretical foundation of PPO already established, extending to these methods becomes considerably easier.
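For reference, everything below modifies some piece of PPO's clipped surrogate objective, in which each sampled token is weighted by the importance ratio between the current policy and the rollout policy, and that ratio is hard-clipped to a trust region. A minimal per-token sketch in PyTorch (tensor names are illustrative):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate, applied per token.

    logp_new / logp_old: log-probs of the sampled tokens under the current
    policy and the rollout (old) policy, shape (num_tokens,).
    advantages: advantage estimates for the same tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                      # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximize the surrogate, i.e. minimize its negative mean
    return -torch.mean(torch.minimum(unclipped, clipped))
```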

1. Improvements of GRPO over PPO

GRPO defines a new optimization objective that differs from PPO's standard objective (the exact formula is omitted here). It adopts group sampling and a rule-based reward function: group-relative advantages replace the learned critic, and the rule-based reward replaces a separate reward model, so training only needs to load two models (the actor and a reference model), reducing GPU memory consumption for large models. The coverage of good and bad samples depends on the rollout process; with enough rollout samples, most cases can be covered.
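A minimal sketch of the group-sampling idea, assuming a rule-based scorer has already assigned one reward to each of the G responses sampled for a prompt: each response's advantage is simply its reward standardized within its own group, so no value model is needed (names are illustrative):

```python
import torch

def grpo_group_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style.

    rewards: tensor of shape (num_prompts, group_size), one rule-based
    reward per sampled response; no critic/value model is involved.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)   # broadcasts over the group

# Example: one prompt, four rollouts scored 1/0 by an exact-match rule
adv = grpo_group_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]]))
```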

Because data organization for broad case coverage is cheaper than in DPO, GRPO has become popular. However, the rollout stage remains the most time‑consuming part of training.

2. Improvements of DAPO over GRPO

DAPO introduces several concrete enhancements (a combined sketch of the loss and the sampling filter follows the list):

Normalization coefficient (Token-Level Policy Gradient Loss): normalizes the loss by the total number of tokens in the group rather than averaging per response first, so long and short responses contribute in proportion to their length; this stabilizes gradients when group members have widely varying response lengths.

Asymmetric clipping (Clip-Higher): decouples the lower and upper clipping bounds and raises the upper one, giving low-probability tokens more room to grow and helping prevent entropy collapse, which provides finer control over policy updates.

Dynamic Sampling: after sampling, discards prompts whose group rewards are all identical (e.g. all 0 or all 1). When every sample in a group shares the same reward, the group-relative advantage is zero and there is no gradient update. Early training often produces all-zero groups and later stages many all-one groups; filtering them out keeps every retained group informative and avoids spending updates on zero-gradient samples.

Overlong Reward Shaping: instead of handing overly long (truncated) outputs a harsh zero or penalty reward, it applies a soft, length-aware penalty as responses approach and exceed the length limit, reducing reward noise and encouraging shorter, more effective generations.
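A minimal sketch of two of these pieces under simple assumptions (tensor names and the specific epsilon values are illustrative): the token-level loss normalized by total token count with asymmetric Clip-Higher bounds, and the dynamic-sampling filter that drops groups with uniform rewards.

```python
import torch

def dapo_token_level_loss(logp_new, logp_old, advantages, mask,
                          eps_low=0.2, eps_high=0.28):
    """Token-level policy-gradient loss with asymmetric (Clip-Higher) clipping.

    All tensors have shape (batch, max_len); `mask` is 1 on real response
    tokens and 0 on padding.  The loss is normalized by the total token
    count of the batch, not per response, so long answers are not diluted.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    per_token = torch.minimum(unclipped, clipped) * mask
    return -per_token.sum() / mask.sum().clamp(min=1)

def keep_informative_groups(rewards):
    """Dynamic sampling: keep only prompts whose group rewards are not all
    identical (all 0 or all 1), since uniform rewards give zero advantage.

    rewards: (num_prompts, group_size) -> boolean mask over prompts.
    """
    return rewards.max(dim=1).values != rewards.min(dim=1).values
```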

3. Improvements of GSPO over GRPO

GSPO's objective modifies the importance-sampling coefficient: instead of per-token probability ratios, it uses a sequence-level ratio computed from the probability of the entire response (with length normalization), which is especially beneficial for MoE models during RL training.
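A minimal sketch of that sequence-level ratio, assuming per-token log-probs from the current and rollout policies are available: the ratio of whole-response probabilities is length-normalized (a geometric mean of token ratios), computed in log space for stability (names are illustrative):

```python
import torch

def gspo_sequence_ratio(logp_new, logp_old, mask):
    """Length-normalized sequence-level importance ratio in the GSPO style:
    s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|),
    i.e. exp of the mean per-token log-ratio over the response.

    logp_new / logp_old: (batch, max_len) token log-probs; mask marks real tokens.
    """
    log_ratio = (logp_new - logp_old) * mask
    seq_len = mask.sum(dim=1).clamp(min=1)
    return torch.exp(log_ratio.sum(dim=1) / seq_len)   # one ratio per response
```

Clipping then acts on this single per-response ratio, so all tokens of a response are kept or discarded together, which makes the update far less sensitive to per-token probability noise (for example from volatile MoE expert routing).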

Because token-level probabilities can differ between the rollout engine and the training engine (and between the old and current policies), MoE expert-routing decisions can flip between the two forward passes. The GSPO paper describes a Routing Replay strategy that caches the routing decisions and replays them when recomputing probabilities, keeping token-level ratios consistent; token-level methods such as GRPO rely on this workaround to stay stable on MoE models, while GSPO's sequence-level ratio removes the need for it.

4. Improvements of SAPO over GRPO and GSPO

SAPO replaces the hard clipping operation with a soft-control approach (an illustrative sketch follows the list), offering:

Token‑level soft trust regions for finer control.

Asymmetric temperature design, applying different temperatures to positive and negative tokens.
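The exact SAPO objective is given in the paper; purely as an illustration of the general idea (a smooth gate on the importance ratio instead of a hard clip, with separate temperatures by advantage sign), here is a sketch in which the gate shape, the temperature values, and all names are assumptions rather than SAPO's published formula:

```python
import torch

def soft_trust_region_loss(logp_new, logp_old, advantages,
                           tau_pos=1.0, tau_neg=0.5):
    """Illustrative soft trust region (NOT the exact SAPO objective).

    A sigmoid-shaped gate smoothly down-weights tokens whose importance
    ratio drifts away from 1, instead of cutting gradients off at a hard
    clip boundary; positive- and negative-advantage tokens get different
    temperatures, mirroring the asymmetric design described above.
    """
    log_ratio = logp_new - logp_old
    tau = torch.where(advantages >= 0,
                      torch.full_like(advantages, tau_pos),
                      torch.full_like(advantages, tau_neg))
    # gate == 1 when the ratio is exactly 1 (log_ratio == 0) and decays
    # smoothly toward 0 as the policy moves away from the rollout policy.
    gate = 2 * torch.sigmoid(-torch.abs(log_ratio) / tau)
    ratio = torch.exp(log_ratio)
    return -torch.mean(gate * ratio * advantages)
```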

References

1. DeepSeekMath (GRPO): https://arxiv.org/pdf/2402.03300
2. DAPO: https://arxiv.org/pdf/2503.14476
3. GSPO: https://arxiv.org/abs/2507.18071
4. SAPO: http://arxiv.org/pdf/2511.20347