Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.

Data Party THU

01. A Brief Overview of PPO

PPO was initially applied to large language models (LLMs) to align model outputs with human preferences by adjusting the policy (the actor model) during reinforcement learning from human feedback (RLHF). The training pipeline involves collecting human‑ranked data, training a reward model (RM) and a value model, and then optimizing the actor model using a PPO‑style objective.

The actor loss can be expressed as:

\[
\mathcal{L}_{actor} = -\mathbb{E}_{\pi_{old}}\left[\frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}\hat{A}(s,a)\right]
\]

where the advantage \(\hat{A}\) is estimated with, for example, one‑step TD, Monte Carlo returns, or GAE. A KL‑divergence penalty between the current policy and a reference policy is also commonly added to limit policy drift.
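In practice PPO uses the clipped form of this surrogate. A minimal NumPy sketch (function name and the default clip range are illustrative, not from the original pipeline):

```python
import numpy as np

def ppo_actor_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss, negated so that gradient descent
    maximizes the expected advantage-weighted probability ratio."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # pi_new / pi_old
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Taking the element-wise minimum keeps the objective pessimistic:
    # large ratio moves cannot inflate the surrogate beyond the clip range.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```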

02. GRPO (Group Relative Policy Optimization)

GRPO simplifies PPO for LLMs by removing the value model and using a sequence‑level reward. The objective becomes:

\[
J_{GRPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\ \text{clip}\left(r_{i,t}(\theta),1-\varepsilon,1+\varepsilon\right)\hat{A}_{i,t}\right)\right] - \beta\, D_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)
\]

where \(r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{old}(o_{i,t}\mid q, o_{i,<t})}\) is the token‑level importance ratio over a group of \(G\) sampled outputs \(o_1,\dots,o_G\), and the group‑normalized advantage \(\hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}\) is shared by every token of output \(o_i\).

Key issues identified in GRPO include:

Entropy collapse due to uniform advantage values across all tokens in a sequence.

Heavy reliance on a single reward function, which can be noisy or biased.

Importance‑sampling correction operates at token level while the reward is sequence‑level, causing a mismatch.

These problems motivate the subsequent algorithms.
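The group‑relative advantage that underlies the first issue (every token in a sequence shares one value) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: sample G outputs per prompt, score each with a
    sequence-level reward, and normalize within the group. The resulting
    scalar is then broadcast to every token of the corresponding sequence."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Note that if every output in the group receives the same reward, all advantages are zero and the group contributes no gradient, a fact DAPO's dynamic sampling exploits.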

03. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)

DAPO, proposed by ByteDance, addresses GRPO’s shortcomings for long‑chain (Chain‑of‑Thought) outputs. Its main innovations are:

Removing KL‑divergence: For long reasoning sequences the model is expected to diverge significantly from the initial policy, so the KL constraint is unnecessary.

Clip‑Higher: The clipping range in the PPO surrogate loss is split into separate upper and lower bounds, allowing a larger upper bound to prevent over‑clipping of high‑probability tokens.

Dynamic Sampling: Batches where all sampled sequences are either completely correct or completely wrong are discarded and replaced, ensuring each batch contributes meaningful gradient information.

Token‑Level Loss Weighting: Instead of first averaging the loss within each sequence and then across sequences (which down‑weights tokens in long sequences), DAPO averages over all tokens in the batch, so every token contributes equally regardless of sequence length.
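The difference between sequence‑level and token‑level averaging can be made concrete with a small sketch (helper names are illustrative):

```python
import numpy as np

def grpo_seq_mean(token_losses):
    """GRPO-style: average within each sequence, then across sequences.
    Tokens in long sequences are down-weighted by 1/|sequence|."""
    return float(np.mean([np.mean(t) for t in token_losses]))

def dapo_token_mean(token_losses):
    """DAPO-style: average over all tokens in the batch, so every
    token carries equal weight regardless of sequence length."""
    flat = np.concatenate([np.asarray(t, dtype=float) for t in token_losses])
    return float(flat.mean())
```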

Key formulas (illustrated as images) include the modified clipping function and the dynamic penalty for length truncation:

\[
\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\ \text{clip}\left(r_{i,t}(\theta),\,1-\varepsilon_{low},\,1+\varepsilon_{high}\right)\hat{A}_{i,t}\right)
\]

and the soft overlong (length) penalty:

\[
R_{length}(y)=\begin{cases}0, & |y|\le L_{max}-L_{cache}\\[2pt] \dfrac{(L_{max}-L_{cache})-|y|}{L_{cache}}, & L_{max}-L_{cache}<|y|\le L_{max}\\[2pt] -1, & |y|>L_{max}\end{cases}
\]

Empirical results show that DAPO converges faster than vanilla GRPO, especially on tasks requiring extensive reasoning.
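Clip‑Higher and the dynamic‑sampling filter can be sketched as follows; the \(\varepsilon\) defaults follow values reported for DAPO, but treat the names and numbers as illustrative:

```python
import numpy as np

def clip_higher_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Decoupled clipping: a wider upper bound (eps_high > eps_low) leaves
    more room to raise the probability of useful low-probability tokens."""
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)

def keep_group(rewards):
    """Dynamic sampling: discard a group whose rewards are all identical
    (all-correct or all-wrong) -- its group-normalized advantages are all
    zero, so it contributes no gradient signal."""
    return len(set(rewards)) > 1
```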

04. GSPO (Group Sequence Policy Optimization)

GSPO, introduced by the Qwen‑3 team, targets both the token‑level reward‑action mismatch and the instability of importance‑sampling in mixture‑of‑experts (MoE) models. Its core idea is to move the importance‑sampling correction from token granularity to sequence granularity.

The GSPO objective is:

\[
J_{GSPO}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(s_i(\theta)\hat{A}_i,\ \text{clip}\left(s_i(\theta),1-\varepsilon,1+\varepsilon\right)\hat{A}_i\right)\right]
\]

where the importance‑sampling term is the geometric mean of token‑level ratios across the whole sequence:

\[
s_i(\theta)=\left(\frac{\pi_\theta(y_i\mid x)}{\pi_{old}(y_i\mid x)}\right)^{1/|y_i|}=\exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{old}(y_{i,t}\mid x, y_{i,<t})}\right)
\]

This design reduces variance caused by extreme token‑level ratios, especially in MoE models where different experts are activated for old and new policies. By aggregating at the sequence level, GSPO provides a more stable gradient while still penalizing or rewarding entire outputs appropriately.
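Numerically, the sequence‑level ratio is just the exponential of the mean token log‑ratio, which is how one would compute the geometric mean in practice (a sketch; the function name is illustrative):

```python
import numpy as np

def gspo_sequence_ratio(logp_new, logp_old):
    """Sequence-level importance weight: the geometric mean of token-level
    ratios, computed in log space for numerical stability. Length
    normalization (the 1/|y| exponent) keeps the weight comparable
    across sequences of different lengths."""
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    return float(np.exp(log_ratio.mean()))
```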

Gradient analysis shows that GSPO’s per‑token contribution becomes uniform within a sequence, effectively behaving like a cross‑entropy loss scaled by the sequence‑level advantage.

Key Takeaways

GRPO simplifies PPO for LLMs but suffers from entropy collapse, reward noise, and token‑level importance‑sampling variance.

DAPO mitigates these issues for long‑chain reasoning by removing KL constraints, expanding the clipping upper bound, dynamically sampling batches, and applying token‑level weighting.

GSPO further resolves the reward‑action granularity mismatch by moving importance‑sampling to the sequence level, which stabilizes training for both dense and MoE architectures.

Tags: LLM, Reinforcement Learning, RLHF, Importance Sampling, GRPO, PPO, GSPO, DAPO
Written by Data Party THU, the official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.