Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive
This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.
01. A Brief Overview of PPO
PPO was initially applied to large language models (LLMs) to align model outputs with human preferences by adjusting the policy (the actor model) during reinforcement learning from human feedback (RLHF). The training pipeline involves collecting human‑ranked data, training a reward model (RM) and a value model, and then optimizing the actor model using a PPO‑style objective.
The actor loss can be expressed as:
Loss = -\mathbb{E}_{\pi_{old}}\left[\frac{\pi_{new}(a|s)}{\pi_{old}(a|s)}\hat{A}(s,a)\right]
where the advantage \(\hat{A}\) is computed with estimators such as the one-step TD error, Monte Carlo returns, or GAE. A KL-divergence penalty between the current policy and a reference policy is also often added to limit policy drift.
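In practice the PPO objective also clips this ratio. A minimal PyTorch sketch of the clipped token-level surrogate (the function name, tensor shapes, and eps value are illustrative assumptions, not from the article):

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # logp_new / logp_old: log-probs of the sampled tokens under the current and old policies
    # advantages: per-token advantage estimates \hat{A}
    ratio = torch.exp(logp_new - logp_old)                      # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize surrogate => minimize negation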
02. GRPO (Group Relative Policy Optimization)
GRPO simplifies PPO for LLMs by removing the value model: for each prompt, a group of responses is sampled, and each response's advantage is its sequence-level reward normalized by the group's mean and standard deviation. The objective becomes:
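The objective itself appears in the original post as an image. For readability, a sketch of the commonly cited GRPO form (token-level clipped surrogate with a group-normalized advantage; the notation is reconstructed here, not taken from the article's figure):

J_{GRPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i}\Big)\right] - \beta\,D_{KL}\big(\pi_\theta\,\|\,\pi_{ref}\big)

where r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{old}(o_{i,t}\mid q, o_{i,<t})} and \hat{A}_{i} = \frac{R_i - \mathrm{mean}(R_1,\dots,R_G)}{\mathrm{std}(R_1,\dots,R_G)}.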
Key issues identified in GRPO include:
Entropy collapse due to uniform advantage values across all tokens in a sequence.
Heavy reliance on a single reward function, which can be noisy or biased.
Importance-sampling correction operates at the token level while the reward is sequence-level, causing a granularity mismatch (see the sketch after this list).
These problems motivate the subsequent algorithms.
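To make the first and third issues concrete, here is a tiny illustrative PyTorch sketch (the rewards and sequence lengths are made-up numbers) showing how the group-normalized, sequence-level advantage is broadcast to every token while the importance-sampling ratio stays token-level:

import torch

# Hypothetical group of G = 4 responses to one prompt with 0/1 rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)       # group-normalized advantage

# Every token of response i receives the same scalar adv[i] ...
seq_lens = [7, 12, 5, 9]
token_adv = [a.expand(n) for a, n in zip(adv, seq_lens)]

# ... while the importance-sampling ratio is still computed per token
# (ratio_t = exp(logp_new_t - logp_old_t)), so a sequence-level reward is
# paired with token-level corrections: the granularity mismatch noted above.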
03. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)
DAPO, proposed by ByteDance, addresses GRPO’s shortcomings for long‑chain (Chain‑of‑Thought) outputs. Its main innovations are:
Removing KL‑divergence: For long reasoning sequences the model is expected to diverge significantly from the initial policy, so the KL constraint is unnecessary.
Clip-Higher: The clipping range in the PPO surrogate loss is decoupled into separate lower and upper bounds, and the upper bound is raised so that low-probability (exploratory) tokens can still receive meaningful probability increases, which counteracts entropy collapse.
Dynamic Sampling: Prompts whose sampled response group is entirely correct or entirely wrong are discarded and re-sampled, so every group in the batch contributes meaningful gradient information (a minimal filtering sketch follows this list).
Token-Level Loss Weighting: Instead of averaging the loss within each sequence and then across sequences (which shrinks the per-token contribution of long responses), DAPO averages over all tokens in the batch, so every token carries equal weight and long reasoning chains are not down-weighted.
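As referenced in the Dynamic Sampling item above, a minimal filtering sketch in plain Python (the group structure, reward convention, and refill loop are assumptions, not the paper's exact implementation):

def keep_group(rewards, eps=1e-6):
    # Keep a sampled group only if its rewards are not all identical
    # (i.e. not all correct and not all wrong), so the group-normalized
    # advantages are non-zero and carry gradient signal.
    return max(rewards) - min(rewards) > eps

def build_batch(prompts, sample_group, batch_size):
    # Oversample prompts and keep only informative groups until the batch is full.
    batch = []
    for prompt in prompts:
        group = sample_group(prompt)                  # e.g. G responses with their rewards
        if keep_group([reward for _, reward in group]):
            batch.append((prompt, group))
        if len(batch) == batch_size:
            break
    return batch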
Key formulas (illustrated as images) include the modified clipping function and the dynamic penalty for length truncation:
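Those images are not reproduced here; as a sketch, the decoupled-clip, token-averaged objective is commonly written as follows (notation matches the GRPO form above; the soft length-truncation penalty is omitted):

J_{DAPO}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon_{low},\,1+\epsilon_{high}\big)\,\hat{A}_{i}\Big)\right], \quad \epsilon_{high} > \epsilon_{low}

subject to the dynamic-sampling constraint that the G rewards within a group are not all identical.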
Empirical results show that DAPO converges faster than vanilla GRPO, especially on tasks requiring extensive reasoning.
04. GSPO (Group Sequence Policy Optimization)
GSPO, introduced by the Qwen‑3 team, targets both the token‑level reward‑action mismatch and the instability of importance‑sampling in mixture‑of‑experts (MoE) models. Its core idea is to move the importance‑sampling correction from token granularity to sequence granularity.
The GSPO objective keeps the PPO-style clipped surrogate, but the importance-sampling term it clips is defined at the sequence level, as the geometric mean of the token-level ratios across the whole sequence:
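The formulas appear as images in the original; a sketch of the commonly cited form (notation consistent with the earlier expressions) is:

s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{old}(y_i \mid x)}\right)^{1/|y_i|} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{old}(y_{i,t}\mid x, y_{i,<t})}\right)

J_{GSPO}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(s_i(\theta)\,\hat{A}_{i},\ \mathrm{clip}\big(s_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i}\Big)\right]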
This design reduces variance caused by extreme token‑level ratios, especially in MoE models where different experts are activated for old and new policies. By aggregating at the sequence level, GSPO provides a more stable gradient while still penalizing or rewarding entire outputs appropriately.
Gradient analysis shows that GSPO’s per‑token contribution becomes uniform within a sequence, effectively behaving like a cross‑entropy loss scaled by the sequence‑level advantage.
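As a rough PyTorch sketch of this computation (the tensor layout, masking, and clipping value are assumptions rather than the paper's exact implementation):

import torch

def gspo_sequence_ratio(logp_new, logp_old, mask):
    # logp_*: [batch, seq_len] token log-probs; mask: 1 for response tokens, 0 for padding.
    log_ratio = (logp_new - logp_old) * mask
    seq_len = mask.sum(dim=-1).clamp(min=1)
    return torch.exp(log_ratio.sum(dim=-1) / seq_len)            # geometric mean of token ratios

def gspo_loss(logp_new, logp_old, mask, advantages, eps=0.2):
    s = gspo_sequence_ratio(logp_new, logp_old, mask)            # one ratio per sequence
    clipped = torch.clamp(s, 1 - eps, 1 + eps)
    return -torch.min(s * advantages, clipped * advantages).mean()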
Key Takeaways
GRPO simplifies PPO for LLMs but suffers from entropy collapse, reward noise, and token‑level importance‑sampling variance.
DAPO mitigates these issues for long‑chain reasoning by removing KL constraints, expanding the clipping upper bound, dynamically sampling batches, and applying token‑level weighting.
GSPO further resolves the reward‑action granularity mismatch by moving importance‑sampling to the sequence level, which stabilizes training for both dense and MoE architectures.