Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability
The paper identifies GRPO's token-level importance weighting as a source of high-variance noise that destabilizes large-scale language model RL training. It proposes GSPO, a sequence-level importance-sampling method that matches the unit at which rewards are defined, stabilizes gradients, and delivers higher training efficiency and better performance across benchmarks.
Introduction
Applying Group Relative Policy Optimization (GRPO) to very large language models often leads to training instability. The paper traces this instability to GRPO's token-level importance weights, which inject high-variance noise that accumulates with response length and is amplified by clipping, eventually causing model collapse.
Motivation
During the RL phase, a large rollout batch is split into mini-batches for gradient updates, so later mini-batches are optimized on off-policy samples. PPO-style clipping limits the damage from extreme off-policy samples, but GRPO's importance weighting is flawed at a more fundamental level: it applies a per-token ratio from a single sampled trajectory as if it were an expectation over many samples from the behavior distribution, which is not what importance sampling licenses.
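To see the mismatch concretely, compare the importance-sampling identity, which holds in expectation over many draws from the behavior policy, with the per-token ratio GRPO applies to a single sampled trajectory (the notation here is a sketch rather than the paper's exact formulation: $x$ is the query, $y_{i,t}$ the $t$-th token of the $i$-th response):

$$\mathbb{E}_{z \sim \pi_{\mathrm{beh}}}\!\left[\frac{\pi_{\mathrm{tar}}(z)}{\pi_{\mathrm{beh}}(z)}\, f(z)\right] = \mathbb{E}_{z \sim \pi_{\mathrm{tar}}}\big[f(z)\big], \qquad w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})}.$$

With only one draw at each token position, $w_{i,t}$ cannot correct the distribution mismatch; it merely rescales each token's gradient by a noisy factor, and this noise compounds over long responses.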
Algorithm (GSPO)
Group Sequence Policy Optimization (GSPO) replaces token-level importance weights with a single sequence-level importance weight: the likelihood ratio of the whole response under the current versus the behavior policy, normalized by response length. This matches the reward definition, which scores an entire sequence. Gradients are computed at the sequence level, and clipping is applied to the whole-response ratio, yielding a single coherent clipping threshold.
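Concretely, the sequence-level ratio in the paper is the length-normalized likelihood ratio of the full response, and the clipped objective mirrors PPO's but at the sequence level ($\hat{A}_i$ is the group-normalized advantage, $G$ the group size):

$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}, \qquad \mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i\Big)\right].$$

The $1/|y_i|$ exponent keeps $s_i$ on a comparable numerical scale across responses of different lengths, which is what makes a single clipping range $[1-\varepsilon, 1+\varepsilon]$ meaningful.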
GSPO‑token variant
For scenarios such as multi-turn RL where fine-grained, per-token adjustments are desired, GSPO-token retains the same theoretical gradient as GSPO but operates at the token level by treating the sequence-level weight as a constant via a stop-gradient. When every token of a response shares the same advantage, the optimization objective, clipping condition, and gradient are numerically identical to GSPO's.
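In the paper's formulation, this is achieved by re-expressing the sequence ratio per token with a stop-gradient $\mathrm{sg}[\cdot]$, so the value stays $s_i(\theta)$ while gradients flow through each token's likelihood:

$$s_{i,t}(\theta) = \mathrm{sg}\big[s_i(\theta)\big] \cdot \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\mathrm{sg}\big[\pi_\theta(y_{i,t} \mid x, y_{i,<t})\big]}.$$

Numerically $s_{i,t}(\theta) = s_i(\theta)$, so clipping behaves exactly as in GSPO, while a per-token advantage can be substituted when finer-grained credit assignment is needed.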
Experiments
Results
Experiments use a cold-start model obtained by SFT from Qwen3-30B-A3B-Base. Two settings are compared: GSPO (without routing replay) and GRPO combined with a routing-replay strategy. Training-reward curves and downstream performance on AIME24, LiveCodeBench, and CodeForces (Elo rating) are reported.
On this Qwen3-based model, GSPO achieves consistently higher training efficiency than GRPO.
Clip‑ratio observations
GSPO clips entire response sequences, whereas GRPO clips only the individual tokens it judges to be excessively off-policy; the fraction of clipped tokens over the course of training is visualized.
Although GSPO clips far more tokens, it still achieves higher training efficiency, indicating that the learning signal it retains is more reliable than GRPO's noisy token-level estimates.
Effect in MoE training
Background: In Mixture-of-Experts (MoE) models, GRPO's token-level importance weights become unstable because the experts activated for the same input can change drastically after each gradient update; the token-level ratios then diverge and training collapses.
Prior work mitigates this with a routing-replay strategy, sketched below: the experts activated for each sample are recorded at rollout time and replayed when computing importance weights, so that numerator and denominator are evaluated under the same routing.
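A minimal sketch of the caching side of this strategy, assuming a top-2 router and per-token expert indices; `RoutingCache` and all other names here are hypothetical, and a real implementation would hook the router inside each MoE layer rather than use a standalone dict:

```python
# Illustrative sketch of routing replay (hypothetical names throughout).

class RoutingCache:
    """Caches, per sample, the expert indices chosen at rollout time."""
    def __init__(self):
        self._routes = {}

    def record(self, sample_id, expert_indices):
        # Called once during rollout under the behavior policy.
        self._routes[sample_id] = expert_indices

    def replay(self, sample_id):
        # Called when the current policy recomputes log-probs, forcing the
        # same experts so the importance ratio reflects parameter change
        # only, not a change in which experts were activated.
        return self._routes[sample_id]

cache = RoutingCache()
cache.record(sample_id=0, expert_indices=[[3, 7], [1, 4], [3, 2]])  # one pair per token
assert cache.replay(0) == [[3, 7], [1, 4], [3, 2]]
```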
GSPO achieves comparable or better stability and efficiency without routing replay's extra memory and communication overhead: the sequence-level likelihood averages over all tokens of a response, so it is far less sensitive to routing fluctuations than any individual token likelihood.
References
Qwen Team. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025.
MiniMax. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint.
Code example
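Below is a minimal PyTorch sketch of the sequence-level clipped objective described above. It is illustrative rather than the authors' implementation: the function name `gspo_loss`, the tensor layout, and the toy data are assumptions.

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Sequence-level clipped objective (GSPO-style).

    logp_new:   [G, T] per-token log-probs under the current policy (with grad)
    logp_old:   [G, T] per-token log-probs under the behavior policy (no grad)
    advantages: [G]    group-normalized advantage, one scalar per response
    mask:       [G, T] 1.0 for real tokens, 0.0 for padding
    """
    lengths = mask.sum(dim=-1)  # |y_i|, number of real tokens per response
    # Length-normalized sequence log-ratio:
    #   log s_i = (1 / |y_i|) * sum_t (logp_new - logp_old)
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    s = torch.exp(log_ratio)  # sequence-level importance ratio s_i
    unclipped = s * advantages
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps) * advantages
    # The paper's objective is maximized; return its negative for a minimizer.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for real rollouts.
G, T = 4, 16
logp_old = -torch.rand(G, T)
logp_new = (logp_old + 0.05 * torch.randn(G, T)).requires_grad_()
rewards = torch.randn(G)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
mask = torch.ones(G, T)

loss = gspo_loss(logp_new, logp_old, advantages, mask)
loss.backward()
print(f"loss = {loss.item():.4f}")
```

Note that the ratio, clipping, and advantage all live at the sequence level, so no per-token weighting appears anywhere in the loss.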