Data Party THU
Aug 7, 2025 · Artificial Intelligence
Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability
The paper identifies GRPO's token‑level importance weighting as a source of high‑variance noise that destabilizes RL training of large language models. It proposes GSPO, which applies importance sampling at the sequence level so that the sampling granularity matches how rewards are defined, stabilizing gradients and yielding higher training efficiency and better performance across benchmarks.
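To make the distinction concrete, here is a minimal sketch (with made‑up per‑token log‑probabilities) contrasting GRPO‑style per‑token importance ratios with a GSPO‑style length‑normalized sequence‑level ratio, i.e. the geometric mean of the token ratios:

```python
import math

def grpo_token_ratios(logp_new, logp_old):
    """GRPO-style: one importance ratio per token.
    Each ratio is pi_new(y_t)/pi_old(y_t); variance grows with sequence length."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def gspo_sequence_ratio(logp_new, logp_old):
    """GSPO-style: a single length-normalized sequence-level ratio,
    (pi_new(y)/pi_old(y))^(1/T), the geometric mean of the token ratios."""
    T = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / T)

# Toy per-token log-probs under the current and behavior policies (illustrative numbers).
logp_new = [-1.0, -2.0, -0.5, -3.0]
logp_old = [-1.2, -1.8, -0.6, -2.5]

token_ratios = grpo_token_ratios(logp_new, logp_old)   # four separate ratios
seq_ratio = gspo_sequence_ratio(logp_new, logp_old)    # one ratio for the sequence
```

Note that a single outlier token ratio moves every GRPO update it touches, whereas the sequence‑level ratio averages token‑level noise in log space, which is the stability argument the post summarizes.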
GRPO · GSPO · RL
