Data Party THU
Aug 7, 2025 · Artificial Intelligence
Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability
The paper identifies GRPO's token‑level importance weighting as a source of high‑variance noise that destabilizes RL training of large language models. It proposes GSPO, which applies importance sampling at the sequence level so that the sampling granularity matches how rewards are defined, stabilizing gradients and yielding higher training efficiency and better performance across benchmarks.
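To make the distinction concrete, here is a minimal sketch (with made‑up per‑token log‑probabilities) contrasting GRPO‑style per‑token importance ratios with a GSPO‑style length‑normalized sequence‑level ratio, i.e. the geometric mean of the token ratios:

```python
import math

def grpo_token_ratios(logp_new, logp_old):
    """GRPO-style: one importance ratio per token.
    Each ratio is pi_new(y_t)/pi_old(y_t); variance grows with sequence length."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def gspo_sequence_ratio(logp_new, logp_old):
    """GSPO-style: a single length-normalized sequence-level ratio,
    (pi_new(y)/pi_old(y))^(1/T), the geometric mean of the token ratios."""
    T = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / T)

# Toy per-token log-probs under the current and behavior policies (illustrative numbers).
logp_new = [-1.0, -2.0, -0.5, -3.0]
logp_old = [-1.2, -1.8, -0.6, -2.5]

token_ratios = grpo_token_ratios(logp_new, logp_old)   # four separate ratios
seq_ratio = gspo_sequence_ratio(logp_new, logp_old)    # one ratio for the sequence
```

Note that a single outlier token ratio moves every GRPO update it touches, whereas the sequence‑level ratio averages token‑level noise in log space, which is the stability argument the post summarizes.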
GRPO · GSPO · RL
