How GSPO Improves Stability in Large Language Model Training
GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces token‑level GRPO with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long‑sequence and MoE architectures, by aligning reward granularity and reducing variance.
