How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement-learning algorithm for LLMs that replaces GRPO's token-level importance weighting with sequence-level optimization, addressing instability in ultra-large model training, especially for long sequences and MoE architectures, by aligning optimization granularity with reward granularity and reducing variance.


The Alibaba Qwen team introduced GSPO (Group Sequence Policy Optimization), a reinforcement-learning algorithm for large language models (LLMs) designed to address the stability issues of the mainstream GRPO (Group Relative Policy Optimization) algorithm when training ultra-large models.

1. Background of GSPO

Before GSPO, algorithms like GRPO often caused model collapse when training large models, especially with long sequences or MoE (Mixture‑of‑Experts) models. The root cause is a mismatch between optimization granularity and reward granularity.

Reward granularity: Rewards are computed over the entire sequence (e.g., evaluating the quality of a whole answer).

GRPO optimization: Optimizes at the token level, assigning an “importance weight” to each token, which can fluctuate wildly across a sequence, accumulating noise and leading to unstable gradients, particularly in MoE models where expert activation varies.

2. Core Differences between GSPO and GRPO

1. Nature of the Optimization Unit

(1) GRPO’s token‑level optimization: Calculates an importance weight for each token, leading to high‑variance noise and a mismatch between token‑level updates and sequence‑level rewards (a minimal sketch follows below this list).

High‑variance noise: Weight fluctuations for individual tokens (e.g., “apple”) accumulate like a snowball as the sentence grows.

Unit mismatch: Rewards are based on the whole sequence, but optimization occurs token‑wise, causing a disconnect between local adjustments and the global objective.
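To make the token‑level issue concrete, the following is a minimal sketch in PyTorch (not the Qwen team's implementation); the tensor names and toy values are illustrative assumptions.

```python
import torch

# Toy per-token log-probabilities of sampled responses under the new and old
# policies. Shape: (batch, seq_len). Values are illustrative only.
torch.manual_seed(0)
logp_old = torch.randn(4, 64) * 0.1 - 2.0
logp_new = logp_old + torch.randn(4, 64) * 0.05  # new policy has drifted slightly

# GRPO-style token-level importance weights: one independent ratio per token.
token_ratio = torch.exp(logp_new - logp_old)      # (batch, seq_len)

# Every token carries its own, independently fluctuating weight, yet the reward
# is a single number for the whole sequence. This is the granularity mismatch:
# the noise in these per-token weights accumulates over long responses.
print(token_ratio.min().item(), token_ratio.max().item(), token_ratio.var().item())
```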

(2) GSPO’s sequence‑level optimization: Treats the entire sequence as a single unit, computing a sequence‑level importance ratio, applying length normalization, and performing sequence‑level clipping (see the sketch after this list).

Sequence‑level importance ratio: Computes the ratio of the whole sequence’s generation probability under the new policy to that under the old policy.

Length normalization: Normalizes the ratio by sequence length, so that longer sequences do not amplify variance.

Sequence‑level clipping: Excludes whole sequences whose importance ratio falls outside a preset range, rather than clipping individual tokens.
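A minimal sketch of the three mechanisms above (sequence‑level ratio, length normalization, sequence‑level clipping); the function names and the clipping range eps are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch

def gspo_sequence_ratio(logp_new, logp_old, mask):
    """Length-normalized sequence-level importance ratio.

    logp_new, logp_old: per-token log-probs of the response, shape (batch, seq_len)
    mask: 1.0 for response tokens, 0.0 for padding, shape (batch, seq_len)
    """
    lengths = mask.sum(dim=-1)
    # Sum of per-token log-ratios = log of the whole-sequence likelihood ratio;
    # dividing by the length is the length-normalization (geometric-mean) step.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    return torch.exp(log_ratio)                        # shape (batch,)

def gspo_clipped_objective(seq_ratio, advantage, eps=0.2):
    # Sequence-level clipping: the whole response is kept or clipped as one
    # unit, rather than clipping token by token.
    clipped = torch.clamp(seq_ratio, 1.0 - eps, 1.0 + eps)
    return -torch.minimum(seq_ratio * advantage, clipped * advantage).mean()
```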

Intuitive comparison: GRPO is like grading each sentence of an essay separately, causing the total score to swing wildly; GSPO gives the essay a single overall score and adjusts each sentence based on that stable total.

2. Breakthrough for MoE Model Training

MoE models improve efficiency by dynamically routing tokens to expert modules, but the variability of expert activation between policy updates makes GRPO’s token‑level weights unreliable, previously requiring a costly “router replay” trick to reuse the old policy’s expert states.

GSPO focuses on sequence‑level likelihood, which remains stable despite expert switches, thereby eliminating the root cause of training instability.

3. Technical Innovations of GSPO

1. GSPO Workflow

The core workflow of GSPO during training: sample a group of responses per query, score each complete response, compute group‑relative advantages, then compute sequence‑level importance ratios with length normalization and apply sequence‑level clipping before updating the policy, ensuring stable updates.
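A rough end‑to‑end sketch of that workflow, reusing the gspo_sequence_ratio and gspo_clipped_objective helpers from the earlier snippet; sample_group, reward_fn, and the policy objects are hypothetical placeholders, not APIs from the paper or any particular library.

```python
def gspo_step(query, policy, old_policy, optimizer, group_size=8):
    # 1. Sample a group of responses to the same query from the old (behavior) policy.
    responses, masks = sample_group(old_policy, query, group_size)

    # 2. Score each full response with a single sequence-level reward.
    rewards = reward_fn(query, responses)                       # shape (group,)

    # 3. Group-relative advantage: normalize rewards within the group.
    advantage = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 4. Sequence-level importance ratios with length normalization.
    logp_new = policy.log_probs(query, responses)               # (group, seq_len)
    logp_old = old_policy.log_probs(query, responses)           # (group, seq_len)
    seq_ratio = gspo_sequence_ratio(logp_new, logp_old, masks)

    # 5. Sequence-level clipping and policy update.
    loss = gspo_clipped_objective(seq_ratio, advantage)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```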

2. Sequence‑level Importance Ratio Design

GSPO’s importance ratio is the length‑normalized sequence likelihood ratio (equivalently, the geometric mean of the per‑token importance ratios), which averages out token‑level fluctuations and reduces variance.
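Written out (notation assumed here, consistent with the description above), the ratio for a response y_i to query x is the geometric mean of the per‑token ratios:

```latex
s_i(\theta)
= \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{\frac{1}{|y_i|}}
= \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|}
   \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)
```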

3. Group Advantage Estimation and Clipping Mechanism

Responses to the same query are grouped; each group’s average reward and standard deviation are computed to obtain a relative advantage for each sequence. Sequences whose importance ratios fall outside the preset clipping range are then excluded from the gradient update at the sequence level.
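In formula form (notation assumed), for a group of G responses y_1, ..., y_G to the same query x with sequence‑level rewards r(x, y_i), the group‑relative advantage is:

```latex
\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\bigl(\{ r(x, y_j) \}_{j=1}^{G}\bigr)}
                 {\operatorname{std}\bigl(\{ r(x, y_j) \}_{j=1}^{G}\bigr)}
```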

4. GSPO‑token Flexible Extension

GSPO‑token introduces a detach operation to keep sequence‑level stability while allowing token‑level advantage adjustments for critical tokens (e.g., mathematical symbols), enabling fine‑grained control without destabilizing the whole model.
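A minimal sketch of the detach (stop‑gradient) trick described here, reusing gspo_sequence_ratio from the earlier snippet; the function name and shapes are illustrative assumptions.

```python
import torch

def gspo_token_ratio(logp_new, logp_old, mask):
    """GSPO-token style per-token ratio built from the sequence-level ratio.

    Numerically, every token's ratio equals the stable sequence-level ratio,
    but gradients flow only through that token's own probability, so each
    token can be weighted by its own advantage without re-introducing the
    token-level variance problem.
    """
    seq_ratio = gspo_sequence_ratio(logp_new, logp_old, mask)        # (batch,)
    token_prob = torch.exp(logp_new)                                  # (batch, seq_len)
    return seq_ratio.detach().unsqueeze(-1) * token_prob / token_prob.detach()
```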

4. Summary

GSPO achieves stability in large‑scale LLM training by shifting the optimization unit from individual tokens to whole sequences, offering higher training efficiency, more robust reward signals, and seamless compatibility with MoE architectures. Its sequence‑level design and the optional GSPO‑token extension make it a promising standard for future ultra‑large model training.

Tags: large language models, reinforcement learning, GRPO, sequence optimization, GSPO
Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
