Baobao Algorithm Notes
Jan 24, 2026 · Artificial Intelligence

What Advances Do GRPO, DAPO, GSPO, and SAPO Bring Over PPO?

After DPO, the typical research trajectory moves through GRPO, DAPO, GSPO, and SAPO; each introduces new optimization objectives, sampling strategies, and reward‑shaping techniques aimed at reducing memory usage, improving gradient stability, and making reinforcement learning for large models more efficient.
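
For context, the memory saving comes largely from GRPO dropping PPO's learned value network: it samples a group of G responses per prompt and turns their scalar rewards into advantages by normalizing within the group. A minimal sketch of that group‑relative advantage (notation assumed to follow common usage, with r_i the reward of response i):

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}
```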

DAPO · GRPO · GSPO
6 min read
Data Thinking Notes
Oct 19, 2025 · Artificial Intelligence

How GSPO Improves Stability in Large Language Model Training

GSPO (Group Sequence Policy Optimization) is a reinforcement‑learning algorithm for LLMs that replaces GRPO's token‑level importance weighting with sequence‑level optimization, addressing instability in ultra‑large model training, especially for long sequences and MoE architectures, by matching the optimization granularity to the sequence‑level reward and reducing gradient variance.
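
The granularity shift is easiest to see in the importance ratios. GRPO reweights every token individually, while GSPO uses a single length‑normalized ratio per sequence (notation as in the GSPO paper, with y_{i,<t} the prefix of response y_i):

```latex
\underbrace{w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_\mathrm{old}}(y_{i,t} \mid x, y_{i,<t})}}_{\text{GRPO: token-level}}
\qquad
\underbrace{s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\mathrm{old}}(y_i \mid x)}\right)^{1/\lvert y_i \rvert}}_{\text{GSPO: sequence-level}}
```

Because s_i averages log‑ratios over the whole response, per‑token noise tends to cancel rather than compound, which is the variance reduction the summary refers to.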

GRPO · GSPO · large language models
11 min read
Data Party THU
Sep 4, 2025 · Artificial Intelligence

Unraveling PPO Variants: From GRPO to DAPO and GSPO – A Deep Dive

This article provides a comprehensive technical analysis of PPO‑based reinforcement learning methods for large language models, detailing the evolution from the original PPO algorithm through GRPO, DAPO, and GSPO, and explaining their motivations, mathematical formulations, advantages, and practical challenges such as entropy collapse and importance‑sampling variance.
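
As a concrete anchor for the clipping machinery the article dissects, here is a minimal PyTorch sketch of the PPO‑style clipped surrogate; the decoupled eps_low/eps_high arguments reflect DAPO's "clip‑higher" remedy for entropy collapse (the function name and default values are illustrative, not any library's API):

```python
import torch

def clipped_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """PPO clipped objective at the token level. With eps_high == eps_low
    this is vanilla PPO; eps_high > eps_low is DAPO's clip-higher, which
    leaves low-probability tokens more room to grow and slows the
    collapse of policy entropy."""
    ratio = torch.exp(logp_new - logp_old)                  # r_t(theta)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
    # Pessimistic min over unclipped/clipped terms, negated for SGD.
    return -torch.min(ratio * adv, clipped * adv).mean()
```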

DAPO · GRPO · GSPO
30 min read
Data Party THU
Aug 7, 2025 · Artificial Intelligence

Why GRPO Fails on Large LLMs and How GSPO Restores Training Stability

The paper shows that GRPO's token‑level importance weighting introduces high‑variance noise that destabilizes large‑scale RL training of language models, and proposes GSPO, a sequence‑level importance‑sampling method that matches the granularity of the reward definition, stabilizes gradients, and delivers higher training efficiency and better performance across benchmarks.
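
To make the sequence‑level idea concrete, here is a hedged PyTorch sketch of a GSPO‑style objective for one prompt's group of responses; the function name, tensor shapes, and tiny clip range are illustrative assumptions (sequence‑level ratios concentrate near 1, so GSPO clips far more tightly than token‑level PPO):

```python
import torch

def gspo_style_loss(logp_new, logp_old, mask, rewards, eps=3e-4):
    """logp_*: (G, T) per-token log-probs for G responses to one prompt;
    mask: (G, T), 1.0 on response tokens; rewards: (G,) scalar rewards.
    Combines GSPO's length-normalized sequence ratio with GRPO-style
    group-relative advantages and a PPO-style clip."""
    # s_i = exp( (1/|y_i|) * sum_t log(pi_new / pi_old) ), one per sequence.
    s = torch.exp(((logp_new - logp_old) * mask).sum(-1)
                  / mask.sum(-1).clamp(min=1.0))
    # Group-relative advantage, as in GRPO (no learned critic needed).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level clipped surrogate, negated for gradient descent.
    return -torch.min(s * adv, s.clamp(1 - eps, 1 + eps) * adv).mean()
```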

GRPO · GSPO · RL
8 min read