Understanding GRPO: Group Relative Policy Optimization for LLM Training
This article explains GRPO, a reinforcement-learning algorithm that extends PPO with group sampling, removes the value network, and constrains updates with clipping and a KL penalty, showing how these changes improve efficiency and stability when fine-tuning large language models such as DeepSeek-Math and DeepSeek-R1.
Introduction
Reinforcement learning (RL) has become a powerful tool for post‑training of large language models (LLMs), especially on reasoning‑intensive tasks. DeepSeek’s DeepSeek‑Math and DeepSeek‑R1 models demonstrate the potential of RL for improving mathematical reasoning and problem‑solving abilities.
PPO vs. GRPO
Proximal Policy Optimization (PPO) has long been the default algorithm for RL fine‑tuning of language models. Its core is a clipped policy‑gradient update that limits large policy changes. The PPO objective is shown in Figure 1.
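Figure 1 is not reproduced here, but the objective it shows is the standard PPO clipped surrogate:

$$
J_{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is the advantage estimate, which standard PPO computes with the help of a learned value network, and $\varepsilon$ bounds how far the new policy may move from the old one in a single update.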
Group Relative Policy Optimization (GRPO), first introduced in the DeepSeek‑Math paper, extends PPO with several key innovations that make it more efficient for LLMs:
No value network, reducing memory and compute consumption.
Group sampling technique for more stable advantage estimation.
Dual penalty: ratio clipping in the objective combined with a KL term added directly to the loss rather than to the reward, yielding more conservative updates.
LLM as the Policy Model
In the GRPO framework, the language model itself acts as the actor network. The input question q is treated as the observation state s, and the model generates a sequence of tokens that serve as actions aₜ. The token distribution is factorised as shown in Figure 2.
Note: the original paper uses oₜ for the output token at time step t; this article uses aₜ to align with standard RL notation.
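The factorisation in Figure 2, reconstructed in the article's notation, writes the probability of a full response as a product of per-token conditionals:

$$
\pi_\theta(a_{1:T} \mid q) \;=\; \prod_{t=1}^{T} \pi_\theta\big(a_t \mid q,\, a_{<t}\big),
$$

so each autoregressive decoding step contributes one conditional factor and, in RL terms, one action taken from the state (q, a_{<t}).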
Sequential Token Generation
Because Transformers generate tokens autoregressively, the generation process is inherently sequential:
Each token depends on previously generated tokens.
The policy network (LLM) maintains the running context.
Each token generation step corresponds to an RL action aₜ.
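To make the state/action correspondence concrete, here is a minimal, hypothetical sketch of a rollout loop; the toy vocabulary and the random stand-in for the LLM are assumptions for illustration, not DeepSeek's code:

```python
import random

VOCAB = ["the", "answer", "is", "42", "<eos>"]  # toy vocabulary (assumption)

def sample_token(context):
    """Stand-in for pi_theta(a_t | q, a_<t); a real system would query the LLM."""
    return random.choice(VOCAB)

def rollout(question, max_len=16):
    """One trajectory: the state is the question plus tokens so far; each token is an action."""
    context, actions = [question], []
    for _ in range(max_len):
        a_t = sample_token(context)   # RL action a_t sampled from the policy
        actions.append(a_t)
        context.append(a_t)           # state transition: the context grows by one token
        if a_t == "<eos>":            # end-of-sequence terminates the episode
            break
    return actions

print(rollout("What is 6 x 7?"))
```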
Reward and Advantage Calculation
For each generated sequence, GRPO computes per-token rewards as illustrated in Figure 3. Unlike traditional actor-critic methods, GRPO does not use a value network. Instead, it normalises rewards across a group of outputs sampled from the old policy, using the group statistics as the baseline for advantage estimation (Figure 4).
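The group-relative advantage in Figure 4 can be reconstructed from the DeepSeek-Math paper: given rewards r₁,…,r_G for the G sampled outputs, every token of output oᵢ receives

$$
\hat{A}_{i,t} \;=\; \frac{r_i - \operatorname{mean}\big(\{r_1,\dots,r_G\}\big)}{\operatorname{std}\big(\{r_1,\dots,r_G\}\big)},
$$

so the group mean serves as the baseline that a value network would otherwise have to learn. A minimal sketch of this normalisation in code (the function name and the epsilon guard are assumptions for illustration):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape [G], one scalar reward per sampled output in the group."""
    # Subtract the group mean (the value-network-free baseline) and rescale by the std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of G = 4 answers scored 1.0 (correct) or 0.0 (wrong) by a reward model.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0])))
```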
GRPO Objective Function
For a given question q, GRPO samples a set of outputs {o₁,…,o_G} from the old policy π_old and maximises the GRPO objective shown in Figure 5 (reconstructed below the list). The objective has three distinctive features:
Dual averaging over groups and sequence length: the loss averages across both the G sampled outputs and the token positions within each output, so every sampled response contributes to the update.
Conservative clipping: clipping the importance ratio between the new and old policies limits the magnitude of each policy update, preventing collapse.
KL-divergence penalty: a KL term added directly to the objective regularises the new policy against the reference model, preserving training stability.
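A reconstruction of the objective in Figure 5 from the DeepSeek-Math paper (tokens are written o_{i,t} there, matching the group notation above):

$$
J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_{i,t},\ \mathrm{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i,t}\big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\big]\Big)\right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\mathrm{old}}(o_{i,t}\mid q,\,o_{i,<t})},
$$

where the expectation is over questions q and groups {o₁,…,o_G} sampled from π_old(·|q), ε is the clip range, and β weights the KL penalty. The paper estimates the per-token KL with the unbiased estimator

$$
\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\big] = \frac{\pi_{\mathrm{ref}}(o_{i,t}\mid q,\,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})} - \log\frac{\pi_{\mathrm{ref}}(o_{i,t}\mid q,\,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})} - 1.
$$

A compact sketch of the corresponding loss (tensor shapes, names, and default hyperparameters are assumptions, not DeepSeek's released code; padding is assumed to be masked upstream):

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """All inputs have shape [G, T]: per-token log-probs under the new, old, and
    reference policies, plus advantages (each row filled with the group-normalised
    reward of that output)."""
    ratio = torch.exp(logp_new - logp_old)                      # rho_{i,t}
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Unbiased KL estimator: exp(d) - d - 1, with d = log pi_ref - log pi_theta.
    d = logp_ref - logp_new
    kl = torch.exp(d) - d - 1
    # Average over token positions, then over the group; negate so an optimiser
    # that minimises the loss maximises the GRPO objective.
    return -(surrogate - beta * kl).mean(dim=1).mean()
```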
Conclusion
GRPO (Group Relative Policy Optimization) marks a significant advance in applying RL to LLMs. By eliminating the value network and introducing group‑wise relative advantage estimation, it improves training efficiency and stability. The breakthroughs achieved by DeepSeek‑Math and DeepSeek‑R1 validate its practical value.
The three pillars of GRPO—group sampling, relative‑advantage estimation, and the removal of the value network—provide a blueprint for future LLM training paradigms, and similar innovations are likely to be key to unlocking further model capabilities.
AI Algorithm Path
A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.