Understanding GRPO: Group Relative Policy Optimization for LLM Training

The article explains GRPO, a reinforcement‑learning algorithm that extends PPO with group sampling, no value network, dual penalties and KL regularisation, showing how it improves efficiency and stability when fine‑tuning large language models such as DeepSeek‑Math and DeepSeek‑R1.

AI Algorithm Path

Introduction

Reinforcement learning (RL) has become a powerful tool for post‑training of large language models (LLMs), especially on reasoning‑intensive tasks. DeepSeek’s DeepSeek‑Math and DeepSeek‑R1 models demonstrate the potential of RL for improving mathematical reasoning and problem‑solving abilities.

PPO vs. GRPO

Proximal Policy Optimization (PPO) has long been the default algorithm for RL fine‑tuning of language models. Its core is a clipped policy‑gradient update that limits large policy changes. The PPO objective is shown in Figure 1.
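The clipped update can be sketched in a few lines. This is an illustrative per-token scalar form of PPO's clipped surrogate (not the article's own code); `ratio` is the probability ratio π_new(a|s)/π_old(a|s), and `eps` is the clip range, with 0.2 a common default.

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), the PPO surrogate term."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum means a large ratio cannot inflate the objective beyond the clipped value, which is what limits the size of each policy change.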

Group Relative Policy Optimization (GRPO), first introduced in the DeepSeek‑Math paper, extends PPO with several key innovations that make it more efficient for LLMs:

No value network, reducing memory and compute consumption.

Group sampling technique for more stable advantage estimation.

A dual penalty (clipping in the objective plus a KL term against a reference model) for more conservative updates.

LLM as the Policy Model

In the GRPO framework the language model acts as the actor network. The input question q is treated as the observation state s, and the model generates a sequence of tokens that serve as actions aₜ. The token distribution is factorised as shown in Figure 2.

Note: the original paper uses oₜ for the output token at time step t; this article uses aₜ to align with standard RL notation.
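The factorisation has a simple numerical consequence: the probability of a whole output is the product of per-token conditionals, so log-probabilities add. A minimal sketch (not the paper's code), where `token_probs[t]` stands for π(aₜ | q, a₍<t₎):

```python
import math

def sequence_logprob(token_probs):
    """Return log pi(o | q) given the per-token conditionals pi(a_t | q, a_<t)."""
    return sum(math.log(p) for p in token_probs)
```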

Sequential Token Generation

Because Transformers generate tokens autoregressively, the generation process is inherently sequential:

Each token depends on previously generated tokens.

The policy network (LLM) maintains the running context.

Each token generation step corresponds to an RL action aₜ.
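The three points above can be sketched as a rollout loop. Here `sample_next` is a hypothetical stand-in for the LLM's next-token sampler; the point is only that each sampled token is an RL action that feeds back into the running context.

```python
def rollout(question_tokens, sample_next, eos, max_len=32):
    context = list(question_tokens)   # running state: question plus tokens so far
    actions = []
    for _ in range(max_len):
        a_t = sample_next(context)    # one generation step = one RL action a_t
        actions.append(a_t)
        context.append(a_t)           # the action becomes part of the next state
        if a_t == eos:
            break
    return actions
```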

Reward and Advantage Calculation

For each generated sequence GRPO computes per‑token rewards as illustrated in Figure 3. Unlike PPO, GRPO does not use a value network. Instead, it samples a group of outputs from the old policy for each question and normalises their rewards within the group to estimate advantages (Figure 4).
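For the outcome-reward case this normalisation is just a z-score within the group. A minimal sketch (the choice of population standard deviation here is a convention, not necessarily the paper's exact implementation):

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: each output's reward normalised by its group's statistics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)        # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean, above-average outputs get positive advantages and below-average ones negative, with no learned value network required.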

GRPO Objective Function

For a given question q, GRPO samples a set of outputs {o₁,…,o_G} from the old policy π_old and maximises the GRPO objective shown in Figure 5. The objective has three distinctive features:

Dual averaging over groups and sequence length: the loss averages across both the sampled group and token positions to ensure comprehensive optimisation.

Conservative clipping: clipping the probability ratio between the new and old policies limits the magnitude of each update, preventing policy collapse.

KL‑divergence penalty: a KL term regularises the new policy against the reference model, preserving training stability.
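Putting the pieces together, a single per-token term of the objective might look like the toy sketch below (illustrative only, not the exact paper implementation). `ratio` is π_θ/π_old for the token, `ref_ratio` is π_ref/π_θ, and the KL penalty uses the non-negative estimator π_ref/π_θ − log(π_ref/π_θ) − 1 described in the DeepSeek‑Math paper; `beta` weights the KL term.

```python
import math

def grpo_token_term(ratio, advantage, ref_ratio, eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty toward the reference policy."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    surrogate = min(ratio * advantage, clipped * advantage)
    # KL estimate: >= 0, and exactly 0 when the policies agree on this token.
    kl = ref_ratio - math.log(ref_ratio) - 1.0
    return surrogate - beta * kl
```

The full objective then averages such terms over every token of every output in the group, matching the dual averaging described above.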

Conclusion

GRPO (Group Relative Policy Optimization) marks a significant advance in applying RL to LLMs. By eliminating the value network and introducing group‑wise relative advantage estimation, it improves training efficiency and stability. The breakthroughs achieved by DeepSeek‑Math and DeepSeek‑R1 validate its practical value.

The three pillars of GRPO—group sampling, relative‑advantage estimation, and the removal of the value network—provide a blueprint for future LLM training paradigms, and similar innovations are likely to be key to unlocking further model capabilities.

Tags: large language models, DeepSeek, reinforcement learning, GRPO, PPO
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
