Understanding GRPO: Group Relative Policy Optimization in Reinforcement Learning and Large Language Models
The article reviews reinforcement-learning fundamentals and the progression from policy-gradient to PPO, then introduces Group Relative Policy Optimization (GRPO)—a critic-free method that normalizes rewards across multiple sampled outputs to compute group-relative advantages—and shows how DeepSeek-R1 leverages GRPO with rule-based rewards to achieve strong reasoning performance.
GRPO Technical Background
The GRPO (Group Relative Policy Optimization) technique was first introduced in DeepSeek's February 2024 paper "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" and later applied in the DeepSeek-R1 model. To grasp the algorithm, a brief review of fundamental reinforcement learning (RL) concepts is helpful.
Basic Reinforcement Learning Concepts
In RL, an agent interacts with an environment by taking actions, transitioning between states, and receiving rewards. The goal is to maximize cumulative reward, a framework known as a Markov Decision Process (MDP).
Key RL terminology includes:
S : state space
A : action space
π : policy (probability of taking an action given a state)
P(s'|s,a) : state‑transition probability
R(s,a) : immediate reward function
G : return (discounted sum of future rewards)
V(s) : state‑value function (expected return from state s under policy π)
Q(s,a) : action‑value function (expected return from taking action a in state s under π)
The relationship between V and Q is that V(s) = Σ_a π(a|s)·Q(s,a).
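This relationship can be checked numerically. The sketch below uses a hypothetical single state with three actions; the policy probabilities and Q-values are made-up illustrative numbers.

```python
import numpy as np

# Hypothetical toy example: one state, three actions.
pi = np.array([0.5, 0.3, 0.2])   # policy probabilities pi(a|s)
q = np.array([1.0, 2.0, -1.0])   # action values Q(s, a)

# V(s) = sum_a pi(a|s) * Q(s, a)
v = np.dot(pi, q)
print(v)  # ≈ 0.9
```

The state value is simply the policy-weighted average of the action values.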
Types of RL Algorithms
RL problems are divided into model‑based (environment dynamics known) and model‑free (dynamics unknown). Model‑free methods dominate real‑world applications and are further split into value‑based (e.g., Q‑learning, DQN) and policy‑based approaches. GRPO belongs to the policy‑based family.
From Policy Gradient to PPO
Policy‑gradient methods aim to maximize the expected return J(θ) = 𝔼_τ[ G(τ) ] by adjusting policy parameters θ via gradient ascent. Because the exact gradient is intractable, samples of trajectories are used, often with importance sampling to improve data efficiency.
Importance sampling re‑weights samples from an old policy to estimate expectations under a new policy, reducing the need for fresh data collection.
However, naive importance sampling can suffer from high variance, especially when all rewards are positive, making it hard to distinguish better actions. Introducing a baseline (average reward) yields an advantage function that can be positive or negative, stabilizing training.
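The effect of a baseline can be seen in a small numerical sketch. All probabilities and rewards below are hypothetical; the point is only that with all-positive rewards every sampled action is pushed up, while subtracting the mean yields signed advantages.

```python
import numpy as np

# Probabilities of the same three sampled actions under old and new policies
# (hypothetical values).
p_old = np.array([0.25, 0.50, 0.25])
p_new = np.array([0.30, 0.40, 0.30])
rewards = np.array([10.0, 12.0, 8.0])   # all positive

weights = p_new / p_old                 # importance weights

# Without a baseline, every term has the same (positive) sign.
naive_terms = weights * rewards

# Subtracting the mean reward yields advantages that can be
# positive or negative, so worse-than-average actions are pushed down.
advantages = rewards - rewards.mean()
baseline_terms = weights * advantages

print(naive_terms)     # all positive
print(baseline_terms)  # mixed signs
```

The baseline does not change the gradient in expectation but reduces its variance, which is what stabilizes training.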
TRPO and PPO
Trust Region Policy Optimization (TRPO) adds a KL‑divergence constraint to keep the new policy close to the old one, but solving the constrained problem is complex. Proximal Policy Optimization (PPO) simplifies this by turning the constrained optimization into an unconstrained one, in two variants: PPO‑1 replaces the constraint with an adaptive KL penalty, while PPO‑2 uses a clipped surrogate objective. The clipped variant often performs better in practice.
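The clipped surrogate objective of PPO‑2 is compact enough to sketch directly. The log-probabilities, advantages, and ε = 0.2 below are illustrative assumptions, not values from any specific paper.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = np.exp(logp_new - logp_old)  # r = pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the elementwise min makes the objective pessimistic:
    # large ratio moves cannot increase it further.
    return -np.minimum(unclipped, clipped).mean()

# With positive advantages, ratios beyond 1 + eps stop contributing.
loss = ppo_clip_loss(np.log([1.5, 1.0, 0.5]), np.zeros(3), np.ones(3))
```

Because the ratio is clipped to [1 − ε, 1 + ε], a single gradient step cannot move the policy far from the sampling policy, which is the trust-region idea without the constrained solver.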
PPO in Large Language Models (LLM)
When applying PPO to LLMs, four models are typically involved:
Actor Model – the language model being fine‑tuned.
Critic Model – estimates the total return (value function).
Reward Model – predicts immediate reward.
Reference Model – provides a stable baseline to prevent drift.
During RLHF‑PPO, the actor and critic are updated while the reward and reference models remain frozen. The training loop consists of generating responses, evaluating them with the other models to produce experience, and then computing actor/critic losses for parameter updates.
GRPO Algorithm
GRPO removes the critic model entirely, using the average reward of multiple sampled outputs as a baseline. The algorithm proceeds as follows:
Sample a batch of prompts; for each prompt, generate multiple outputs from the current policy.
Score each output with a reward model.
Compute a group‑relative advantage for each token by normalizing rewards (subtract group mean, divide by group standard deviation).
Update the policy by maximizing the GRPO objective (shown in the original paper).
Optionally, use the updated policy to generate new training data for the reward model and set the reference model to the current policy.
Iterate until convergence.
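The normalization in step 3 can be sketched as follows. This is a minimal illustration of the group-relative advantage for the result-reward case; the reward values for the G = 4 sampled outputs are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled completions for one prompt, scored by a reward model.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # better-than-average outputs get positive advantage
```

The group mean plays the role PPO's critic-estimated value would play, which is why GRPO can drop the critic model entirely.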
GRPO distinguishes between process rewards (per‑token) and result rewards (final token). Advantages are computed by normalizing these rewards within each group of sampled outputs.
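In the result-reward case, a single normalized advantage per output is broadcast to every token of that output. The sketch below uses hypothetical advantage values and token counts.

```python
# Per-output group-relative advantages (hypothetical values) and the
# number of tokens in each sampled output.
advantages = [0.99, -0.99]
token_lengths = [3, 2]

# Each token in an output inherits that output's advantage.
per_token = [[a] * n for a, n in zip(advantages, token_lengths)]
print(per_token)  # [[0.99, 0.99, 0.99], [-0.99, -0.99]]
```

With process rewards, by contrast, intermediate steps receive their own scores, so per-token advantages can differ within one output.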
GRPO in DeepSeek‑R1
DeepSeek‑R1 demonstrated that rule‑based rewards alone (accuracy reward and format reward) can achieve strong reasoning performance when combined with GRPO. An additional language‑consistency reward was later added to address mixed‑language outputs, further boosting the model.
Summary
This article systematically reviews the evolution of policy‑optimization algorithms—from basic policy gradient to TRPO, PPO, and the novel GRPO—highlighting their theoretical foundations and practical impact on large language models. The discussion suggests that reinforcement learning will continue to inspire breakthroughs across AI domains.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.