How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models
This article explains GRPO (Group Relative Policy Optimization), an improvement over PPO for large language model training. By eliminating the value network, estimating advantages relative to a group of sampled responses, and supporting flexible supervision, GRPO achieves higher efficiency, greater stability, and strong performance on tasks such as mathematical reasoning.
Preface
A series of four articles progressively introduces reinforcement learning basics, RLHF, PPO, and GRPO.
1. Motivation and Background
1.1 Challenges of RL in Large Language Models
Computational resource consumption: PPO requires both a policy network and a value network, leading to huge memory and compute costs for LLMs.
Value estimation difficulty: Rewards are given only at the end of a sequence, making it hard for the value network to predict the value of intermediate states.
Training instability: Value-function training for LLMs often overfits or underfits, harming learning.
Advantage estimation precision: PPO relies on accurate advantage estimates, which are often imprecise for language models.
1.2 Innovative Idea of GRPO
Eliminate the value network: GRPO removes the need for a value network, greatly reducing compute and memory.
Group relative evaluation: Generates multiple answers for the same query and computes advantages from their relative quality.
Match with the reward model: RLHF reward models are trained by comparing answers, which aligns naturally with GRPO's relative evaluation.
Improved KL regularization: Refines KL control for more stable training.
GRPO aims to solve PPO’s efficiency and stability issues, especially for resource‑constrained training.
2. Mathematical Foundations of GRPO
2.1 Standard PPO Review
The PPO objective optimizes a clipped surrogate loss using a ratio between the current and old policy probabilities, a clipping parameter ε, and an advantage term estimated by Generalized Advantage Estimation (GAE), which requires a separate value network.
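To make the structure concrete, here is a minimal, framework-free Python sketch of the clipped surrogate term; the per-token probability ratios and GAE advantages are assumed to be precomputed (in PPO, the advantages come from the separate value network):
def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    # ratios: per-token probability ratios pi_theta / pi_theta_old
    # advantages: per-token GAE advantages (estimated with the value network)
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))  # clip(ratio, 1-eps, 1+eps)
        terms.append(min(ratio * adv, clipped * adv))            # pessimistic (clipped) term
    return sum(terms) / len(terms)                               # objective to be maximized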
2.2 Core GRPO Formula
GRPO’s objective replaces the advantage term with a group‑relative estimate and adds a KL term directly:
J_GRPO(θ) = 1/G * Σ_i [ min(ratio_i * advantage_i, clip(ratio_i, 1 - ε, 1 + ε) * advantage_i) ] - β * KL(π_θ || π_ref)
Key differences:
Group sampling: Sample G answers from the old policy for each query.
Advantage computation: No value network; advantages are derived from the relative rewards within the group.
KL regularization: The KL term is added directly to the objective rather than being folded into the reward.
Unbiased KL estimate: Uses a more accurate, non-negative KL estimator (sketched below).
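The unbiased estimator referred to above has the per-token form π_ref/π_θ − log(π_ref/π_θ) − 1, which is always non-negative. A minimal sketch, assuming the per-token log-probabilities of both models are available:
import math

def kl_estimate(logp, ref_logp):
    # Per-token estimate of KL(pi_theta || pi_ref); always >= 0.
    # logp: log-probability under the current policy; ref_logp: under the reference model.
    log_ratio = ref_logp - logp          # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0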
2.3 Supervision Types
Outcome Supervision: Reward is computed only for the final output; the same normalized reward is assigned to all tokens.
Process Supervision: Rewards are computed for each reasoning step; a token's advantage is the sum of the rewards of subsequent steps.
Iterative GRPO: Updates the reward model periodically using newly generated data while keeping a small portion of historical data.
3. Practical Implementation of GRPO
3.1 Initialization
Policy model: usually a supervised‑fine‑tuned LLM.
Reward model: trained to evaluate answer quality.
Reference model: copy of the policy model for KL computation.
Hyper‑parameters: clipping ε, KL coefficient β, group size G, etc.
Note: No value network is required.
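One possible way to bundle these hyper-parameters before training (a sketch; the names and default values mirror the pseudocode in Section 3.3):
from dataclasses import dataclass

@dataclass
class GRPOConfig:
    group_size: int = 64                # G: responses sampled per question
    epsilon: float = 0.2                # clipping range
    kl_coef: float = 0.04               # beta: weight of the KL penalty
    iterations: int = 10                # outer training iterations
    updates_per_iteration: int = 1      # policy updates per round of group collection
    batch_size: int = 16                # groups sampled per update
    supervision_type: str = "outcome"   # "outcome" or "process"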
3.2 Main Training Loop
3.2a Create reference model copy
3.2b Collect group data
For each question, generate G responses, record token probabilities, and compute rewards using either outcome or process supervision.
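A sketch of this step; policy_model.log_probs is a hypothetical helper for recording per-token probabilities, while score and score_steps match the pseudocode in Section 3.3:
def collect_group(policy_model, reward_model, question, group_size, supervision_type):
    group = []
    for _ in range(group_size):
        response = policy_model.generate(question)              # sample one answer
        old_logps = policy_model.log_probs(question, response)  # per-token log-probs under the old policy
        if supervision_type == "outcome":
            reward = reward_model.score(question, response)         # one scalar for the whole answer
        else:
            reward = reward_model.score_steps(question, response)   # one reward per reasoning step
        group.append({"response": response, "old_logps": old_logps, "reward": reward})
    return group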
3.2c Compute relative advantages
Outcome: normalize the group's rewards and assign the same normalized value as the advantage of every token in the corresponding response. Process: standardize the per-step rewards and set each token's advantage to the sum of the normalized rewards of the steps from that token onward.
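For the process case, a minimal sketch of turning standardized step rewards into per-token advantages; the step boundary indices (step_end_tokens) are an assumed input:
def step_rewards_to_token_advantages(norm_step_rewards, step_end_tokens, num_tokens):
    # norm_step_rewards: step rewards already standardized across the group
    # step_end_tokens: index of the last token of each reasoning step (assumed to be known)
    advantages = [0.0] * num_tokens
    for t in range(num_tokens):
        # A token's advantage is the sum of normalized rewards of steps ending at or after it.
        advantages[t] = sum(r for r, end in zip(norm_step_rewards, step_end_tokens) if end >= t)
    return advantages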
3.2d Update policy model
J_GRPO(θ) = 1/G * Σ_i [ min(ratio_i * advantage_i, clip(ratio_i, 1 - ε, 1 + ε) * advantage_i) ] - β * KL(π_θ || π_ref)
Gradients are computed (typically with Adam) and the policy parameters are updated. Multiple updates per group are possible.
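As one possible concretization, here is a PyTorch sketch of the per-response loss that the compute_grpo_objective_and_update helper in Section 3.3 could build on; the tensor arguments are assumptions:
import torch

def grpo_loss(new_logps, old_logps, ref_logps, advantages, epsilon=0.2, kl_coef=0.04):
    # All arguments are 1-D tensors over the tokens of one sampled response.
    ratio = torch.exp(new_logps - old_logps)                    # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Unbiased per-token estimate of KL(pi_theta || pi_ref), cf. Section 2.2.
    kl = torch.exp(ref_logps - new_logps) - (ref_logps - new_logps) - 1.0
    return -(surrogate - kl_coef * kl).mean()                   # loss to minimize
Averaging this loss over the G responses in a group, calling backward(), and stepping an Adam optimizer completes one policy update.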
3.2e Optional reward model update
If using iterative GRPO, generate new data with the current policy, mix in a small fraction of historical data, retrain the reward model, and set the updated policy as the new reference.
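A sketch of one such refresh, assuming a hypothetical reward_model.fit training call; the 10% replay fraction is illustrative rather than prescribed by the source:
import random

def refresh_reward_model(policy_model, reward_model, questions, history, replay_fraction=0.1):
    # Generate fresh samples with the current policy for reward-model training.
    new_data = [(q, policy_model.generate(q)) for q in questions]
    # Mix in a small slice of historical data to reduce forgetting.
    replay = random.sample(history, int(len(history) * replay_fraction))
    reward_model.fit(new_data + replay)   # hypothetical training call
    history.extend(new_data)
    return reward_model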
3.3 Pseudocode
import copy
import random

def train_GRPO(
    policy_model,                 # initial policy (usually an SFT model)
    reward_model,                 # reward model
    train_data,                   # training dataset (questions)
    group_size=64,                # G: responses per question
    epsilon=0.2,                  # clipping range
    kl_coef=0.04,                 # beta: KL penalty weight
    iterations=10,
    updates_per_iteration=1,
    batch_size=16,
    supervision_type="outcome"    # "outcome" or "process"
):
    for iteration in range(iterations):
        # 3.2a: freeze a reference copy of the current policy for KL computation
        reference_model = copy.deepcopy(policy_model)
        # 3.2b: collect group data
        group_data = []
        for question in train_data:
            responses = [policy_model.generate(question) for _ in range(group_size)]
            if supervision_type == "outcome":
                rewards = [reward_model.score(question, r) for r in responses]
            else:
                rewards = [reward_model.score_steps(question, r) for r in responses]
            group_data.append({'question': question, 'responses': responses, 'rewards': rewards})
        # 3.2c + 3.2d: compute relative advantages, then update the policy
        for _ in range(updates_per_iteration):
            batch = random.sample(group_data, batch_size)
            for group in batch:
                if supervision_type == "outcome":
                    mean = sum(group['rewards']) / len(group['rewards'])
                    std = calculate_std(group['rewards'])   # helper: std of the group rewards
                    for i, r in enumerate(group['rewards']):
                        norm = (r - mean) / (std + 1e-8)    # guard against zero std
                        group['responses'][i]['advantages'] = [norm] * len(group['responses'][i]['tokens'])
                else:
                    process_advantages(group)   # helper: per-step advantages (see 3.2c)
            compute_grpo_objective_and_update(policy_model, reference_model, batch, epsilon, kl_coef)   # helper: see 3.2d
    return policy_model
4. Advantages and Innovations of GRPO
4.1 Computational Efficiency
Avoids value network, reducing parameters by ~50%.
Fewer forward/backward passes.
Lower memory usage.
4.2 Improved Advantage Estimation
More accurate via group relative comparison.
No need for precise value function fitting.
Naturally aligns with reward‑model training.
4.3 Flexible Supervision
Outcome supervision: simple, evaluates overall output quality.
Process supervision: fine‑grained, suitable for tasks requiring step‑wise quality (e.g., math reasoning).
4.4 Enhanced KL Regularization
KL term added directly to the objective.
Uses unbiased KL estimator for higher precision.
4.5 Iterative Training Mechanism
Periodically retrains the reward model alongside the policy, so the reward model does not lag behind as the policy improves.
5. Empirical Results
Mathematical reasoning: DeepSeekMath‑RL 7B trained with GRPO achieves 88.2% on GSM8K and 51.7% on MATH, surpassing larger open‑source models.
Computational efficiency: Compared to PPO, GRPO markedly reduces memory and training time, enabling large‑model training on limited resources.
Cross‑task generalization: Despite training only on GSM8K and MATH, the model performs well on Chinese benchmarks such as MGSM‑zh and CMATH.
Tool integration: In tool‑allowed settings, the model reaches ~60% accuracy on MATH, showing advantages when combined with external tools.
6. Comparison Between PPO and GRPO
Key differences include the elimination of the value network, group‑relative advantage estimation, lower computational complexity, direct KL regularization, and better suitability for tasks with comparative rewards.
7. Limitations and Future Directions
Potential limitations include increased sampling cost due to multiple responses per query, sensitivity to group size, dependence on reward distribution, and unverified generality beyond math tasks.
Future work may explore dynamic group sizes, hybrid advantage estimation combining GAE and group‑relative methods, multi‑objective optimization, distributed implementations, and integration with other techniques such as constitutional AI.
