How GRPO Revolutionizes RLHF for Large Language Models

This article explains the motivation, mathematical foundations, implementation details, advantages, experimental results, and future directions of Group Relative Policy Optimization (GRPO), a novel reinforcement‑learning algorithm that replaces PPO’s value network with efficient group‑wise relative evaluation for large language models.


This is part of a four-article series that progressively introduces reinforcement-learning fundamentals, RLHF, PPO, and finally GRPO.

1. Motivation and Background

1.1 Challenges of RL in Large Language Models

Computational resource consumption: PPO requires both a policy network and a value network, doubling memory and compute for LLMs.

Value estimation difficulty: rewards are typically given only at the sequence end, making token-level value prediction hard.

Training instability: value-function training for large models often over- or under-fits.

Advantage estimation precision: PPO relies on accurate advantage estimates, which are noisy for language tasks.

1.2 Innovations of GRPO

DeepSeek proposes Group Relative Policy Optimization (GRPO) with four core ideas:

Avoiding the value network: eliminates the need for a separate value network, cutting compute and memory.

Group relative evaluation: generates multiple answers for the same query and compares them to compute advantages.

Alignment with reward models: RLHF reward models are usually trained on comparisons between answers, so a comparison-based advantage is a natural fit.

Improved KL regularization: adds the KL term directly to the objective for more stable training.

GRPO aims to solve PPO’s efficiency and stability issues, especially in resource‑constrained settings.

2. Mathematical Principles of GRPO

2.1 PPO Review

The standard PPO objective is:

J_PPO(θ) = E[ min( ratio_t · A_t, clip(ratio_t, 1−ε, 1+ε) · A_t ) ],  with ratio_t = πθ(y_t | z, y_<t) / πθold(y_t | z, y_<t)

where:

πθ: current policy

πθold: old policy used to sample outputs

z: prompt or problem

y: generated output sequence

A_t: advantage at timestep t

ε: clipping parameter

PPO computes advantages via Generalized Advantage Estimation (GAE), which requires an extra value network.
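For reference, GAE computes advantages with the standard recursion:

δ_t = r_t + γ · V(s_{t+1}) − V(s_t)

A_t = δ_t + γλ · A_{t+1}

where γ is the discount factor, λ the GAE parameter, and V the learned value function. This V is exactly the extra network that GRPO removes.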

2.2 Core GRPO Formula

GRPO’s objective replaces the GAE‑based advantage with a group‑wise relative advantage:

J_GRPO(θ) = E[ (1/G) · Σ_{i=1..G} (1/|y_i|) · Σ_t min( ratio_{i,t} · Â_{i,t}, clip(ratio_{i,t}, 1−ε, 1+ε) · Â_{i,t} ) ] − β · D_KL[πθ || πref]

where ratio_{i,t} = πθ(y_{i,t} | z, y_{i,<t}) / πθold(y_{i,t} | z, y_{i,<t}) and Â_{i,t} is the group-relative advantage of output y_i at token t.

Key differences:

Group sampling: sample G different outputs from the old policy for each query.

Advantage computation: compute advantages from relative rewards within the group, without a value network.

KL regularization: add the KL term directly to the objective rather than folding it into the reward.

Unbiased KL estimate: use an unbiased, always-non-negative estimator instead of the naive log-ratio.
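Concretely, the unbiased estimator used in the DeepSeekMath paper is

D_KL[πθ || πref] ≈ πref(y_t | z, y_<t) / πθ(y_t | z, y_<t) − log( πref(y_t | z, y_<t) / πθ(y_t | z, y_<t) ) − 1,

which is guaranteed to be non-negative for every sample, unlike the plain log-ratio.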

2.3 Supervision Types

Outcome supervision: evaluate only the final output and assign the same advantage to all tokens:

Compute the group reward mean and standard deviation.

Normalize each reward and set the normalized value as the token-wise advantage.

Process supervision: evaluate each reasoning step and sum the normalized step rewards to obtain token-wise advantages.
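A quick worked example of the outcome case: with G = 4 and group rewards [1.0, 0.0, 0.5, 0.5], the mean is 0.5 and the (population) standard deviation is ≈ 0.354, so the normalized advantages are ≈ [1.41, −1.41, 0, 0]. Every token of the first response then receives advantage 1.41, every token of the second −1.41, and so on.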

3. Practical Implementation of GRPO

Implementation steps:

3.1 Initialization

Policy model: usually an SFT‑pretrained LLM.

Reward model: trained to score answer quality.

Reference model: copy of the policy model for KL computation.

Hyper‑parameters: clipping ε, KL coefficient β, group size G, etc.

Note: no value network is required.

3.2 Main Training Loop

3.2a Create reference model copy

Set reference model = current policy; keep it fixed for the iteration.

3.2b Collect group data

For each question, generate G responses with the current policy.

Record each token's probability under the generating policy; these become the old-policy probabilities in the importance ratio.

Compute rewards: either outcome scores for whole answers or step scores for process supervision.

3.2c Compute relative advantages

Outcome: normalize group rewards and assign the same advantage to all tokens.

Process: normalize step rewards and sum them for each token.

3.2d Update policy

compute_grpo_objective_and_update(policy_model, reference_model, batch_data, epsilon, kl_coef)
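A sketch of what this update could look like in PyTorch. The `logprobs` method, the `old_logprobs` field, and the explicit `optimizer` argument (absent from the call above) are illustrative assumptions, not part of the article:

import torch

def compute_grpo_objective_and_update(policy_model, reference_model, batch_data, epsilon, kl_coef, optimizer):
    """One gradient step on the GRPO objective for a batch of groups."""
    total_loss, num_tokens = 0.0, 0
    for group in batch_data:
        for resp in group['responses']:
            # Per-token log-probs under the current, reference, and old policies
            new_lp = policy_model.logprobs(group['question'], resp['tokens'])
            with torch.no_grad():
                ref_lp = reference_model.logprobs(group['question'], resp['tokens'])
            old_lp = resp['old_logprobs']  # recorded at sampling time
            adv = torch.as_tensor(resp['advantages'])

            # Clipped surrogate term, as in the objective of section 2.2
            ratio = torch.exp(new_lp - old_lp)
            clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
            policy_term = torch.min(ratio * adv, clipped * adv)

            # Unbiased per-token KL estimate: exp(x) - x - 1, with x = log πref - log πθ
            x = ref_lp - new_lp
            kl = torch.exp(x) - x - 1

            # Maximize (policy_term - β·KL): accumulate its negative as the loss
            total_loss = total_loss - (policy_term - kl_coef * kl).sum()
            num_tokens += adv.numel()

    loss = total_loss / num_tokens
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()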

3.2e Optional reward model update (Iterative GRPO)

Generate new data with the updated policy.

Update reward model using a small portion of historical data.

Set the updated policy as the new reference model.

Continue training the policy with the refreshed reward model.

3.3 Pseudocode

import copy
import random

def train_GRPO(policy_model, reward_model, train_data, group_size=64, epsilon=0.2, kl_coef=0.04, iterations=10, updates_per_iteration=1, batch_size=16, supervision_type="outcome"):
    for iteration in range(iterations):
        # Freeze a copy of the current policy as the KL reference for this iteration
        reference_model = copy.deepcopy(policy_model)
        group_data = []
        for question in train_data:
            # Sample a group of G responses from the current policy
            # (a full implementation would also record the per-token log-probs here)
            responses = [policy_model.generate(question) for _ in range(group_size)]
            if supervision_type == "outcome":
                rewards = [reward_model.score(question, r) for r in responses]
            else:  # process supervision: a list of per-step rewards per response
                rewards = [reward_model.score_steps(question, r) for r in responses]
            group_data.append({'question': question, 'responses': responses, 'rewards': rewards})
        for _ in range(updates_per_iteration):
            batch_indices = random.sample(range(len(group_data)), batch_size)
            batch_data = [group_data[i] for i in batch_indices]
            for group in batch_data:
                if supervision_type == "outcome":
                    # Normalize each response's reward within its group...
                    mean_reward = sum(group['rewards']) / len(group['rewards'])
                    std_reward = calculate_std(group['rewards'])
                    for i, reward in enumerate(group['rewards']):
                        normalized = (reward - mean_reward) / std_reward
                        # ...and broadcast it to every token of that response
                        group['responses'][i]['advantages'] = [normalized] * len(group['responses'][i]['tokens'])
                else:
                    process_advantages(group)
            compute_grpo_objective_and_update(policy_model, reference_model, batch_data, epsilon, kl_coef)
    return policy_model
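The pseudocode calls two helpers it never defines. Minimal sketches under the same data layout; the `step_end_indices` field (marking where each reasoning step ends in the token sequence) is an assumption:

import math

def calculate_std(values):
    """Population standard deviation of the group rewards."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(var) if var > 0 else 1e-8  # guard against a zero std

def process_advantages(group):
    """Process supervision: normalize every step reward against the group's
    step-level statistics, then give each token the sum of the normalized
    rewards of all steps that end at or after that token."""
    all_steps = [r for step_rewards in group['rewards'] for r in step_rewards]
    mean = sum(all_steps) / len(all_steps)
    std = calculate_std(all_steps)
    for resp, step_rewards in zip(group['responses'], group['rewards']):
        adv = [0.0] * len(resp['tokens'])
        for r, end in zip(step_rewards, resp['step_end_indices']):
            normalized = (r - mean) / std
            for t in range(end + 1):  # credit every token up to the step's end
                adv[t] += normalized
        resp['advantages'] = adv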

4. Advantages and Innovations of GRPO

4.1 Computational Efficiency

Avoiding the value network cuts trainable parameters by up to ~50% (when the value network is comparable in size to the policy).

Fewer forward/backward passes lower overall compute.

Memory usage drops because the value network’s parameters and optimizer states are eliminated.

4.2 Improved Advantage Estimation

Group‑wise relative evaluation yields more accurate advantages by reducing noise.

No need to fit a precise value function.

Matches the comparative nature of reward‑model training.

4.3 Flexible Supervision

Outcome supervision: simple; evaluates whole-output quality.

Process supervision: fine-grained; suitable for tasks requiring step-wise reasoning (e.g., math).

4.4 Enhanced KL Regularization

KL term added directly to the objective, improving stability.

Uses an unbiased KL estimator for higher precision.

4.5 Iterative Training Mechanism

Iterative GRPO periodically retrains the reward model alongside the policy, so the reward model does not lag behind the steadily improving policy.

5. Empirical Results

Math reasoning : DeepSeekMath‑RL 7B (trained with GRPO) achieves 88.2% on GSM8K and 51.7% on MATH, surpassing all open‑source models from 7B to 70B and even some closed‑source models.

Computational efficiency : Compared to PPO, GRPO markedly reduces memory usage and training time, enabling large‑model training on limited resources.

Cross‑task generalization : Despite training only on GSM8K and MATH, the model performs well on Chinese benchmarks (MGSM‑zh, CMATH).

Tool integration : In tool‑enabled evaluation, DeepSeekMath‑RL 7B reaches ~60% accuracy on MATH, showing advantages when combined with external tools.

GRPO also proved effective when combined with rule‑based rewards in DeepSeek‑R1 training, challenging the belief that process‑based reward models always outperform rule‑based ones.

6. Comparison Between GRPO and PPO

The key differences discussed above can be summarized as follows:

| Aspect | PPO | GRPO |
| --- | --- | --- |
| Value network | Required, trained alongside the policy | Not required |
| Advantage estimation | GAE from a learned value function | Relative rewards normalized within a group of G samples |
| KL regularization | Typically folded into the reward signal | Added directly to the objective, with an unbiased estimator |
| Memory and compute | Higher (two large networks) | Lower (value network and its optimizer states eliminated) |
| Sampling cost per query | One response | G responses |

7. Limitations and Future Directions

Potential limitations:

Increased sampling cost due to generating many responses per query.

Sensitivity to group size G, requiring a trade‑off between diversity and efficiency.

Reliance on reward distribution; overly concentrated rewards may degrade relative advantage accuracy.

General applicability beyond math reasoning still needs verification.

Future research avenues:

Dynamic group size adaptation based on problem difficulty.

Hybrid advantage estimation combining GAE and group‑wise methods.

Multi‑objective optimization to handle multiple reward signals.

Distributed implementations for higher parallel efficiency.

Integration with other advanced techniques such as constitutional AI or self‑supervised RLHF.

GRPO offers a promising path for efficient LLM training under limited resources, and its continued development may broaden its impact across diverse tasks and larger models.
