How GRPO Revolutionizes RLHF: Efficient, Stable Training for Large Language Models

This article explains the GRPO algorithm, an improvement over PPO for large language model training that eliminates the value network, uses group‑relative advantage estimation, and offers flexible supervision, resulting in higher efficiency, stability, and performance on tasks such as mathematical reasoning.

Sohu Tech Products

Preface

This is part of a four-article series that progressively introduces reinforcement learning basics, RLHF, PPO, and GRPO.

1. Motivation and Background

1.1 Challenges of RL in Large Language Models

Computational resource consumption: PPO requires both policy and value networks, leading to huge memory and compute costs for LLMs.

Value estimation difficulty: Rewards are given only at the end of sequences, making it hard for the value network to predict intermediate states.

Training instability: Value function training for LLMs often overfits or underfits, harming learning.

Advantage estimation precision: PPO relies on accurate advantage estimates, which are often imprecise for language models.

1.2 Innovative Idea of GRPO

Eliminate value network: GRPO removes the need for a value network, greatly reducing compute and memory.

Group relative evaluation: Generates multiple answers for the same query and computes advantage from their relative quality.

Alignment with reward models: RLHF reward models are trained on pairwise comparisons of answers, which matches GRPO's relative evaluation naturally.

Improved KL regularization: Refines KL control for more stable training.

GRPO aims to solve PPO’s efficiency and stability issues, especially for resource‑constrained training.

[Figure: GRPO vs PPO illustration]

2. Mathematical Foundations of GRPO

2.1 Standard PPO Review

The PPO objective optimizes a clipped surrogate loss using a ratio between the current and old policy probabilities, a clipping parameter ε, and an advantage term estimated by Generalized Advantage Estimation (GAE), which requires a separate value network.
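For reference, the clipped surrogate objective from the PPO paper can be written as follows, where Â_t is the advantage estimated by GAE (the component that requires the separate value network):

```latex
J_{\mathrm{PPO}}(\theta)
  = \mathbb{E}_t\!\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\!\big(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_t\big)\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```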

2.2 Core GRPO Formula

GRPO’s objective replaces the advantage term with a group‑relative estimate and adds a KL term directly:

J_GRPO(θ) = (1/G) Σ_{i=1..G} (1/|o_i|) Σ_{t=1..|o_i|} min( r_{i,t}(θ) · Â_{i,t}, clip(r_{i,t}(θ), 1−ε, 1+ε) · Â_{i,t} ) − β · D_KL(π_θ ‖ π_ref)

where r_{i,t}(θ) = π_θ(o_{i,t} | q, o_{i,<t}) / π_{θ_old}(o_{i,t} | q, o_{i,<t}) is the per-token probability ratio over the i-th sampled answer o_i, and Â_{i,t} is the group-relative advantage.

Key differences:

Group sampling: Sample G answers from the old policy for each query.

Advantage computation: No value network; advantage derived from relative rewards within the group.

KL regularization: KL term is added directly to the objective.

Unbiased KL estimate: Uses a more accurate KL estimator.
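The estimator in question is the per-token form D_KL ≈ r − log r − 1 with r = π_ref/π_θ (Schulman's "k3" estimator), which is unbiased and always non-negative. A minimal sketch in pure Python, working from token log-probabilities (function name is illustrative):

```python
import math

def kl_estimate(logprob_policy, logprob_ref):
    """Per-token unbiased KL estimator: r - log(r) - 1, with r = pi_ref / pi_theta.

    The estimate is always >= 0, and its expectation over tokens sampled
    from pi_theta equals KL(pi_theta || pi_ref).
    """
    log_ratio = logprob_ref - logprob_policy  # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0

# When both policies assign a token the same probability, the estimate is zero
print(kl_estimate(-1.5, -1.5))  # 0.0
```

Because each per-token estimate is non-negative, this form avoids the noisy negative samples that the naive estimator −log r can produce.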

2.3 Supervision Types

Outcome Supervision: Reward is computed only for the final output; the same normalized reward is assigned to all tokens.

Process Supervision: Rewards are computed for each reasoning step; token advantage is the sum of subsequent step rewards.

Iterative GRPO: Updates the reward model periodically using newly generated data while keeping a small portion of historical data.
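Writing r_i for the scalar reward of answer o_i and r̃ for its group-normalized value, the two supervision modes above correspond to the following advantage definitions (a sketch of the usual formulation; step indexing for process supervision follows the convention that index(j) is the last token of step j):

```latex
% Outcome supervision: one normalized scalar per answer, shared by all its tokens
\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}
\qquad \text{for every token } t \text{ of answer } o_i

% Process supervision: a token's advantage sums the normalized rewards
% of all reasoning steps ending at or after that token
\hat{A}_{i,t} = \sum_{j:\ \operatorname{index}(j) \ge t} \tilde{r}_i^{(j)}
```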

3. Practical Implementation of GRPO

3.1 Initialization

Policy model: usually a supervised‑fine‑tuned LLM.

Reward model: trained to evaluate answer quality.

Reference model: copy of the policy model for KL computation.

Hyper‑parameters: clipping ε, KL coefficient β, group size G, etc.

Note: No value network is required.

3.2 Main Training Loop

3.2a Create reference model copy

3.2b Collect group data

For each question, generate G responses, record token probabilities, and compute rewards using either outcome or process supervision.

3.2c Compute relative advantages

Outcome: normalize group rewards and assign the same advantage to all tokens.

Process: standardize step rewards and sum them for each token.
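The outcome case reduces to a per-group z-score. A minimal sketch in pure Python (function name is illustrative):

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize a group's scalar rewards to zero mean / unit std.

    Every token of answer i then receives rewards[i]'s normalized value
    as its advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: identical rewards give std = 0
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([1.0, 2.0, 3.0]))
```

Note the zero-variance guard: if every answer in the group gets the same reward, all advantages are zero and the group contributes no policy gradient, which is the desired behavior.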

3.2d Update policy model

J_GRPO(θ) = (1/G) Σ_{i=1..G} (1/|o_i|) Σ_{t=1..|o_i|} min( r_{i,t}(θ) · Â_{i,t}, clip(r_{i,t}(θ), 1−ε, 1+ε) · Â_{i,t} ) − β · D_KL(π_θ ‖ π_ref)

Gradients are computed (typically with Adam) and the policy parameters are updated. Multiple updates per group are possible.
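The per-token clipped term inside this objective can be sketched as a toy scalar function in pure Python (in practice it is vectorized over all tokens in the batch; the function name is illustrative):

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-token PPO/GRPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    Taking the min removes any incentive to push the probability ratio
    outside [1 - eps, 1 + eps], which keeps updates conservative.
    """
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# Ratio above the clip range with a positive advantage: gain is capped
print(clipped_surrogate(1.5, 1.0))   # capped at 1.2
# Ratio below the range with a negative advantage: pessimistic bound applies
print(clipped_surrogate(0.5, -1.0))  # -0.8
```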

3.2e Optional reward model update

If using iterative GRPO, generate new data with the current policy, mix in a small fraction of historical data, retrain the reward model, and set the updated policy as the new reference.

3.3 Pseudocode

import copy
import random
import statistics

def train_GRPO(
    policy_model,          # initial policy (usually an SFT model)
    reward_model,          # trained reward model
    train_data,            # training dataset of questions
    group_size=64,         # G: responses sampled per question
    epsilon=0.2,           # clipping parameter
    kl_coef=0.04,          # beta: KL regularization coefficient
    iterations=10,
    updates_per_iteration=1,
    batch_size=16,
    supervision_type="outcome"   # "outcome" or "process"
):
    for iteration in range(iterations):
        # Freeze a reference copy of the policy for KL computation
        reference_model = copy.deepcopy(policy_model)
        group_data = []
        for question in train_data:
            # Sample G responses per question from the current policy
            responses = [policy_model.generate(question) for _ in range(group_size)]
            if supervision_type == "outcome":
                rewards = [reward_model.score(question, r) for r in responses]
            else:
                rewards = [reward_model.score_steps(question, r) for r in responses]
            group_data.append({'question': question, 'responses': responses, 'rewards': rewards})
        for _ in range(updates_per_iteration):
            batch = random.sample(group_data, batch_size)
            for group in batch:
                if supervision_type == "outcome":
                    mean = statistics.fmean(group['rewards'])
                    std = statistics.pstdev(group['rewards']) or 1.0  # guard against zero variance
                    for i, r in enumerate(group['rewards']):
                        norm = (r - mean) / std
                        # Every token in a response shares the same normalized advantage
                        group['responses'][i]['advantages'] = [norm] * len(group['responses'][i]['tokens'])
                else:
                    process_advantages(group)   # step-wise advantages (defined elsewhere)
            # Clipped surrogate objective with direct KL penalty (defined elsewhere)
            compute_grpo_objective_and_update(policy_model, reference_model, batch, epsilon, kl_coef)
    return policy_model

4. Advantages and Innovations of GRPO

4.1 Computational Efficiency

Avoids value network, reducing parameters by ~50%.

Fewer forward/backward passes.

Lower memory usage.

4.2 Improved Advantage Estimation

More accurate via group relative comparison.

No need for precise value function fitting.

Naturally aligns with reward‑model training.

4.3 Flexible Supervision

Outcome supervision: simple, evaluates overall output quality.

Process supervision: fine‑grained, suitable for tasks requiring step‑wise quality (e.g., math reasoning).

4.4 Enhanced KL Regularization

KL term added directly to the objective.

Uses unbiased KL estimator for higher precision.

4.5 Iterative Training Mechanism

Synchronously updates reward and policy models, addressing the lag of the reward model as the policy improves.

5. Empirical Results

Mathematical reasoning: DeepSeekMath‑RL 7B trained with GRPO achieves 88.2% on GSM8K and 51.7% on MATH, surpassing larger open‑source models.

Computational efficiency: Compared to PPO, GRPO markedly reduces memory and training time, enabling large‑model training on limited resources.

Cross‑task generalization: Despite training only on GSM8K and MATH, the model performs well on Chinese benchmarks such as MGSM‑zh and CMATH.

Tool integration: In tool‑allowed settings, the model reaches ~60% accuracy on MATH, showing advantages when combined with external tools.

6. Comparison Between PPO and GRPO

Key differences include the elimination of the value network, group‑relative advantage estimation, lower computational complexity, direct KL regularization, and better suitability for tasks with comparative rewards.

7. Limitations and Future Directions

Potential limitations include increased sampling cost due to multiple responses per query, sensitivity to group size, dependence on reward distribution, and unverified generality beyond math tasks.

Future work may explore dynamic group sizes, hybrid advantage estimation combining GAE and group‑relative methods, multi‑objective optimization, distributed implementations, and integration with other techniques such as constitutional AI.

Tags: Reinforcement learning, RLHF, LLM training, GRPO, PPO, AI Optimization
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
