Artificial Intelligence · 14 min read

Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives

DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.

Data Thinking Notes

DeepSeek‑R1 Reinforcement Learning Scheme

One of the highlights of DeepSeek‑R1’s reinforcement‑learning approach is the use of the GRPO algorithm instead of the commonly used PPO in RLHF, aiming to minimise human‑labelled data by designing a pure‑RL environment with a carefully crafted reward system that lets the model learn reasoning autonomously.

1. Reinforcement Learning Basics

What Is Reinforcement Learning?

Definition: Reinforcement Learning (RL) is a branch of machine learning where an agent interacts with an environment to learn an optimal decision‑making policy by trial‑and‑error, receiving feedback (rewards or penalties) and maximising cumulative reward.

Analogy: Similar to training a puppy: correct actions earn treats (positive reward), wrong actions earn no treat or a scolding (zero or negative reward); eventually the puppy learns commands like “sit” or “shake”.

Background of RL

Origin: 1950s cybernetics and psychology research, early applications in robot path planning and game AI.

Core Need: Solving sequential decision‑making problems, balancing short‑term and long‑term returns.

Explosion Point: AlphaGo’s 2016 victory over Lee Sedol sparked massive interest in RL within AI.

Core Elements of RL

Agent: The decision‑making entity (e.g., robot, game character, chatbot).

Environment: The external system the agent interacts with (e.g., game rules, physics simulator).

Reward: Immediate feedback signal from the environment (e.g., score, survival time).

Policy: The agent’s decision rule – “what action to take in which state”.

Value Function: Estimates long‑term return of states or actions, helping the agent weigh immediate vs future rewards.
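These elements come together in the agent–environment loop. A minimal sketch, using a hypothetical 1‑D “walk to the goal” environment (the environment, states, and rewards are illustrative, not from the article):

```python
# A toy environment: states 0..4, state 4 is the goal.
class WalkEnv:
    """Actions: -1 (step left) or +1 (step right)."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0  # reward only at the goal
        done = self.state == 4
        return self.state, reward, done

env = WalkEnv()
state = env.reset()
total_reward = 0.0
for _ in range(20):      # one episode of at most 20 steps
    action = 1           # a fixed "always move right" policy, just to show the loop
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

A real agent would replace the fixed action with a learned policy and use the reward signal to improve it over many episodes.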

2. Core RL Method Classifications

Value‑Based: Estimates the value (Q‑value) of each state/action and selects the optimal action. Typical algorithms: Q‑Learning, DQN.

Policy Gradient: Directly optimises the policy network by gradient ascent to maximise expected reward. Typical algorithm: REINFORCE.

Actor‑Critic: Combines policy gradient (actor) with a value estimator (critic). Typical algorithms: A3C, PPO.
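To make the value‑based idea concrete, here is a sketch of a single tabular Q‑Learning update; the state names, rewards, and hyperparameters are illustrative, not from the article:

```python
alpha, gamma = 0.5, 0.9  # learning rate and discount factor

# Q-table: Q[state][action] = estimated long-term return
Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 0.0}}

def q_update(Q, s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Suppose moving right from s0 lands in s1 with reward 1.
q_update(Q, "s0", "right", 1.0, "s1")
```

With all values initialised to zero, this single update moves Q(s0, right) halfway toward the reward of 1, i.e. to 0.5; repeated updates propagate value estimates backward through the state space.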

3. RLHF (Reinforcement Learning from Human Feedback)

Analogy

Like a tutoring class for AI: a human teacher provides feedback on the model’s outputs, guiding it toward human‑aligned behavior.

Core Idea

Replace or adjust the environment’s reward with human‑derived feedback so the AI aligns with human values.

RLHF vs Traditional RL

Reward Source: Traditional RL uses automatic environment feedback (e.g., game score); RLHF uses human annotations or a reward‑model prediction.

Data Requirement: Traditional RL needs massive interaction data; RLHF needs human preference data plus interaction data.

Application Scenarios: Traditional RL fits clear‑rule tasks like game AI or robot control; RLHF is used for language‑model alignment and ethically sensitive tasks.

Advantages: Traditional RL requires no human intervention; RLHF better captures human subjective preferences.

Challenges: Designing reward functions is hard; RLHF incurs high human‑labeling cost and possible bias in the reward model.

Through RLHF, AI learns not only to accomplish target tasks but also to follow human intent and values; this is the setting in which PPO, DPO, and GRPO are applied.

4. PPO (Proximal Policy Optimization)

Analogy

Like a fitness coach’s “safe training plan”: the trainee’s workload changes only slightly each session to avoid injury.

Core Idea

Define a “safe range” so that policy updates are small and stable, preventing catastrophic performance drops.

Key Principles

Policy Gradient: Adjust the policy based on the advantage function (how much better an action is compared to the average).

Clip Mechanism: Restrict the policy‑probability ratio to within ±20% of the old policy per update (ε = 0.2, a common default).

Critic Role: A value network evaluates long‑term effects of actions, similar to a fitness assessor.
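The clip mechanism can be sketched for a single action with plain Python floats (real implementations operate on batched tensors; the numbers here are illustrative):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A), with r = pi_new / pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# If the new policy raises the action's probability by 50% (ratio = 1.5)
# but eps = 0.2, a positive advantage is only credited up to ratio 1.2.
obj = ppo_clip_objective(logp_new=math.log(1.5), logp_old=math.log(1.0), advantage=1.0)
```

The `min` keeps the objective from rewarding moves outside the ±ε band, which is what keeps each update small and stable.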

Pros & Cons

Advantages: Stable and controllable, suitable for complex tasks such as robot locomotion or game‑AI boss fights.

Disadvantages: Requires a critic network and large training data, leading to high computational cost.

Practical Applications

ChatGPT fine‑tuning: combines PPO with a human‑feedback reward model to generate more natural responses.

5. DPO (Direct Preference Optimisation)

Analogy

Like a student improving essays directly from a teacher’s comments without a scoring rubric.

Core Idea

Skip the reward‑model step and optimise the policy directly from human‑annotated preference pairs.

Key Principles

Flaws of Traditional RLHF: Two‑stage process (reward model training → policy optimisation) is complex and error‑prone.

DPO Simplification: Directly tell the model “Answer A is better than Answer B”.

Loss Function:

L_DPO = −log σ( β · [ log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x)) ] )

where σ is the sigmoid, β controls optimisation strength, π_θ is the policy being trained, π_ref is a frozen reference policy, y_w is the preferred answer, and y_l the rejected one.

Input Data: Preference pairs (e.g., “Answer A is clear, Answer B is off‑topic”).

Optimization Goal: Increase the probability of the preferred answer over the less‑preferred one.
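The standard DPO loss for one preference pair can be sketched with scalar log‑probabilities (the numeric values are illustrative, not from the article):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy already prefers the winner more than the reference does,
# the margin is positive and the loss falls below log(2) (the value at margin 0).
loss = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0, beta=0.1)
```

Gradient descent on this loss pushes the policy's log‑probability of the preferred answer up relative to the rejected one, without ever training a separate reward model.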

Pros & Cons

Advantages: No reward‑model training, low memory usage, fast fine‑tuning for dialogue models.

Disadvantages: Relies heavily on high‑quality preference data; biased annotations can mislead the model.

Practical Applications

Dialogue model alignment – making AI refuse harmful queries.

Text‑summary optimisation – using click‑through preference signals to produce more engaging summaries.

6. GRPO (Group‑Relative Policy Optimisation)

Analogy

Like a talent‑show where contestants in the same group perform the same piece and are judged relative to each other.

Core Idea

Optimise policies by comparing candidates within a group rather than using absolute scores, reducing the need for a critic network.

Key Principles

Group‑Relative Reward: Generate multiple candidates for the same query and compare them.

Reward Normalisation: Convert group rewards to standardized scores (e.g., advantage measured in standard deviations).

Formula Example: A_i = (r_i − μ_group) / σ_group, where r_i is candidate i’s reward, μ_group is the average reward of the group, and σ_group the standard deviation of group rewards.

Omit Critic Network: Traditional PPO needs a critic to predict scores; GRPO computes advantages on‑the‑fly from group comparisons, saving memory.

Stability Controls: KL‑divergence penalty and clipping mechanism limit policy drift, similar to PPO’s clip.
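The group‑relative advantage computation is simple enough to sketch directly; the reward values below are illustrative, not from the article:

```python
import statistics

def group_advantages(rewards):
    """A_i = (r_i - mean(group)) / std(group); no critic network needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four candidate answers to one query, scored by a rule-based reward
# (e.g., 1.0 if the final answer is correct, 0.0 otherwise).
advs = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Candidates above the group mean get positive advantages and are reinforced; those below get negative ones. Because the baseline is the group mean itself, no learned value network is required.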

Pros & Cons

Advantages: Reduces GPU memory by ~50%, suitable for resource‑constrained scenarios (e.g., mobile training); multi‑candidate generation enhances diversity.

Disadvantages: Inference requires generating several candidates, increasing latency.

Practical Applications

Mathematical reasoning – GRPO (introduced with DeepSeekMath) lifted accuracy on the MATH benchmark to 51.7% through multi‑answer ranking.

Code generation – produces several implementations and selects the most concise and efficient one.

7. Summary & Application Recommendations

Core Comparison

Goal: PPO – maximise cumulative reward; DPO – directly align with human preferences; GRPO – optimise group‑relative rewards.

Data Dependency: PPO – environment interaction or reward model; DPO – high‑quality preference pairs; GRPO – multiple candidates per query.

Computational Complexity: PPO – high (policy + value networks); DPO – low (policy only); GRPO – medium (no value network but multi‑candidate generation).

Suitable Scenarios: PPO – robot control, game AI; DPO – dialogue model fine‑tuning, text generation; GRPO – math reasoning, resource‑sensitive tasks.

Application Advice

PPO: Use for complex tasks requiring environment interaction or a reward model (e.g., robot control).

DPO: Choose when high‑quality preference data is available and rapid language‑model fine‑tuning is needed (e.g., conversational AI).

GRPO: Prefer for memory‑limited or diversity‑focused tasks such as mathematical reasoning or code generation.

Tags: Reinforcement Learning · RLHF · GRPO · PPO · AI alignment · DPO
Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
