Why DeepSeek R1 Swaps PPO for GRPO: A Deep Dive into RLHF Alternatives
DeepSeek‑R1 replaces the traditional PPO‑based RLHF approach with GRPO, reducing reliance on human‑labeled data by using pure reinforcement learning environments and carefully designed reward mechanisms; the article explains reinforcement learning fundamentals, compares PPO, DPO and GRPO, and offers practical application recommendations.
DeepSeek‑R1 Reinforcement Learning Scheme
One of the highlights of DeepSeek‑R1’s reinforcement‑learning approach is the use of the GRPO algorithm instead of the commonly used PPO in RLHF, aiming to minimise human‑labelled data by designing a pure‑RL environment with a carefully crafted reward system that lets the model learn reasoning autonomously.
1. Reinforcement Learning Basics
What Is Reinforcement Learning?
Definition: Reinforcement Learning (RL) is a branch of machine learning where an agent interacts with an environment to learn an optimal decision‑making policy by trial‑and‑error, receiving feedback (rewards or penalties) and maximising cumulative reward.
Analogy: Similar to training a puppy: correct actions earn treats (positive reward), wrong actions earn nothing or a scolding (zero or negative reward); eventually the puppy learns commands like “sit” or “shake”.
Background of RL
Origin: 1950s cybernetics and psychology research, early applications in robot path planning and game AI.
Core Need: Solving sequential decision‑making problems, balancing short‑term and long‑term returns.
Explosion Point: AlphaGo’s 2016 victory over Lee Sedol sparked massive interest in RL within AI.
Core Elements of RL
Agent: The decision‑making entity (e.g., robot, game character, chatbot).
Environment: The external system the agent interacts with (e.g., game rules, physics simulator).
Reward: Immediate feedback signal from the environment (e.g., score, survival time).
Policy: The agent’s decision rule – “what action to take in which state”.
Value Function: Estimates long‑term return of states or actions, helping the agent weigh immediate vs future rewards.
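These five elements can be made concrete with a tiny tabular Q-learning loop. This is an illustrative sketch only (the corridor environment and all constants are invented for the example, and have nothing to do with DeepSeek's setup): the agent walks a five-cell corridor, the environment pays reward 1 at the goal, the policy is epsilon-greedy, and the Q table plays the role of the value function.

```python
import random

# Toy 1-D corridor environment: states 0..4, reward +1 for reaching state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left / move right

def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

# Value estimates: Q[state][action_index]
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.3  # learning rate, discount, exploration rate

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        # Policy: epsilon-greedy over the current Q-values
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max(range(2), key=lambda i: Q[s][i])
        s2, r, done = step(s, ACTIONS[a])
        # Q-learning update: nudge the estimate toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# Greedy action per non-terminal state (1 = move right)
print([max(range(2), key=lambda i: Q[s][i]) for s in range(N_STATES - 1)])
```

After training, the greedy policy moves right in every state: the agent has learned to trade an immediate reward of zero for the discounted future reward at the goal.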
2. Core RL Method Classifications
Value‑Based: Estimates the value (Q‑value) of each state/action and selects the optimal action. Typical algorithms: Q‑Learning, DQN.
Policy Gradient: Directly optimises the policy network by gradient ascent to maximise expected reward. Typical algorithm: REINFORCE.
Actor‑Critic: Combines policy gradient (actor) with a value estimator (critic). Typical algorithms: A3C, PPO.
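A minimal REINFORCE sketch on a two-armed bandit shows the policy-gradient idea: sample an action from a softmax policy, then push the parameters in proportion to (reward − baseline) times the gradient of the log-probability. The bandit payouts, learning rate, and baseline update are all illustrative choices, not from any particular paper.

```python
import math
import random

# Two-armed bandit: arm 1 pays about 1.0 on average, arm 0 about 0.2.
def pull(arm):
    return random.gauss(1.0 if arm == 1 else 0.2, 0.1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0]  # policy parameters: one logit per arm
lr, baseline = 0.1, 0.0
random.seed(0)

for _ in range(2000):
    probs = softmax(theta)
    arm = random.choices([0, 1], weights=probs)[0]
    reward = pull(arm)
    baseline += 0.01 * (reward - baseline)  # running average reduces gradient variance
    # REINFORCE: d/d theta_i of log pi(arm) = 1[i == arm] - pi(i); ascend expected reward
    for i in range(2):
        grad_logp = (1.0 if i == arm else 0.0) - probs[i]
        theta[i] += lr * (reward - baseline) * grad_logp

print(softmax(theta)[1])  # probability of the better arm, close to 1 after training
```

No value network is involved here; the running-average baseline is the simplest variance-reduction trick. Replacing it with a learned critic turns this into the actor-critic family.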
3. RLHF (Reinforcement Learning from Human Feedback)
Analogy
Like a tutoring class for AI: a human teacher provides feedback on the model’s outputs, guiding it toward human‑aligned behavior.
Core Idea
Replace or adjust the environment’s reward with human‑derived feedback so the AI aligns with human values.
RLHF vs Traditional RL
Reward Source: Traditional RL uses automatic environment feedback (e.g., game score); RLHF uses human annotations or a reward‑model prediction.
Data Requirement: Traditional RL needs massive interaction data; RLHF needs human preference data plus interaction data.
Application Scenarios: Traditional RL fits clear‑rule tasks like game AI or robot control; RLHF is used for language‑model alignment and ethically sensitive tasks.
Advantages: Traditional RL requires no human intervention; RLHF better captures human subjective preferences.
Challenges: Designing reward functions is hard; RLHF incurs high human‑labeling cost and possible bias in the reward model.
Through RLHF, AI can not only accomplish its target tasks but also internalise human intent and values; this alignment setting is where PPO, DPO, and GRPO all operate.
4. PPO (Proximal Policy Optimization)
Analogy
Like a fitness coach’s “safe training plan”: the trainee’s workload changes only slightly each session to avoid injury.
Core Idea
Define a “safe range” so that policy updates are small and stable, preventing catastrophic performance drops.
Key Principles
Policy Gradient: Adjust the policy based on the advantage function (how much better an action is compared to the average).
Clip Mechanism: Restrict the new-to-old policy probability ratio to [1−ε, 1+ε] at each update; with the common default ε=0.2, that is roughly ±20%.
Critic Role: A value network evaluates long‑term effects of actions, similar to a fitness assessor.
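The clip mechanism fits in a few lines. This sketch evaluates PPO's clipped surrogate objective for a single (state, action) sample, with the ε=0.2 default from above; in practice the objective is averaged over a batch and negated to form a loss, but the function name and example numbers here are purely illustrative.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for one sample.
    ratio = pi_new(a|s) / pi_old(a|s); advantage comes from the critic."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    # Pessimistic minimum: large policy shifts earn no extra credit
    return min(ratio * advantage, clipped * advantage)

# Good action (advantage +1): pushing the ratio past 1.2 gains nothing extra
print(ppo_clip_objective(1.5, +1.0))   # capped at 1.2
# Bad action (advantage -1): the clip keeps the penalty at the -0.8 boundary
print(ppo_clip_objective(0.5, -1.0))
```

The `min` is what makes updates "safe": whichever direction the advantage points, the objective never rewards moving the policy ratio outside the [0.8, 1.2] band.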
Pros & Cons
Advantages: Stable and controllable, suitable for complex tasks such as robot locomotion or game‑AI boss fights.
Disadvantages: Requires a critic network and large training data, leading to high computational cost.
Practical Applications
ChatGPT fine‑tuning: combines PPO with a human‑feedback reward model to generate more natural responses.
5. DPO (Direct Preference Optimisation)
Analogy
Like a student improving essays directly from a teacher’s comments without a scoring rubric.
Core Idea
Skip the reward‑model step and optimise the policy directly from human‑annotated preference pairs.
Key Principles
Flaws of Traditional RLHF: Two‑stage process (reward model training → policy optimisation) is complex and error‑prone.
DPO Simplification: Directly tell the model “Answer A is better than Answer B”.
Loss Function:
L_DPO = −log σ(β·[(log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x))]), where σ is the sigmoid function, β controls optimisation strength, π_θ is the policy being trained, and π_ref is a frozen reference policy (typically the supervised fine-tuned model).
Input Data: Preference pairs (e.g., “Answer A is clear, Answer B is off‑topic”).
Optimization Goal: Increase the probability of the preferred answer over the less‑preferred one.
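The DPO loss compares the policy's log-probability margin between the preferred and rejected answers against that of a frozen reference model. A sketch for a single preference pair, assuming per-answer log-probabilities have already been summed over tokens (the argument names and example numbers are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (y_w preferred over y_l).
    logp_* are summed token log-probs under the trained policy;
    ref_logp_* are the same quantities under the frozen reference model."""
    # Implicit reward of each answer: beta * log(pi_theta / pi_ref)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already favours the preferred answer more than the reference does,
# so the margin is positive and the loss is modest:
print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0), 4))  # ≈ 0.5981
```

Gradient descent on this loss raises the probability of y_w relative to y_l while the reference terms keep the policy anchored to its starting point, which is why no separate reward model is needed.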
Pros & Cons
Advantages: No reward‑model training, low memory usage, fast fine‑tuning for dialogue models.
Disadvantages: Relies heavily on high‑quality preference data; biased annotations can mislead the model.
Practical Applications
Dialogue model alignment – making AI refuse harmful queries.
Text‑summary optimisation – using click‑through preference signals to produce more engaging summaries.
6. GRPO (Group‑Relative Policy Optimisation)
Analogy
Like a talent‑show where contestants in the same group perform the same piece and are judged relative to each other.
Core Idea
Optimise policies by comparing candidates within a group rather than using absolute scores, reducing the need for a critic network.
Key Principles
Group‑Relative Reward: Generate multiple candidates for the same query and compare them.
Reward Normalisation: Convert group rewards to standardized scores (e.g., advantage measured in standard deviations).
Formula Example: A_i = (r_i − μ_group) / σ_group, where r_i is candidate i's reward, μ_group is the average reward of the group, and σ_group is the standard deviation of group rewards.
Omit Critic Network: Traditional PPO needs a critic to predict scores; GRPO computes advantages on‑the‑fly from group comparisons, saving memory.
Stability Controls: KL‑divergence penalty and clipping mechanism limit policy drift, similar to PPO’s clip.
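The group-relative advantage is simply a z-score computed within each group of candidates, which is what lets GRPO skip the critic network. A minimal sketch (function name and reward values are illustrative):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: z-score each candidate's reward
    against its own group, replacing PPO's learned critic."""
    n = len(rewards)
    mu = sum(rewards) / n
    var = sum((r - mu) ** 2 for r in rewards) / n
    sigma = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers to one prompt, scored by some reward function:
print([round(a, 2) for a in grpo_advantages([1.0, 0.0, 0.5, 0.5])])
# → [1.41, -1.41, 0.0, 0.0]
```

The best answer in the group gets a positive advantage and the worst a negative one, regardless of the absolute reward scale; these advantages then plug into a PPO-style clipped update.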
Pros & Cons
Advantages: Reduces GPU memory by ~50%, suitable for resource‑constrained scenarios (e.g., mobile training); multi‑candidate generation enhances diversity.
Disadvantages: Each training update requires sampling several candidate answers per query, which raises generation cost during RL training.
Practical Applications
Mathematical reasoning – GRPO's multi-answer group ranking lifted DeepSeekMath (the model where the algorithm was introduced) to 51.7% accuracy on the MATH benchmark.
Code generation – produces several implementations and selects the most concise and efficient one.
7. Summary & Application Recommendations
Core Comparison
Goal: PPO – maximise cumulative reward; DPO – directly align with human preferences; GRPO – optimise group‑relative rewards.
Data Dependency: PPO – environment interaction or reward model; DPO – high‑quality preference pairs; GRPO – multiple candidates per query.
Computational Complexity: PPO – high (policy + value networks); DPO – low (policy only); GRPO – medium (no value network but multi‑candidate generation).
Suitable Scenarios: PPO – robot control, game AI; DPO – dialogue model fine‑tuning, text generation; GRPO – math reasoning, resource‑sensitive tasks.
Application Advice
PPO: Use for complex tasks requiring environment interaction or a reward model (e.g., robot control).
DPO: Choose when high‑quality preference data is available and rapid language‑model fine‑tuning is needed (e.g., conversational AI).
GRPO: Prefer for memory‑limited or diversity‑focused tasks such as mathematical reasoning or code generation.