A 2026 Survey of LLM‑Focused RL: From PPO to DPO, GRPO, and Multi‑Agent RL
This article reviews the five‑year evolution of reinforcement‑learning techniques for large language models, comparing PPO, DPO, GRPO and emerging multi‑agent approaches, analyzing their reward signals, practical trade‑offs, and the open‑source frameworks that support them.
Five‑Year Evolution of LLM‑Centric RL
Reinforcement learning (RL) was once a niche sub‑field for games and robotics, but after the release of ChatGPT it became the bridge between "smart" foundation models and useful products. In roughly five years the entire training pipeline has been rewritten at least three times, and the nature of the reward signal has changed even more dramatically than the algorithms themselves.
60‑Second History
1989 – Q‑learning, the value‑based RL cornerstone.
1992 – REINFORCE, the policy‑gradient cornerstone.
2013‑2015 – DQN beats humans on Atari, marrying RL with deep learning.
2016 – AlphaGo defeats Lee Sedol.
2017 – OpenAI publishes PPO (Proximal Policy Optimization), which becomes the default RL algorithm for the next five years.
2017 – AlphaZero achieves self‑play mastery without human data.
2022 – InstructGPT adapts PPO to fine‑tune language models with human preferences; ChatGPT launches shortly after.
All current LLM‑RL work builds on the PPO + reward‑signal lineage.
PPO + RLHF: The Starting Point
The InstructGPT paper formalized the pipeline:
SFT – fine‑tune the base model on a small set of human‑written demonstrations.
Reward Model (RM) – present annotators with two model outputs, ask which is better, and train a model r(x, y) to predict that preference.
PPO – treat the RM as the environment, sample responses, score them with the RM, and update the policy with PPO while adding a KL penalty that keeps the new policy close to the SFT policy.
The objective maximizes the expected RM score minus a KL term weighted by a hyper‑parameter β, which is the most frequently tuned knob. The KL term prevents the policy from collapsing into a high‑reward but nonsensical distribution.
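Written out, the reward-model loss and the policy objective described above take roughly the following form. This is a sketch in notation common to the RLHF literature, not a transcription of the InstructGPT paper: r_φ is the reward model, π_θ the policy being trained, π_SFT the frozen SFT policy, y_w and y_l the preferred and rejected responses, and σ the logistic function. InstructGPT also mixes in a pre-training loss term, which is omitted here.

```latex
% Reward model: pairwise (Bradley-Terry) loss on human comparisons
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% Policy: expected RM score minus a beta-weighted KL penalty to the SFT policy
\max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D}}
  \Big[
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) \right]
    - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \right]
  \Big]
```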
InstructGPT showed that a 1.3 B PPO‑fine‑tuned model can outperform a 175 B GPT‑3 baseline in human preference alignment.
Practical Pitfalls of PPO + RLHF
Four models in GPU memory – policy, frozen reference policy, reward model, and value (critic) network. If all four match a 70 B policy, that is roughly 280 B parameters of weights alone, before the gradients and optimizer states of the two trainable models (policy and critic) are counted.
Reward hacking – the policy learns to exploit any weakness in the RM (e.g., producing long lists, markdown headings) as long as the RM assigns a high score.
Distribution shift – the RM is trained on SFT outputs; as the policy drifts, the RM becomes less reliable, a problem not visible on the loss curve.
Hyper‑parameter fragility – clipping ratio, KL coefficient, value‑loss weight, learning rate, rollout batch size; a single mis‑tuned value can silently degrade training.
Thus PPO + RLHF is powerful but its cost is primarily engineering rather than mathematical.
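To make the four-model cost concrete, here is a back-of-envelope sketch. The byte counts are assumptions, not figures from the article: frozen models hold bf16 weights only (2 bytes per parameter), and trainable models use a common mixed-precision Adam rule of thumb of about 16 bytes per parameter. Activations and KV caches are ignored, and the function name is hypothetical.

```python
# Back-of-envelope state memory for PPO + RLHF (illustrative assumptions only).
PARAMS_PER_MODEL = 70e9   # assumed: all four models are 70B
BYTES_FROZEN = 2          # assumed: bf16 weights only (reference policy, reward model)
BYTES_TRAINABLE = 16      # assumed: bf16 weights + grads, fp32 master copy + two Adam moments


def ppo_memory_estimate(params: float = PARAMS_PER_MODEL) -> dict:
    """Return the total parameter count (billions) and approximate state memory (GB)."""
    trainable = 2 * params   # policy + critic
    frozen = 2 * params      # reference policy + reward model
    total_bytes = trainable * BYTES_TRAINABLE + frozen * BYTES_FROZEN
    return {
        "total_params_b": (trainable + frozen) / 1e9,
        "state_memory_gb": total_bytes / 1e9,
    }


estimate = ppo_memory_estimate()
print(estimate)
```

Under these assumptions the two trainable models dominate the total, which is why removing either one has an outsized effect on the GPU budget.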
When PPO Still Makes Sense
Even after newer methods appeared, PPO remains the right choice when:
Exploration (mathematics, code, long‑range reasoning) is required, not just preference imitation.
A high‑quality, stable reward model or verifier is available.
The GPU budget can hold all four models simultaneously.
ICML 2024 ("Is DPO Superior to PPO for LLM Alignment?") reported that with equal data quality PPO still outperforms DPO by ~2.5 % on math tasks and ~1.2 % on general benchmarks.
DPO: Direct Preference Optimization
Rafailov et al. (2023) introduced Direct Preference Optimization, which removes the explicit reward model. Under the standard RLHF assumption (Bradley–Terry preference model with a KL‑regularized objective), the optimal policy and the implicit reward function have a closed‑form relationship. DPO replaces the two‑step process (learn RM → PPO) with a single supervised loss defined on preference triples (prompt, chosen, rejected).
The loss is a binary cross‑entropy (logistic loss) on the β‑scaled difference between the chosen and rejected responses' log‑probability ratios, each measured against a frozen reference policy. No separate reward model, rollout, critic, or PPO loop is needed, although the reference policy must still be evaluated on every pair.
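For reference, the DPO loss can be written as follows, where y_w and y_l are the chosen and rejected responses, π_ref is the frozen reference policy (usually the SFT model), σ is the logistic function, and β is the same KL weight as in the RLHF objective:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]

% Implicit reward recovered from the policy (up to a prompt-dependent constant)
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```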
Cheaper – 2–4× less compute than PPO because rollouts are unnecessary.
More stable – the loss is purely supervised; training curves are easy to interpret.
Style shaping – DPO can influence refusal behavior, tone, formatting, and chit‑chat usefulness.
β matters – too low lets the policy drift; too high prevents movement. Practitioners typically use β in [0.1, 0.5].
Iterative DPO – re‑sample preference pairs from the latest policy for multi‑round improvement.
Limitation: DPO does not explore beyond the data distribution; if the correct answer never appears in the dataset, DPO cannot invent it.
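The DPO loss and the role of β can be sketched in a few lines of plain Python. This is a minimal per-example illustration, assuming the four inputs are summed sequence log-probabilities that have already been computed; the function names and the numbers are made up for the example, and a real trainer would batch this on tensors.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has moved probability mass toward the chosen response: the loss drops.
loss_after = dpo_loss(-10.0, -18.0, -12.0, -15.0, beta=0.1)

print(loss_at_init, loss_after)
```

Because β multiplies the margin, a larger β makes the same shift in log-probabilities count for more, so the loss saturates after a smaller departure from the reference policy.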
GRPO: Group Relative Policy Optimization
DeepSeek introduced GRPO in the DeepSeekMath paper (2024) to eliminate the critic network. Instead of a learned value function, GRPO uses the other rollouts for the same prompt as a baseline.
For a prompt x, GRPO samples a group of G rollouts y₁…y_G (typically G = 8–64), scores each with a verifier (often a binary or test‑based reward), computes a group‑wise normalized advantage, and applies a PPO‑style clipped objective with a KL penalty to a reference policy.
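The two core computations in that description can be sketched as follows. This is a simplified illustration, not DeepSeek's implementation: it uses the population standard deviation (implementations differ on this), treats the probability ratio as a single scalar, and leaves out the KL penalty. The reward values are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Binary verifier rewards for G = 8 rollouts of one prompt (made-up values).
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_advantages(rewards)
print(advantages)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, with no value network involved; the group mean plays the role of the baseline.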
No critic – memory usage drops by roughly half; a 7 B model that needs 8 H100 GPUs with PPO can run on 4 H100 with GRPO.
Natural fit for verifiable rewards – binary correctness signals (e.g., unit‑test pass/fail) produce clean contrastive signals within the group.
Stable advantage – group normalization mitigates reward‑scale issues.
Works well for reasoning tasks – GRPO suits long chains of thought, large models, and strong verifiers; DeepSeek‑R1, Qwen, and OLMo 3 are cited examples.
Common failure points for first‑time GRPO runs:
Group size G – larger G reduces variance but linearly increases rollout cost; most public configs use G = 16–32.
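All‑zero or all‑one groups – when every sample succeeds or every sample fails, the group's standard deviation is zero, so the normalized advantage is undefined (0/0) and the prompt carries no learning signal. Adding ε to the denominator avoids the division by zero, and filtering such degenerate prompts avoids spending rollouts on them.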
KL coefficient – β too low leads to incoherent language; DeepSeek typically uses β ∈ [0.001, 0.04].
Reward form – binary vs. dense rewards behave very differently; process‑level rewards require careful design.
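The degenerate-group problem above is easy to detect before the update step. The sketch below is a hypothetical helper, not taken from any particular framework; it is similar in spirit to the dynamic sampling used in DAPO, which drops prompts whose rollouts all pass or all fail.

```python
def split_degenerate_groups(groups):
    """Separate prompts whose rollouts all received the same reward.

    `groups` maps a prompt id to the list of rewards for its rollouts.
    Returns a (usable, degenerate) pair of dicts.
    """
    usable, degenerate = {}, {}
    for prompt_id, rewards in groups.items():
        if len(set(rewards)) <= 1:
            degenerate[prompt_id] = rewards   # zero variance: no contrast to learn from
        else:
            usable[prompt_id] = rewards
    return usable, degenerate


batch = {
    "p1": [1, 0, 1, 0],   # mixed outcomes: useful contrast
    "p2": [0, 0, 0, 0],   # too hard: every rollout failed
    "p3": [1, 1, 1, 1],   # too easy: every rollout passed
}
usable, degenerate = split_degenerate_groups(batch)
print(sorted(usable), sorted(degenerate))
```

Tracking the fraction of degenerate prompts per batch is also a cheap health metric: a rising share of all-pass groups suggests the prompt set has become too easy for the current policy.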
Variants such as DAPO, GSPO, and Dr. GRPO are minor tweaks that keep the core idea of using a rollout group as a baseline.
Reward‑Signal Evolution
PPO + RLHF (2022‑2023) – reward comes from a human‑trained RM; the signal captures "which answer humans prefer". Failure modes: flattery, reward hacking. Bottleneck: human annotators.
DPO (2023‑2024) – the reward is implicit: the policy is trained directly on preference pairs, with no separate RM. Failure mode: lack of exploration. Bottleneck: quality of preference data.
GRPO + RLVR (2024‑2026) – reward comes from verifiable judges (unit tests, theorem provers, regex). The signal captures "is the answer provably correct". Failure modes: verifier hacking, capability tunnel‑vision. Bottleneck: verifier design.
GRPO + LLM‑judge (2025‑present) – a stronger model acts as the judge, scoring whether the answer looks correct to a smarter LLM.
The dominant paradigm today is RLVR (Reinforcement Learning with Verifiable Rewards), documented openly for DeepSeek‑R1 and widely reported to underpin reasoning families such as GPT‑5, Claude‑with‑thinking, the "o" series, and Gemini‑Thinking.
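A verifiable reward can be as small as a function that parses the final answer and compares it with a reference. The sketch below assumes, purely for illustration, that the model is asked to put its final answer inside a LaTeX-style boxed command; the function name and the whitespace-stripping rule are hypothetical choices, not a standard.

```python
import re

# Matches \boxed{...} with no nested braces inside.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def exact_match_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer equals the reference, else 0.0."""
    matches = BOXED.findall(response)
    if not matches:
        return 0.0   # no parseable final answer counts as a failure
    predicted = matches[-1].replace(" ", "")
    return 1.0 if predicted == reference.replace(" ", "") else 0.0


print(exact_match_reward("Adding both sides gives \\boxed{42}", "42"))
```

Even a verifier this simple shows where verifier hacking comes from: it checks only the boxed string, so anything the parser accepts but a human would reject becomes an exploitable gap.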
Process‑vs‑Outcome Training
Two reward families are distinguished:
Outcome Reward Model (ORM) / Result‑oriented – a scalar attached to the final answer, often binary (test pass/fail, exact match, correct SQL row), though it can also be graded, for example as partial credit.
Process Reward Model (PRM) / Step‑oriented – each reasoning step receives a score, typically via a separate classifier trained on step‑level human annotations.
From a credit‑assignment perspective, PRM appears superior because a single mistake in a long chain can be pinpointed and penalized at the step where it occurs. However, PRM requires costly step‑level labels.
OpenAI's "Let's Verify Step by Step" (2023) trained a PRM on PRM800K (roughly 800,000 human step‑level labels on math solutions) and achieved noticeable gains on best‑of‑N sampling for the MATH benchmark. DeepSeek‑R1 later adopted a simple result‑reward + GRPO pipeline, avoiding PRM entirely while still showing strong reasoning.
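Best-of-N selection with a PRM can be sketched as follows. The step probabilities are made-up stand-ins for the output of a trained PRM, and the aggregation shown (the product of per-step correctness probabilities) is one common choice; taking the minimum step score is another.

```python
from math import prod


def prm_solution_score(step_probs):
    """Score a solution as the product of its per-step correctness probabilities."""
    return prod(step_probs)


def best_of_n(candidates):
    """Return the answer whose steps are rated most likely to all be correct.

    `candidates` is a list of (answer, step_probs) pairs.
    """
    return max(candidates, key=lambda c: prm_solution_score(c[1]))[0]


candidates = [
    ("x = 4", [0.99, 0.98, 0.40, 0.97]),   # one shaky intermediate step
    ("x = 7", [0.95, 0.94, 0.93, 0.92]),   # uniformly solid steps
]
chosen = best_of_n(candidates)
print(chosen)
```

The first candidate has higher scores on three of four steps but loses because of its single weak step, which is the kind of local error an outcome-only reward cannot localize.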
When to use PRM:
If failure rollouts mostly have correct structure but stall at a specific intermediate error, PRM can help.
If failures are diverse, a larger set of result‑reward rollouts is usually more effective.
If PRM is required, generative PRM approaches (e.g., ThinkPRM) are preferred over classic discriminative classifiers because they need far fewer step‑level labels.
Multi‑Agent RL for LLMs
Research labs are increasingly exploring multi‑agent reinforcement learning (MARL) where the environment consists of other models.
Self‑play – SPIRAL uses zero‑sum games (tic‑tac‑toe, Kuhn Poker, simple negotiation) to train a single LLM, reporting up to 10 % improvement on eight reasoning benchmarks.
Co‑evolutionary roles – SAGE runs four specialized agents (Challenger, Planner, Solver, Critic) with minimal seed data, achieving +8.9 % on LiveCodeBench and +10.7 % on OlympiadBench.
Value decomposition – Agent Q‑Mix (CTDE paradigm) treats agent communication as a cooperative MARL problem, reporting 20.8 % on Humanity's Last Exam, surpassing hand‑crafted pipelines.
Credit assignment is the central challenge in MARL. Three practical levers have emerged:
Process rewards – train a verifier that scores each agent’s contribution (e.g., does the Planner produce a valid sub‑problem?).
Value decomposition (VDN / QMIX / COMA family) – learn a joint value function and decompose it per agent.
Trajectory decomposition – view the multi‑agent system as a POMDP and decompose the full trajectory into state‑action‑reward transitions, propagating credit through the graph (e.g., LightningRL).
Pure result‑reward MARL is safe only when the team is tiny (2‑3 agents), trajectories are short, and enough team‑level rollouts are collected to statistically separate contributions.
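As a rough illustration of trajectory decomposition, the sketch below flattens a multi-agent trace into per-step transitions and spreads one team-level reward backward with a discount. It is a generic toy under stated assumptions, not LightningRL's actual algorithm; the `Transition` class, the function names, and the discounting scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    agent: str
    state: str
    action: str
    reward: float


def decompose_trace(trace, final_reward, gamma=1.0):
    """Turn a list of (agent, state, action) events into per-step transitions.

    The single team-level reward is propagated backward with discount `gamma`,
    so earlier steps receive less credit when gamma < 1.
    """
    n = len(trace)
    return [
        Transition(agent, state, action, final_reward * gamma ** (n - 1 - i))
        for i, (agent, state, action) in enumerate(trace)
    ]


def for_agent(transitions, agent):
    """Select one agent's transitions, e.g. to optimize only the Solver."""
    return [t for t in transitions if t.agent == agent]


trace = [
    ("Planner", "task", "split into sub-problems"),
    ("Solver", "sub-problem 1", "write code"),
    ("Critic", "candidate code", "approve"),
]
transitions = decompose_trace(trace, final_reward=1.0, gamma=0.5)
solver_steps = for_agent(transitions, "Solver")
print(solver_steps)
```

Broadcasting one reward this way is exactly the weak form of credit assignment the text warns about: it cannot tell a good plan from a lucky solve, which is why process rewards or value decomposition are needed as teams and trajectories grow.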
Training Real Agents: Framework Landscape
Most production teams do not train agents from scratch; they stitch together existing tool‑calling frameworks (LangChain, AutoGen, CrewAI, OpenAI Agents SDK, Microsoft Agent Framework) and then attach a GRPO loop.
Approach 1 – Framework‑agnostic, Observability‑driven (Agent‑Lightning)
Agent‑Lightning (Microsoft Research, open‑source Aug 2025, v0.3.0) treats the agent as a black box, captures interactions via observability hooks, and converts traces into standard state‑action‑reward tuples for training. It consists of three components:
Algorithm – decides tasks and learning mode (RL, APO, SFT).
Runner – executes the agent in the existing framework without modification.
LightningStore – shared storage and message queue that coordinates the algorithm and runner.
LightningRL implements hierarchical credit assignment for multi‑step trajectories, enabling selective optimization of a single agent within a multi‑agent system.
Approach 2 – Step‑level MDP, End‑to‑End (Agent‑R1)
Agent‑R1 (University of Science and Technology of China, open‑source Mar 2026, v0.1.0, ~1.4k GitHub stars) models each interaction step as a first‑class RL transition with its own state, action, and observation, rather than a long token sequence. Key design choices:
Native process‑reward support – integrates PRIME‑style reward normalization, allowing clean combination of step‑level and result rewards.
Custom optimizer path – hosts novel algorithms such as PSPO (Proximal Sequence Policy Optimization) that align token‑level optimization with step‑level agent interactions.
Agent‑R1 builds on the distributed training engine verl, and its ecosystem includes PaperScout (academic paper search), TableMind (tool‑augmented table reasoning), and Cast‑R1 (agent‑based time‑series prediction).
Both frameworks share underlying engines: verl (ByteDance’s distributed RLHF/GRPO/agent‑RL backbone), OpenRLHF (early generic RLHF framework), and TRL (Hugging Face’s go‑to tool for DPO and PPO on modest scales).
Choosing the Right Stack
Politeness & style – SFT → DPO: cheap, stable, good for tone shaping.
Refusal or safety behavior – DPO: preference pairs fit naturally.
Mathematics, code, logical reasoning – GRPO + RLVR with result rewards: verifiable signals dominate, and PRM annotation cost is avoided.
End‑to‑end tool‑agent training – GRPO on agent trajectories; pick Agent‑Lightning if you already have a LangChain/AutoGen/CrewAI stack, or Agent‑R1 if you want a ground‑up step‑level MDP.
Exploration without a critic – GRPO provides PPO‑style exploration with half the memory.
Large, high‑quality RM & GPU budget – PPO remains optimal for some hard tasks.
Multi‑role workflows (planner/solver/critic) – start with single‑agent RL, then graduate to MARL; allocate budget for step‑level or agent‑level rewards because team‑level result rewards cannot back‑propagate cleanly.
Future Directions
RLHF will not disappear; it will become a thin specialization layer for style, brand voice, and refusal behavior, while most alignment moves toward verifiable signals.
Validator engineering will emerge as a distinct discipline (sandbox engineers, judge designers, reward calibrators).
Language‑model AlphaZero is arriving: strong foundation + self‑play + verifier + tree search.
Long‑horizon agent RL (multi‑day browsing, coding, experimentation) is the next leap, enabled by RLVR and the emerging agent‑RL frameworks.
Open‑source stacks (TRL, OpenRLHF, verl, Open‑Instruct, Agent‑Lightning, Agent‑R1, RAGEN, MARTI, FlexMARL) are narrowing the gap with well‑funded labs.
Reward hacking will become the central alignment challenge as models outsmart imperfect validators.
Summary of What Has Been Removed Over Time
TRPO removed fragility.
PPO removed second‑order math.
DPO removed the reward model.
GRPO removed the critic.
Result rewards removed the need for step‑level annotation in single‑agent settings.
Agent‑RL frameworks (Agent‑Lightning, Agent‑R1, verl) removed the requirement to rewrite agents for training.
MARL is removing static environments.
The remaining components are a learner, a set of peer learners, and a verifiable signal. RL has long been treated as a side‑track to pre‑training and fine‑tuning, yet the next breakthrough in LLM capability may come not from a bigger transformer but from a smarter training loop surrounding it.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
