Understanding Large‑Model Reinforcement Learning: Algorithms, Frameworks, and Emerging Trends

This article surveys five years of large‑model reinforcement learning, detailing the evolution from PPO + RLHF to DPO and GRPO, comparing reward‑model‑based and verifiable‑reward approaches, discussing multi‑agent extensions, and evaluating open‑source frameworks for training LLM‑driven agents.

Data Party THU

60‑second History

1989 – Q‑learning, the value‑based RL foundation.

1992 – REINFORCE, the policy‑gradient cornerstone.

2013–2015 – DQN surpasses humans on Atari, marrying RL with deep learning.

2016 – AlphaGo defeats Lee Sedol.

2017 – OpenAI releases PPO (Proximal Policy Optimization), which becomes the default RL algorithm for the next five years.

2017 – AlphaZero demonstrates self‑play without human data.

2022 – InstructGPT adapts PPO for human‑preference fine‑tuning; ChatGPT launches shortly after.

Most modern LLM alignment work builds on this lineage: a policy‑gradient method such as PPO, plus a reward signal.

PPO + RLHF: The Beginning

InstructGPT popularized a three‑step pipeline:

SFT – fine‑tune the base model on a small set of human‑written demonstrations.

Reward Model (RM) – show annotators two model outputs, ask which is better, and train a model r(x,y) to predict the preference.

PPO – treat the RM as the environment, sample responses, score them with the RM, and update the policy with PPO while adding a KL penalty to keep the new policy close to the SFT policy. The KL term is weighted by a hyper‑parameter β, the most frequently tuned knob.

The objective is to maximize the expected RM score while preventing collapse to high‑reward but nonsensical outputs.
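As a rough illustration of that objective, the sketch below computes a KL‑penalized sequence reward: the RM score minus β times a single‑sample estimate of the KL divergence between the policy and the SFT reference. The function name, the per‑token log‑probability inputs, and β = 0.1 are assumptions made for illustration, not values taken from InstructGPT.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,  # assumed value, purely illustrative
) -> float:
    """Return rm_score - beta * sum_t (log pi(y_t) - log pi_ref(y_t)).

    The summed log-ratio over the sampled tokens is a single-sample
    estimate of KL(pi || pi_ref) for this response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# Hypothetical numbers: the policy is more confident than the reference
# on the tokens it sampled, so the penalty lowers the RM score slightly.
policy_lp = [-0.2, -0.1, -0.3]
ref_lp = [-0.5, -0.4, -0.6]
shaped = kl_shaped_reward(rm_score=1.0, policy_logprobs=policy_lp, ref_logprobs=ref_lp)
```

Raising β makes drifting away from the SFT policy more expensive, which is why it tends to be the first knob people tune.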

Limitation – PPO + RLHF is a pipeline, not a single algorithm; its cost is primarily engineering, not mathematical.

When PPO Still Makes Sense

Tasks requiring exploration (math, code, long‑range reasoning) rather than pure imitation.

Availability of a high‑quality, stable reward model or a trusted validator.

Sufficient GPU memory to keep four models (policy, frozen reference policy, reward model, critic) in VRAM (e.g., with a 70B policy and same‑size companions, roughly 4 × 70B = 280B parameters' worth of weights).
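The memory arithmetic behind the last point can be sketched as follows. It assumes all four models are the same size and stored in bf16 (2 bytes per parameter), and it counts weights only; gradients, optimizer state, activations, and KV cache come on top and are not modeled here.

```python
def weights_memory_gb(params_billion: float, n_models: int, bytes_per_param: int = 2) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes) for n same-size models."""
    return params_billion * 1e9 * n_models * bytes_per_param / 1e9


# policy + frozen reference + reward model + critic, all assumed to be 70B
ppo_total_params_b = 70 * 4
ppo_weights_gb = weights_memory_gb(70, n_models=4)  # bf16 assumed
```

Under these assumptions the weights alone come to 560 GB, before any training state is added.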

ICML 2024 ("Is DPO Superior to PPO for LLM Alignment?") reported that with equal data quality PPO outperforms DPO by ~2.5 % on math benchmarks and ~1.2 % on general benchmarks, suggesting that PPO remains competitive even in simple settings.

DPO: Removing the Reward Model

Direct Preference Optimization (Rafailov et al., 2023) eliminates the separate reward model. Under the standard RLHF assumptions (Bradley–Terry preference model with KL‑regularized objective), the optimal policy and an implicit reward function have a closed‑form relationship. DPO replaces the two‑step (RM → PPO) process with a single supervised loss defined on preference triples (prompt, chosen, rejected):

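$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here y_w is the chosen response, y_l the rejected one, π_ref the frozen reference (SFT) policy, σ the logistic function, and β the KL weight, following the loss given by Rafailov et al. (2023).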
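For a single preference triple, the DPO loss of Rafailov et al. (2023) can be sketched in a few lines. The inputs are assumed to be summed token log‑probabilities of each full response under the policy and under the frozen reference; the function name and β = 0.1 are illustrative choices, not part of the original article.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value within the commonly used range
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed token log-probability of a full response.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has moved toward the chosen response and away from the rejected one.
improved = dpo_loss(-10.0, -20.0, -15.0, -15.0)
```

The loss only depends on how far the policy has moved relative to the reference on each response, which is why no reward model or rollout is needed.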

Key properties:

Cheaper – same data, but 2–4× less compute because no rollouts are needed.

More stable – a pure supervised loss; training curves are easy to monitor.

Style shaping – can influence refusal behavior, tone, formatting, and chit‑chat usefulness.

β importance – too low lets the policy drift; too high freezes it. Practitioners typically use β in [0.1, 0.5].

Multi‑round possible – iterative DPO, which resamples preferences with the latest policy in each round, tends to yield better results than a single pass.

Limitation – DPO does not explore. If the correct answer never appears in the dataset, DPO cannot invent it, making it unsuitable for tasks that require discovery (e.g., novel math proofs).

GRPO: Removing the Critic

Group Relative Policy Optimization (GRPO), introduced by DeepSeek, discards the learned value function and uses the other rollouts in the same batch as a baseline. The procedure for a single prompt x is:

Sample a group of G rollouts y₁…y_G from the current policy (typical G = 8–64).

Score each rollout with a verifier r_i = R(x, y_i) (the verifier is a non‑learned reward function).

Compute a normalized advantage within the group.

Apply a PPO‑style clipped objective with a KL penalty to a frozen reference policy.

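$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$$

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\left(\min\!\left(\rho_{i,t}\,\hat{A}_i,\ \mathrm{clip}\!\left(\rho_{i,t},\,1-\epsilon_{\mathrm{clip}},\,1+\epsilon_{\mathrm{clip}}\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]\right)\right]$$

$$\rho_{i,t} = \frac{\pi_\theta(y_{i,t}\mid x,\, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,\, y_{i,<t})}$$

Here ε_clip is the PPO clipping range, β the KL weight, and π_ref the frozen reference policy.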
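The group‑normalized advantage and the clipped per‑token objective can be sketched as below. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: it uses the population standard deviation, a small constant in the denominator, the non‑negative KL estimator π_ref/π_θ − log(π_ref/π_θ) − 1, and illustrative defaults for the clipping range and β.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r_i - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_token_objective(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_eps: float = 0.2,  # assumed clipping range
    beta: float = 0.04,     # assumed KL weight
) -> float:
    """Per-token objective (to maximize): clipped surrogate minus beta * KL estimate."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Non-negative KL estimator: pi_ref/pi - log(pi_ref/pi) - 1
    log_ref_ratio = logp_ref - logp_new
    kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
    return surrogate - beta * kl


# Four rollouts for one prompt, scored pass/fail by a verifier.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Every token in rollout i shares the same advantage, so no learned critic is needed to produce a baseline.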

Benefits:

No critic – memory usage drops by roughly half; a 7B policy that needed 16 × H100 for PPO can run on 8 × H100 with GRPO.

Natural fit for verifiable rewards – binary (pass/fail) signals produce clean intra‑group contrasts.

Stable advantage – group normalization mitigates reward‑scale issues.

Works well for reasoning tasks – long‑chain thinking, large‑G sampling, and a good verifier are the backbone of many 2025‑2026 open‑source reasoning models (DeepSeek‑R1, Qwen, OLMo 3, etc.).

Practical pitfalls when first running GRPO:

Group size G – larger G reduces variance but linearly increases rollout cost; most public configs use G = 16–32.

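All‑zero or all‑one groups – if every sample succeeds or every sample fails, the group's standard deviation is zero and so is every centered reward, so the normalized advantage is 0/0 (undefined). Adding ε to the denominator keeps it finite, but the advantage is then zero and the prompt contributes no gradient; filtering out such degenerate prompts avoids wasting rollouts on them.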

KL weight – β too low lets the policy drift into incoherent language; DeepSeek typically uses β ∈ [0.001, 0.04] depending on the training stage.

Reward shape – binary vs. dense rewards lead to dramatically different behaviours; careful selection is required.
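The degenerate‑group pitfall above can be handled with a small filter before the advantage computation. The sketch below is one possible approach; the function name, the tolerance, and the dictionary layout (prompt id mapped to its rollout rewards) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def split_degenerate_groups(
    groups: Dict[str, List[float]], tol: float = 1e-8
) -> Tuple[Dict[str, List[float]], Dict[str, List[float]]]:
    """Separate prompts whose rollouts all received (nearly) the same reward."""
    informative: Dict[str, List[float]] = {}
    degenerate: Dict[str, List[float]] = {}
    for prompt_id, rewards in groups.items():
        if max(rewards) - min(rewards) <= tol:
            degenerate[prompt_id] = rewards
        else:
            informative[prompt_id] = rewards
    return informative, degenerate


batch = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # every rollout passed
    "p2": [0.0, 0.0, 0.0, 0.0],  # every rollout failed
    "p3": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes, useful contrast
}
keep, drop = split_degenerate_groups(batch)
```

Tracking the share of dropped prompts over time is also a cheap signal that the prompt set has become too easy or too hard for the current policy.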

Note – DAPO, GSPO, Dr. GRPO and other variants are incremental refinements that keep the core idea of using a rollout group as a baseline.

Evolution of Reward Signals

Three eras:

PPO + RLHF (2022–2023) – reward comes from a human‑preference‑trained RM; failure modes include flattery and reward hacking; bottleneck is human annotators.

DPO (2023–2024) – the reward is implicit in the preference pairs themselves; the main failure mode is the lack of exploration; the bottleneck shifts to preference‑data quality.

GRPO + RLVR (2024–2026) – reward comes from a verifier (unit tests, judges, regular expressions); failure modes include verifier hacking and tunnel‑vision; bottleneck is verifier design.

The currently dominant paradigm is RLVR (Reinforcement Learning with Verifiable Rewards), which powers DeepSeek‑R1 and reportedly models such as GPT‑5, Claude‑with‑thinking, Gemini‑Thinking, etc. The signal is no longer “human gave a 7/10”, but “unit test passed” or “answer matches the reference”.
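An "answer matches the reference" signal can be as small as the sketch below. The output convention (a final line of the form "Answer: 42") and both function names are hypothetical; real verifiers usually need more robust parsing and a sandbox for anything that executes code.

```python
import re
from typing import Optional

# Hypothetical output convention: the completion ends with "Answer: <number>".
_ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")


def extract_answer(completion: str) -> Optional[str]:
    """Return the numeric string after the final 'Answer:' marker, if any."""
    match = _ANSWER_PATTERN.search(completion.strip())
    return match.group(1) if match else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted number equals the reference, else 0.0."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0  # a missing or malformed answer counts as a failure
    return 1.0 if float(answer) == float(reference) else 0.0


reward_ok = verifiable_reward("We add 40 and 2.\nAnswer: 42", "42")
reward_wrong = verifiable_reward("We add 40 and 1.\nAnswer: 41", "42")
reward_unparsed = verifiable_reward("I think it is 42", "42")
```

Note that the strict format requirement is itself part of the reward design: a model can be correct and still score zero, which is one way verifier design becomes the bottleneck.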

Process‑vs‑Outcome Training

Two reward forms:

Outcome Reward Model (ORM) / End‑answer – a scalar attached to the final answer (often binary: test passed / failed).

Process Reward Model (PRM) / Step‑wise – each reasoning step receives a score, typically from a classifier trained on step‑level human annotations.

From a credit‑assignment perspective, PRM appears superior because a single mistake in a long chain can be pinpointed and penalized locally. Consistent with this, OpenAI’s "Let’s Verify Step by Step" (2023) showed that a PRM trained on roughly 800K human step‑level labels (the PRM800K dataset) outperformed an ORM at best‑of‑N selection on MATH problems, leading many to view PRM as the way forward.

DeepSeek‑R1 later adopted a simple result‑reward + GRPO pipeline (no PRM), yet the resulting model still exhibited strong step‑by‑step reasoning, demonstrating that result‑reward can implicitly provide process‑level signals.

PRM vs. ORM is not the same as dense vs. sparse reward; both can be dense or sparse. The distinguishing factor is step‑wise scoring.
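To make the distinction concrete, the sketch below ranks three candidate solutions for best‑of‑N selection, once with a single outcome score per solution and once with step‑wise scores. Aggregating step scores by their product is one common choice; all numbers and function names are hypothetical.

```python
import math
from typing import Sequence


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Score a solution as the product of per-step correctness probabilities."""
    return math.prod(step_probs)


def best_of_n(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for three sampled solutions to one problem.
orm_scores = [0.70, 0.80, 0.60]  # one scalar per final answer
prm_step_probs = [
    [0.95, 0.90, 0.95],  # every step looks sound
    [0.99, 0.30, 0.99],  # one step is probably wrong
    [0.80, 0.80, 0.80],
]
prm_scores = [prm_solution_score(p) for p in prm_step_probs]

orm_pick = best_of_n(orm_scores)
prm_pick = best_of_n(prm_scores)
```

In this toy case the outcome scorer prefers the second solution, while the step‑wise scorer rejects it because of its single weak step and picks the first.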

LLM‑focused MARL

Research is moving toward multi‑agent RL for LLMs.

Self‑play for reasoning – SPIRAL uses zero‑sum games (tic‑tac‑toe, Kuhn poker, simple negotiation) to train a single LLM; reports up to 10 % gains on eight reasoning benchmarks.

Co‑evolutionary role agents – SAGE runs four specialized agents (Challenger, Planner, Solver, Critic) in a closed loop; shows +8.9 % on LiveCodeBench and +10.7 % on OlympiadBench.

Agent Q‑Mix – under the CTDE paradigm, treats agent communication as a cooperative MARL problem; achieves 20.8 % improvement on Humanity’s Last Exam with Gemini‑3.1‑Flash‑Lite.

Credit‑assignment is the core challenge: in a team of agents, a team‑level reward (1 or 0) does not tell which agent contributed what. Three practical levers:

Process rewards per agent or per step (train a validator that scores each role’s output).

Value decomposition (VDN / QMIX / COMA) to split a joint value function into per‑agent contributions.

Trajectory decomposition (LightningRL) that treats the whole rollout as a POMDP and back‑propagates advantage through the trajectory graph.
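Of the three levers, value decomposition is the easiest to show in a few lines. The sketch below illustrates the VDN assumption only (the joint value is the sum of per‑agent utilities) and checks by brute force that greedy per‑agent choices then recover the best joint action. The agent roles, actions, and numbers are hypothetical, and QMIX and COMA use different, more expressive schemes that are not shown.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical per-agent utilities Q_i(action) for one fixed joint state.
agent_q: List[Dict[str, float]] = [
    {"plan_a": 0.2, "plan_b": 0.7},    # planner
    {"solve_x": 0.5, "solve_y": 0.1},  # solver
]


def joint_q(actions: Tuple[str, ...]) -> float:
    """VDN assumption: Q_tot(a_1..a_n) = sum_i Q_i(a_i)."""
    return sum(q[a] for q, a in zip(agent_q, actions))


# Decentralized execution: each agent greedily picks its own best action.
decentralized = tuple(max(q, key=q.get) for q in agent_q)

# Centralized check: brute-force the best joint action.
centralized = max(product(*(q.keys() for q in agent_q)), key=joint_q)
```

Because the sum is monotone in each term, the two selections agree, and the per‑agent utilities double as a rough per‑agent credit signal.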

Pure result‑reward MARL is only safe when (a) the team is tiny (2‑3 agents), (b) trajectories are short, and (c) enough team‑level rollouts are collected to statistically separate contributions.

Credit‑assignment in MARL remains unsolved; pick the method that matches your failure mode and be ready to iterate.

Training Real Agents: Framework Landscape

By 2025‑2026 most production agents are assembled from libraries such as LangChain, AutoGen, CrewAI, or Microsoft’s Agent Framework, then fine‑tuned with RL. Two emerging frameworks address the engineering gap:

Idea 1 – Framework‑agnostic, observability‑driven (Agent‑Lightning)

Agent‑Lightning (Microsoft Research, open‑sourced Aug 2025, v0.3.0) treats the agent as a black box, captures interactions via observability hooks, and converts traces into standard state‑action‑reward tuples.

Algorithm – decides which tasks to run and which learning algorithm (RL, APO, SFT) to apply.

Runner – executes the agent using the existing framework unchanged.

LightningStore – shared storage and message queue that coordinates algorithm and runner.

LightningRL implements hierarchical credit‑assignment on multi‑step trajectories, enabling selective optimization of a single agent within a multi‑agent system while mixing RL, APO, and SFT.
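The trace‑to‑tuple idea can be pictured with the generic sketch below. It does not use Agent‑Lightning's actual API: the `Transition` dataclass, the `trace_to_transitions` helper, and the trace layout are hypothetical, and the reward assignment shown (all reward on the final call) is the simplest possible choice rather than LightningRL's credit‑assignment scheme.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transition:
    state: str    # what the LLM saw (prompt plus context so far)
    action: str   # what the LLM produced
    reward: float


def trace_to_transitions(llm_calls: List[Dict[str, str]], final_reward: float) -> List[Transition]:
    """Turn the logged LLM calls of one episode into transitions.

    Only the last call receives the episode reward; earlier calls get 0.0.
    """
    transitions = []
    last = len(llm_calls) - 1
    for i, call in enumerate(llm_calls):
        reward = final_reward if i == last else 0.0
        transitions.append(Transition(call["prompt"], call["response"], reward))
    return transitions


# Hypothetical trace captured from an existing agent run.
trace = [
    {"prompt": "Task: total sales for 2024?", "response": "CALL sql('SELECT SUM(amount) FROM sales')"},
    {"prompt": "Tool result: 1200", "response": "Total sales were 1200."},
]
transitions = trace_to_transitions(trace, final_reward=1.0)
```

The point of the design is that the agent code producing the trace stays untouched; only the logged calls are reshaped for the trainer.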

Best for teams that already have LangChain/AutoGen/CrewAI agents and want to start training without rewriting code.

Idea 2 – Step‑level MDP, end‑to‑end (Agent‑R1)

Agent‑R1 (USTC, open‑sourced Mar 2026, v0.1.0, ~1.4k ★ on GitHub) models each interaction as a first‑class RL transition (state, action, observation) rather than a long token sequence.

Native support for process rewards; combines them with result rewards via the PRIME normalization heuristic.
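One generic way to mix step‑level process rewards with a terminal result reward is sketched below. It is not the PRIME normalization heuristic and not Agent‑R1's code; the mixing weight, the discount, and the function name are assumptions chosen only to show how the two signals can end up in a single per‑step return.

```python
from typing import List, Sequence


def step_returns(
    process_rewards: Sequence[float],
    outcome_reward: float,
    process_weight: float = 0.5,  # assumed mixing weight
    gamma: float = 1.0,           # assumed discount factor
) -> List[float]:
    """Discounted return per step; the outcome reward is added at the final step."""
    if not process_rewards:
        raise ValueError("an episode needs at least one step")
    rewards = [process_weight * r for r in process_rewards]
    rewards[-1] += outcome_reward
    returns: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical three-step tool-use episode that ends in a correct answer.
returns = step_returns(process_rewards=[1.0, 0.0, 1.0], outcome_reward=1.0)
```

With these numbers the first step, which earned its own process reward, ends up with a higher return than the uninformative middle step.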

Custom optimizer pipeline; recent work includes PaperScout’s PSPO (Proximal Sequence Policy Optimization) that aligns token‑level optimization with sequence‑level interactions.

Built on the distributed training engine verl (ByteDance), and interoperable with OpenRLHF and TRL for smaller‑scale experiments.

Suitable for teams that want to build a tool‑oriented agent from scratch and need full control over environment definition, step structure, and reward shaping.

Related open‑source stacks:

verl – the de‑facto distributed RLHF/GRPO/agent‑RL backbone (2025‑2026).

OpenRLHF – earlier generic RLHF framework, still widely used for single‑policy training.

TRL (Hugging Face) – primary tool for DPO and PPO on medium‑scale models.

Earlier frameworks include RAGEN, MARTI / MARTI‑v2, FlexMARL, MARL‑GPT, each influencing the current ecosystem.

Choosing Between Them

Politeness & style – start with SFT then DPO (cheap, stable).

Refusal or safety behavior – DPO (preference pairs fit naturally).

Math, code, logical reasoning – GRPO + RLVR with result rewards (verifiable signals dominate, no need for costly PRM).

End‑to‑end tool‑agent training – apply GRPO on agent traces; pick Agent‑Lightning if you have existing LangChain‑style agents, or Agent‑R1 if you are building from scratch.

Exploration without a critic – GRPO (half the VRAM of PPO).

Large, high‑quality RM and ample GPU budget – PPO remains the top choice for hard tasks.

Multi‑role workflows (planner, solver, critic) – start with single‑agent, then graduate to MARL; allocate budget for step‑wise or per‑agent rewards because team‑level rewards cannot back‑propagate cleanly.

A typical post‑training stack in 2026 looks like the diagram below.

[Figure: a typical post‑training stack in 2026]

Future Directions

RLHF will persist as a thin, specialized layer for style, tone, brand voice, and refusal behavior; the bulk of alignment will shift to verifiable rewards.

Validator engineering will become its own discipline (sandbox engineers, judge designers, calibration specialists).

Language‑model AlphaZero will materialize: strong base model + self‑play + verifier + tree search.

Long‑horizon agent RL (multi‑day agents that browse, code, experiment, and revise) is the next leap.

Open‑source stacks (TRL, OpenRLHF, verl, Agent‑Lightning, Agent‑R1, etc.) will continue to narrow the gap with well‑funded labs.

Reward hacking will become a central alignment problem as models outsmart imperfect validators.

Conclusion

Over the past five years the community has been “deleting” components to simplify training:

TRPO removed fragility.

PPO removed second‑order math.

DPO removed the reward model.

GRPO removed the critic.

Result rewards removed the need for step‑wise annotation in single‑agent settings.

Agent‑RL frameworks (Agent‑Lightning, Agent‑R1, verl) removed the requirement to rewrite agents for training.

MARL is removing static environments.

The remaining ingredients are a learner, a set of peer learners, and a verifiable signal. RL should no longer be viewed as a side‑track to pre‑training and fine‑tuning: the next breakthrough in LLM capability may come not from a bigger Transformer but from a smarter training loop surrounding it.
