From PPO to MaxRL: The Evolution of Reinforcement Learning for LLM Reasoning

This article surveys the rapid evolution of reinforcement-learning algorithms for large-language-model reasoning, from early REINFORCE and PPO to newer approaches such as GRPO, RLOO, DAPO, CISPO, DPPO, ScaleRL, and MaxRL, highlighting their design motivations, mathematical formulations, empirical trade-offs, and open research challenges.

Reinforcement Learning Overview

Reinforcement learning (RL) has become a core component of post-training pipelines for large language models (LLMs), driving the shift from GPT-3 to InstructGPT and powering the current wave of reasoning-capability improvements.

The first generation of RL for LLMs was dominated by Proximal Policy Optimization (PPO), originally designed for Atari and continuous-control robotics tasks and later adapted successfully to RL from Human Feedback (RLHF). The second generation introduced a flood of variants that differ in seemingly minor details, yet those details have profound effects on training dynamics.

REINFORCE

REINFORCE is the simplest policy‑gradient method and the foundation for all subsequent algorithms. In the standard RL setting an agent observes a state s_t, selects an action a_t according to a policy π, receives a reward r_t, and transitions to a new state s_{t+1}. The objective is to maximize the expected discounted return.

For LLMs the environment is simplified: the policy is a parametrized model π_θ, the prompt x (plus previously generated tokens) is the state, the next token is the action, and a scalar reward r(x, y) scores the whole generated response y. The REINFORCE gradient weights the log-probability of the sampled response by its reward; subtracting a baseline b(x) reduces variance, yielding the advantage estimate r(x, y) − b(x).
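
A minimal sketch of this estimator in PyTorch (the tensor layout and the choice of baseline are illustrative assumptions, not any particular library's API):

```python
import torch

def reinforce_loss(logprobs, mask, rewards, baseline):
    """REINFORCE with a baseline for a batch of sampled responses.

    logprobs : (B, T) log pi_theta(y_t | x, y_<t) for the sampled tokens
    mask     : (B, T) 1 for generated tokens, 0 for padding
    rewards  : (B,)   scalar reward r(x, y) for each full response
    baseline : (B,)   baseline b(x), e.g. a running average of rewards per prompt
    """
    advantage = rewards - baseline                  # r(x, y) - b(x)
    seq_logprob = (logprobs * mask).sum(dim=-1)     # log pi_theta(y | x)
    # Maximize E[(r - b) * log pi(y | x)]; the loss is its negative.
    return -(advantage.detach() * seq_logprob).mean()
```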

PPO

PPO became the default RLHF algorithm for several years. Its objective combines a clipped surrogate loss, an importance-sampling ratio between the current policy and the behavior policy that generated the data, and often a KL penalty that keeps the updated policy close to a reference model π_ref. The clipping operation acts as a trust-region mask: it zeroes the gradient for tokens whose ratio moves outside the interval [1 − ε, 1 + ε] in the direction the advantage pushes them, preventing large policy jumps.
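
A minimal sketch of the clipped surrogate at token level (ε = 0.2 and the tensor layout are illustrative; the KL penalty against π_ref is omitted):

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, mask, eps=0.2):
    """Token-level PPO clipped surrogate.

    logprobs / old_logprobs : (B, T) under the current and behavior policies
    advantages              : (B, T) per-token advantages
                              (often one value broadcast across a sequence)
    mask                    : (B, T) 1 for generated tokens, 0 for padding
    """
    ratio = torch.exp(logprobs - old_logprobs)      # importance-sampling ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum means that once the ratio leaves the trust region
    # in the direction the advantage is pushing it, the gradient is zeroed.
    per_token = torch.min(ratio * advantages, clipped * advantages)
    return -(per_token * mask).sum() / mask.sum()
```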

Key practical points:

The importance‑sampling ratio is close to 1 only on the first optimizer step after data generation; later steps treat the data as off‑policy.

Clipping affects both the loss value and its gradient w.r.t. θ, potentially skipping updates when the trust‑region is violated.

GRPO

Group Relative Policy Optimization (GRPO) was introduced in DeepSeekMath and later popularized by DeepSeek-R1. It removes PPO's learned value model and replaces it with a group-wise baseline: for each prompt the algorithm samples a group of G responses, computes their rewards r_i, and normalizes them within the group (subtracting the group mean and dividing by the group standard deviation) to obtain advantages. This eliminates the memory and compute overhead of a separate critic while still giving every response a per-prompt baseline.
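
A minimal sketch of the group-wise baseline, assuming the G responses for one prompt have already been sampled and scored:

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages for G responses sampled from one prompt.

    rewards : (G,) scalar rewards for the sampled responses
    Each advantage is the reward normalized by the group mean and
    standard deviation; no learned value model is needed.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + 1e-6)   # epsilon guards all-equal groups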

RLOO

RLOO (REINFORCE Leave-One-Out) reaches a similar conclusion: PPO's extra components are often unnecessary for LLM fine-tuning. For each prompt it samples K responses and computes the advantage of each response as its reward minus the average reward of the other K − 1 responses. The resulting baseline is unbiased, there is no division by the group standard deviation, and the update reverts to a pure REINFORCE-style step without clipping.
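
A minimal sketch of the leave-one-out baseline, again assuming the K rewards for one prompt are already available:

```python
import torch

def rloo_advantages(rewards):
    """Leave-one-out advantages for K responses sampled from one prompt.

    rewards : (K,) scalar rewards
    Each response is baselined against the mean reward of the other K-1
    responses; there is no division by the group standard deviation.
    """
    k = rewards.numel()
    total = rewards.sum()
    loo_mean = (total - rewards) / (k - 1)   # mean of the other K-1 rewards
    return rewards - loo_mean
```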

Dr. GRPO

Follow-up analysis of GRPO showed that its sequence-level loss averaging introduces a bias that favors short correct answers and long incorrect ones. Dr. GRPO fixes this by normalizing with a fixed constant (such as the maximum generation length) instead of the per-response length, and by dropping the per-group standard-deviation scaling, which can over-amplify tiny reward differences on prompts that are already essentially solved.
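
A minimal sketch of the changed aggregation (the PPO-style ratio and clipping that GRPO also uses are omitted for brevity; max_tokens is the fixed normalization constant):

```python
import torch

def drgrpo_loss(logprobs, mask, rewards, max_tokens):
    """Dr. GRPO-style aggregation, sketched for one prompt's group.

    logprobs   : (G, T) per-token log-probabilities for G responses
    mask       : (G, T) 1 for generated tokens, 0 for padding
    rewards    : (G,)   scalar rewards
    max_tokens : fixed constant used instead of each response's length
    """
    advantages = rewards - rewards.mean()       # mean-centered, no std division
    per_seq = (logprobs * mask).sum(dim=-1)     # sum of token log-probs
    # Divide by a fixed constant rather than the per-response length, so
    # short correct and long incorrect answers are no longer favored.
    return -(advantages.detach() * per_seq).mean() / max_tokens
```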

DAPO

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) builds on GRPO with four refinements; the clipping and sampling changes are sketched after the list:

Token-level loss aggregation replaces per-sample averaging, so tokens in long responses are not down-weighted and each token receives a cleaner learning signal.

Asymmetric clipping uses a higher upper bound ε_{high}=0.28 while keeping the lower bound ε_{low}=0.2, allowing rare but useful tokens to receive larger updates.

A soft penalty for over-long responses (overlong reward shaping) discourages unnecessary length while still tolerating modest overruns past the length budget.

Dynamic sampling keeps generating responses for a prompt until the group contains a mix of positive and negative rewards, filtering out all-correct and all-wrong groups so that every prompt in the batch contributes a learning signal.
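
A minimal sketch of the asymmetric clipping and the dynamic-sampling filter (the advantage computation itself follows GRPO and is assumed to happen elsewhere):

```python
import torch

def dapo_clip_loss(logprobs, old_logprobs, advantages, mask,
                   eps_low=0.2, eps_high=0.28):
    """Token-level loss with the decoupled (asymmetric) clip range."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: divide by the total token count across the
    # whole batch, not per response, so long answers are not down-weighted.
    return -(per_token * mask).sum() / mask.sum()

def keep_prompt(rewards):
    """Dynamic-sampling filter: keep a prompt only if its sampled responses
    are not all-correct or all-wrong, i.e. the group advantages are non-zero."""
    return (rewards.max() > rewards.min()).item()
```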

CISPO

Clipped IS-weight Policy Optimization (CISPO) decouples clipping from the gradient itself. Instead of masking the whole token loss, it clips only the importance-sampling weight and applies a stop-gradient to the clipped weight, so the weight acts as a bounded coefficient rather than a hard gate. This retains the variance-reduction benefit of weight clipping while letting gradients flow for every token, leading to more stable training and up to a 2× speed-up over DAPO in MiniMax's experiments.
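
A minimal sketch of the decoupled weight clipping (the clip bound is illustrative; the point is that the weight is detached rather than used to mask the loss):

```python
import torch

def cispo_loss(logprobs, old_logprobs, advantages, mask, eps_high=2.0):
    """CISPO-style update, sketched.

    The importance-sampling weight is clipped and detached, so it scales
    the update but is never differentiated through, and no token is
    masked out of the gradient.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    weight = torch.clamp(ratio, max=eps_high).detach()
    per_token = weight * advantages * logprobs   # REINFORCE-style term
    return -(per_token * mask).sum() / mask.sum()
```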

MaxRL

Maximum‑Likelihood Reinforcement Learning (MaxRL) reframes the RL objective from maximizing expected reward (pass@1) to a family of objectives indexed by a truncation horizon T. When T=1 the method reduces to standard RL; when T=N it becomes equivalent to maximum‑likelihood estimation. The estimator averages the scores of successful trajectories, yielding an unbiased gradient that both reduces variance and moves the objective closer to maximum‑likelihood as the number of rollouts increases. Empirically MaxRL improves pass@k performance, preserves output diversity better than GRPO, and scales efficiently with compute.

DPPO

Divergence-based PPO (DPPO) questions whether the probability ratio of the sampled token is the right quantity on which to define the trust region. It proposes instead a divergence measure (total variation or KL) between the full policy distributions as the trust-region mask, made tractable by approximate binary or top-K estimators. Experiments show that only a tiny fraction (<0.5%) of updates cause instability; masking those high-divergence updates stabilizes training.
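
A minimal sketch of such a mask using total variation distance over the full vocabulary (the threshold is an assumption, and materializing the full distributions is done here only for clarity; the binary and top-K estimators mentioned above avoid it):

```python
import torch

def divergence_mask(probs_new, probs_old, threshold=0.1):
    """Trust-region mask based on the divergence between full token
    distributions rather than the sampled-token probability ratio.

    probs_new / probs_old : (B, T, V) full next-token distributions
    threshold             : illustrative cutoff on total variation distance
    Returns a (B, T) mask that zeroes out the few high-divergence updates.
    """
    tv = 0.5 * (probs_new - probs_old).abs().sum(dim=-1)   # TV distance per token
    return (tv < threshold).float()
```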

ScaleRL

ScaleRL focuses on how algorithmic choices behave when compute is scaled dramatically (over 400 k GPU‑hours of ablations). Key findings include:

Asynchronous generation‑while‑training pipelines improve hardware utilization without hurting final performance.

Loss‑type comparisons show CISPO and GSPO outperform DAPO at convergence.

Using FP32 logits for the model head reduces mismatch between generation and training kernels.

Prompt‑level loss aggregation yields better results than sample‑level averaging.

Zero‑variance filtering (dropping prompts that are all‑correct or all‑wrong) speeds up training.

These results clarify the trade‑offs between early learning speed and asymptotic performance.
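
As an illustration of two of these choices, a hedged sketch of zero-variance filtering combined with prompt-level loss aggregation (the data layout is an assumption for this example):

```python
import torch

def filter_and_aggregate(per_prompt_losses, per_prompt_rewards):
    """Zero-variance filtering plus prompt-level aggregation, sketched.

    per_prompt_losses  : list of (G_i,) tensors, one loss per sampled response
    per_prompt_rewards : list of (G_i,) tensors, the matching rewards
    Prompts whose rollouts are all-correct or all-wrong carry no learning
    signal and are dropped; each surviving prompt contributes equally.
    """
    kept = [loss.mean()
            for loss, r in zip(per_prompt_losses, per_prompt_rewards)
            if r.max() > r.min()]
    if not kept:
        return torch.tensor(0.0)
    return torch.stack(kept).mean()
```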

Summary of Method Differences

The table below (originally an image) compares the main design choices of each algorithm, such as the presence of a critic, baseline computation, clipping strategy, and trust‑region definition.

[Method comparison table (image not reproduced)]

Open Challenges

Despite rapid progress, several fundamental problems remain:

Credit assignment: Current reward-to-token schemes distribute the same scalar reward across all tokens, which is inefficient for correcting specific failure points.

Sample efficiency: RL for LLMs typically requires 8–64 rollouts per prompt, making training costly. Better reuse of failed samples and smarter prompt selection are open research directions.

Generalization beyond math and code: Most breakthroughs rely on low-cost, binary-reward tasks. Extending these methods to noisy, delayed, or multi-turn interactive settings is still challenging.

Overall, algorithmic innovation is no longer the bottleneck; the community now focuses on efficiency, robustness, and scalability of existing RL‑for‑LLM pipelines.
