What’s the Latest RLHF Landscape? From PPO to ORPO Explained

This article surveys the current RLHF ecosystem, comparing on‑policy methods like PPO with off‑policy approaches such as DPO, and examines recent variants—including ReMax, GRPO, DPOP, TDPO, and ORPO—highlighting their algorithmic differences, resource trade‑offs, and practical performance insights.

Baobao Algorithm Notes

May 30, 2024

Introduction

With the open‑source release of Llama 3, the AI community has renewed focus on alignment, and Reinforcement Learning from Human Feedback (RLHF) has become a central research area. The mainstream RLHF approaches split into two families: on‑policy methods, typified by PPO, and off‑policy methods, typified by DPO.

On‑Policy vs Off‑Policy

On‑policy techniques require the language model to generate responses during training, using those generations to compute rewards and update the model. Off‑policy methods train the model on pre‑collected answer pairs without any generation step, which reduces training time but depends heavily on the quality of the provided data.
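The difference in data flow can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_policy`, `toy_reward`, and the two batch helpers are hypothetical names, and plain strings stand in for model generations.

```python
from typing import Callable, List, Tuple

def on_policy_batch(policy: Callable[[str], str],
                    reward_fn: Callable[[str, str], float],
                    prompts: List[str]) -> List[Tuple[str, str, float]]:
    """Generate with the current policy, then score each generation."""
    batch = []
    for prompt in prompts:
        response = policy(prompt)  # generation happens inside the training loop
        batch.append((prompt, response, reward_fn(prompt, response)))
    return batch

def off_policy_batch(pairs: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Return pre-collected (prompt, chosen, rejected) triples; nothing is generated."""
    return list(pairs)

# Hypothetical stand-ins for a language model and a reward model.
def toy_policy(prompt: str) -> str:
    return prompt + " -> answer"

def toy_reward(prompt: str, response: str) -> float:
    return float(len(response))

on_batch = on_policy_batch(toy_policy, toy_reward, ["q1", "q2"])
off_batch = off_policy_batch([("q1", "good answer", "bad answer")])
```

In the on-policy case the training data changes whenever the policy changes; in the off-policy case the data is fixed before training starts, which is why its quality matters so much.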

On‑Policy Pipeline (PPO)

PPO for large language models uses four components of identical size:

Actor – the model that generates text and is being trained.

Critic – a coach model that predicts the expected reward for each token, updating alongside the actor.

Reward Model – a frozen judge that assigns a scalar score to generated outputs.

Reference Model – a frozen copy of the initial (SFT) actor, used to compute a KL‑penalty that keeps the trained actor close to its starting point and discourages reward hacking.
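How the Reward Model and Reference Model interact can be made concrete with the per-token reward commonly used in PPO-based RLHF: every token pays a KL penalty, and the Reward Model's scalar score is added on the final token. The sketch below assumes this common formulation; the function name `kl_shaped_rewards` and the coefficient `beta` are illustrative and not taken from the article.

```python
from typing import List

def kl_shaped_rewards(policy_logps: List[float],
                      ref_logps: List[float],
                      rm_score: float,
                      beta: float = 0.1) -> List[float]:
    """Per-token rewards: -beta * (log pi - log pi_ref), plus rm_score on the last token."""
    rewards = [-beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards

# The policy has drifted upward on the first token only (log-prob -0.5 vs. -1.0).
rewards = kl_shaped_rewards([-0.5, -2.0], [-1.0, -2.0], rm_score=1.0, beta=0.1)
```

The first token is penalized for drifting away from the Reference Model, while the last token carries the Reward Model's judgment of the whole response.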

When training a 70‑billion‑parameter LLM, all four models must be loaded simultaneously (≈280 B parameters), and half of them (≈140 B) are updated, which explains PPO’s high GPU memory consumption and slow training speed.
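The arithmetic above generalizes to any model size and model count. The helper below is a back-of-the-envelope sketch: `rlhf_footprint` is a hypothetical name, and the 2 bytes per parameter assumes bf16 weights only, ignoring gradients, optimizer state, and activations, which add substantially more for the models being trained.

```python
def rlhf_footprint(params_b: float, n_loaded: int = 4, n_trained: int = 2,
                   bytes_per_param: int = 2) -> dict:
    """Rough totals in billions of parameters and (decimal) GB of weights."""
    loaded_b = params_b * n_loaded
    trained_b = params_b * n_trained
    return {
        "loaded_b": loaded_b,
        "trained_b": trained_b,
        # 1e9 parameters * bytes_per_param bytes is that many GB
        "weights_gb": loaded_b * bytes_per_param,
    }

# PPO on a 70B model: four models loaded, two of them (Actor and Critic) trained.
ppo = rlhf_footprint(70)
```

Under these assumptions the weights alone occupy about 560 GB before any training state is counted.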

ReMax

ReMax removes the Critic entirely and lets the Actor align directly with the Reward Model. It uses the reward of the greedy-decoded response as a baseline, which reduces the number of loaded models to three (≈210 B parameters) and halves the number of trainable parameters. The paper shows that the variance reduction from this baseline stabilizes training.

ReMax also modifies the gradient computation: instead of the standard PPO advantage, it uses the difference between the sampled response's reward and the greedy baseline. This change cuts training time per step roughly in half, as demonstrated on a 4-GPU A800 setup where PPO cannot run without offloading but ReMax can train Llama-7B.

ReMax: https://arxiv.org/pdf/2310.10505
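The ReMax update described above can be sketched as a REINFORCE-style loss in which the greedy response's reward serves as the baseline. This is a minimal illustration with plain Python floats standing in for tensors; `remax_loss` and its arguments are hypothetical names, and a real implementation would compute these quantities with an autograd framework.

```python
from typing import List

def remax_loss(sampled_token_logps: List[float],
               sampled_reward: float,
               greedy_reward: float) -> float:
    """Surrogate loss whose gradient is -(r_sample - r_greedy) * grad(sum of log pi)."""
    advantage = sampled_reward - greedy_reward  # treated as a constant, no gradient
    return -advantage * sum(sampled_token_logps)

# Sampled response scored 0.8 by the Reward Model, greedy response scored 0.5.
loss = remax_loss([-1.0, -0.5], sampled_reward=0.8, greedy_reward=0.5)
```

A sampled response that beats the greedy one gets its log-probability pushed up; one that scores worse gets pushed down; one that ties contributes nothing.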

Group Relative Policy Optimization (GRPO)

GRPO retains PPO's importance-sampling and clipping mechanisms but replaces the Critic-based advantage with a baseline computed by averaging the rewards of multiple responses sampled for the same prompt. This "sample-average" baseline reduces variance without a learned Critic. The KL penalty against the Reference Model is added directly to the loss rather than folded into the per-token reward.

GRPO works best when the underlying SFT model produces relatively low-variance outputs and when a sufficient number of samples per prompt can be drawn.

GRPO: https://arxiv.org/pdf/2402.03300
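The group-relative baseline can be sketched as follows: each response's advantage is its reward minus the mean reward of the group sampled for the same prompt. Dividing by the group's standard deviation follows the normalization described in the GRPO paper; the function names, the use of the population standard deviation, and `eps` are assumptions made for this illustration.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to the other samples for the same prompt."""
    baseline = mean(rewards)        # sample-average baseline, no learned Critic
    scale = pstdev(rewards) + eps   # eps guards against a group of identical rewards
    return [(r - baseline) / scale for r in rewards]

def clipped_surrogate(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four responses to one prompt, two of them rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

If every response in a group receives the same reward, all advantages are zero and the prompt contributes no learning signal, which is one reason the number of samples per prompt matters.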

Off‑Policy Pipeline

Direct Preference Optimization (DPO)

DPO eliminates the need for a Critic and a Reward Model by training directly on preference pairs (a good answer and a bad answer for the same prompt). Only the policy and a frozen Reference Model are required. The loss encourages the policy to increase the probability of the chosen answer while decreasing that of the rejected one, measured relative to the Reference Model.

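import torch.nn.functional as F

def dpo_loss(self, policy_chosen_logps, policy_rejected_logps, reference_chosen_logps, reference_rejected_logps):
    # Each argument holds the summed token log-probabilities of one answer per example.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    # Positive logits mean the policy prefers the chosen answer more than the reference does.
    logits = pi_logratios - ref_logratios
    # Per-example loss; self.beta controls how strongly deviations from the reference count.
    losses = -F.logsigmoid(self.beta * logits)
    return losses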

Paper: https://arxiv.org/pdf/2305.18290
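A scalar version of the loss makes its behavior easy to check by hand. The sketch below mirrors the `dpo_loss` method above with plain floats instead of tensors; `dpo_loss_scalar`, `log_sigmoid`, and the example log-probabilities are made up for illustration.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss_scalar(policy_chosen_logp: float, policy_rejected_logp: float,
                    ref_chosen_logp: float, ref_rejected_logp: float,
                    beta: float = 0.1) -> float:
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -log_sigmoid(beta * (pi_logratio - ref_logratio))

# At initialization the policy equals the reference, so the loss is log(2).
loss_at_init = dpo_loss_scalar(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen answer's log-probability relative to the reference lowers the loss.
loss_after = dpo_loss_scalar(-8.0, -12.0, -10.0, -12.0)
```

Note that the loss depends only on the gap between the two answers, not on the absolute probability of the chosen one.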

Fixing Failure Modes of Preference Optimization (DPOP)

DPO can fail when the good and bad answers share most of their tokens: the loss only rewards a larger gap between the two, so the probability of the good answer can fall as long as the bad answer's falls faster. DPOP adds a regularization term that:

Penalizes the policy whenever the chosen answer's probability under the current policy drops below its probability under the SFT (reference) model.

Vanishes once the chosen answer is at least as likely as under the SFT model, leaving the standard DPO objective to widen the gap, for example by lowering the rejected answer's probability.

DPOP: https://arxiv.org/pdf/2402.13228
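The regularization term can be sketched in scalar form. This follows the DPOP loss as given in the paper, with plain floats instead of tensors; `dpop_loss_scalar`, the penalty weight `lam`, its value, and the example log-probabilities are illustrative assumptions.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpop_loss_scalar(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1, lam: float = 5.0) -> float:
    """DPO logits minus a penalty that is active only when the policy assigns the
    chosen answer a lower log-probability than the reference model does."""
    dpo_logits = ((policy_chosen_logp - ref_chosen_logp)
                  - (policy_rejected_logp - ref_rejected_logp))
    penalty = max(0.0, ref_chosen_logp - policy_chosen_logp)
    return -log_sigmoid(beta * (dpo_logits - lam * penalty))

# Failure mode: both answers lose probability, the rejected one faster.
sinking_dpop = dpop_loss_scalar(-11.0, -15.0, -10.0, -12.0)
sinking_dpo = dpop_loss_scalar(-11.0, -15.0, -10.0, -12.0, lam=0.0)  # lam=0 is plain DPO
# Healthy case: the chosen answer gained probability, so the penalty is zero.
healthy_dpop = dpop_loss_scalar(-9.0, -15.0, -10.0, -12.0)
healthy_dpo = dpop_loss_scalar(-9.0, -15.0, -10.0, -12.0, lam=0.0)
```

In the failure case plain DPO reports a loss below log(2), as if training were going well, while the penalized loss rises above it.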

Token‑level DPO (TDPO)

TDPO augments DPO with a forward KL penalty, encouraging the policy to stay close to the reference distribution while still optimizing preferences. The forward KL is computed as:

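# logits, reference_logits: policy and reference outputs, assumed shape (batch, seq_len, vocab)
vocab_logps = logits.log_softmax(-1)
reference_vocab_logps = reference_logits.log_softmax(-1)  # stay in log space to avoid log(0)
reference_vocab_ps = reference_vocab_logps.exp()
# Forward KL(reference || policy) at each position, summed over the vocabulary
per_position_kl = (reference_vocab_ps * (reference_vocab_logps - vocab_logps)).sum(-1)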

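Because the forward KL penalizes the policy for dropping tokens that the reference model considers likely, this penalty helps preserve output diversity compared with PPO, which uses a backward (reverse) KL.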

TDPO: https://arxiv.org/pdf/2404.11999
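The direction of the KL matters because the divergence is asymmetric. The toy example below computes both directions for two small next-token distributions, using the same formula as the snippet above with plain Python lists; the function name `kl` and the probabilities are made up for illustration.

```python
import math
from typing import List

def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) = sum_i p_i * (log p_i - log q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

reference = [0.5, 0.4, 0.1]   # reference model's next-token distribution
policy = [0.9, 0.05, 0.05]    # policy that has collapsed onto one token

forward_kl = kl(reference, policy)  # weighted by the reference, as in the snippet above
reverse_kl = kl(policy, reference)  # weighted by the policy
```

For these numbers the forward KL is the larger of the two, because the policy assigns little mass to a token that the reference considers plausible.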

Monolithic Preference Optimization without Reference Model (ORPO)

ORPO pushes the resource reduction further by discarding the reference model altogether. Its loss combines a standard SFT cross‑entropy term with an odds‑ratio term that directly maximizes the odds of good samples over bad ones:

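# Log odds ratio: log[p_chosen / (1 - p_chosen)] - log[p_rejected / (1 - p_rejected)]
log_odds = (policy_chosen_logps - policy_rejected_logps) - (torch.log1p(-torch.exp(policy_chosen_logps)) - torch.log1p(-torch.exp(policy_rejected_logps)))
# Beta-weighted log-sigmoid of the log odds ratio; this term is maximized, so it is subtracted from the SFT loss
losses = self.beta * torch.nn.functional.logsigmoid(log_odds)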

Paper: https://arxiv.org/pdf/2403.07691
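How the two terms combine can be shown with scalars. The sketch assumes, as in the ORPO paper, that the final objective is the SFT negative log-likelihood plus a weighted odds-ratio loss, and that the log-probabilities are length-normalized averages; `orpo_loss_scalar`, `log_odds_from_logp`, the weight `lam`, and the numbers are illustrative.

```python
import math

def log_odds_from_logp(logp: float) -> float:
    """log(p / (1 - p)) computed from log p, for logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss_scalar(chosen_logp: float, rejected_logp: float, lam: float = 0.1) -> float:
    sft_nll = -chosen_logp  # cross-entropy on the chosen answer
    log_odds_ratio = log_odds_from_logp(chosen_logp) - log_odds_from_logp(rejected_logp)
    odds_ratio_loss = math.log1p(math.exp(-log_odds_ratio))  # equals -log(sigmoid(x))
    return sft_nll + lam * odds_ratio_loss

# Same chosen probability in both cases; only the rejected answer changes.
equal = orpo_loss_scalar(math.log(0.5), math.log(0.5))
better = orpo_loss_scalar(math.log(0.5), math.log(0.1))
```

No reference log-probabilities appear anywhere: the SFT term anchors the policy, and the odds-ratio term separates good answers from bad ones.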

Conclusion

All these variants aim to lower the heavy resource demands of classic PPO while preserving or even improving alignment performance. The optimal choice depends on available hardware, dataset characteristics, and whether fast iteration (off‑policy) or higher theoretical performance ceilings (on‑policy) are more important for a given application.
