Unraveling RLHF: From PPO to DPO and Beyond – A Comprehensive Guide

This article provides a thorough overview of RLHF for large language models, covering preference-optimization algorithms (PPO-based and offline-RL approaches), reward-model training techniques, inference-time exploration strategies, practical implementation details including the OpenRLHF framework and resource-allocation tricks, and a closing set of practical observations and open problems.

RLHF for Large Language Models: Technical Survey

This article surveys recent algorithmic advances, practical challenges, and open‑source tooling for reinforcement‑learning‑from‑human‑feedback (RLHF) applied to large language models (LLMs).

1. Preference Optimization Algorithms

PPO-based pipelines first train a reward model (RM) from human preference data and then optimize the policy with Proximal Policy Optimization (PPO). The standard PPO objective includes a KL penalty that keeps the updated policy close to a reference model (a minimal sketch of this objective follows the list below). In LLM settings the pipeline requires four models (actor, reference, reward, critic) and therefore incurs high memory and scheduling costs. Several cost-reduction tricks are commonly used:

Allocate different numbers of GPUs to each model.

Offload parts of the model to CPU or NVMe.

Use vLLM for fast batched generation.

Apply LoRA adapters to reduce the number of trainable parameters.
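
For reference, here is a minimal sketch of the clipped PPO objective combined with a KL penalty toward the reference policy, as described above. The tensor names (`log_probs`, `ref_log_probs`, `advantages`) and the coefficient values are illustrative and not tied to any particular framework; real pipelines typically mask padding tokens and often fold the KL term into the reward instead.

```python
import torch

def ppo_rlhf_loss(log_probs, old_log_probs, ref_log_probs, advantages,
                  clip_eps=0.2, kl_coef=0.05):
    """Sketch of a clipped PPO loss with a KL penalty toward the reference model.

    All inputs are per-token tensors over the generated response, shape
    (batch, seq_len); `advantages` would normally come from the critic via GAE.
    """
    # Importance ratio between the current policy and the behavior (old) policy.
    ratio = torch.exp(log_probs - old_log_probs)

    # Standard PPO clipped surrogate objective.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Approximate KL to the frozen reference model, added here as a penalty.
    approx_kl = (log_probs - ref_log_probs).mean()
    return policy_loss + kl_coef * approx_kl
```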

The offline-RL / Direct Preference Optimization (DPO) family treats preference data as a contextual-bandit problem and optimizes the policy directly, without an explicit RM. Variants include token-level DPO, XPO, and multi-turn DPO; critic-free online methods such as GRPO are discussed in Section 4. Known issues are a gradient imbalance between chosen and rejected examples and a tendency for the chosen log-probability to drop during training. Recent fixes add baseline subtraction, symmetric loss terms, or KL regularization to bring DPO closer to classic RL formulations.
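
For concreteness, the vanilla DPO objective can be written directly on sequence-level log-probabilities; the following is a minimal PyTorch sketch (log-probabilities are assumed to be summed over response tokens, and `beta` is illustrative):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Vanilla DPO loss on sequence-level (token-summed) log-probabilities."""
    # Implicit rewards: scaled log-ratio of policy to reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```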

2. Reward Model Training

Explicit RM training still outperforms DPO on out‑of‑distribution queries. RM designs include:

Pairwise scoring (chosen vs. rejected); a minimal loss sketch follows this list.

Multi‑turn or step‑level scoring.

Token-level reward assignment in which only the EOS token receives the RM score and intermediate tokens receive only the KL baseline.
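
The pairwise scoring above is usually trained with a Bradley-Terry style loss; here is a minimal sketch, assuming the scores are the scalar RM outputs read at the EOS position:

```python
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores, rejected_scores):
    """Bradley-Terry pairwise RM loss: the chosen response should outscore the rejected one.

    Both inputs are 1-D tensors of scalar rewards, one entry per preference pair.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```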

Loss families that address scaling, bias, and multi‑objective weighting are:

ORPO – odds-ratio preference optimization, which folds an odds-ratio preference term into the SFT loss and requires no reference model.

SimPO – length-normalizes the implicit reward and drops the reference model (sketched after this list).

IPO – replaces the logistic preference loss with a bounded squared loss to curb overfitting to the preference data.

KTO – a prospect-theory-inspired objective that learns from unpaired desirable/undesirable labels instead of preference pairs.

List‑wise ranking losses – extend pairwise losses to full rankings.
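
As one concrete member of this family, a SimPO-style loss can be sketched as below; it is reference-free and length-normalized, and the hyperparameter values shown are illustrative rather than recommended settings.

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO-style loss: length-normalized implicit rewards, no reference model.

    `*_logps` are response log-probabilities summed over tokens; `*_len` are
    the corresponding response lengths in tokens.
    """
    chosen_reward = beta * chosen_logps / chosen_len
    rejected_reward = beta * rejected_logps / rejected_len
    # The chosen response must beat the rejected one by a target margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```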

Ensembling multiple RMs (averaging, uncertainty‑weighted fusion, or weighted‑sum) reduces over‑optimization at the cost of extra memory; LoRA can mitigate this overhead.
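
One simple way to realize uncertainty-weighted fusion is to penalize disagreement among ensemble members; the mean-minus-standard-deviation rule below is an illustrative choice, not a prescription from any specific paper.

```python
import torch

def ensemble_reward(scores, uncertainty_coef=1.0):
    """Conservative fusion of scores from an ensemble of reward models.

    `scores` has shape (num_models, batch); subtracting the ensemble's standard
    deviation makes it harder for the policy to exploit any single RM.
    """
    mean = scores.mean(dim=0)
    std = scores.std(dim=0)
    return mean - uncertainty_coef * std
```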

3. Inference‑time Exploration

Beyond training, several inference‑time strategies improve answer coverage:

Best‑of‑N sampling – generate N candidates and select the highest‑scoring one.

Monte‑Carlo Tree Search (MCTS) or advanced beam‑search variants – guide generation toward high‑reward regions with fewer samples.

Parallel exploration scaling – run many generations concurrently to increase throughput.

Empirical studies show that broader sampling raises the probability of finding correct answers, but current RMs often fail to rank them correctly, motivating stronger RM architectures.
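
Best-of-N selection is straightforward to sketch; the `generate` and `score` interfaces below are placeholders, and in practice the N generations would be batched (e.g., through vLLM) rather than looped:

```python
import torch

@torch.no_grad()
def best_of_n(policy, reward_model, prompt, n=16, **gen_kwargs):
    """Best-of-N sampling: draw N candidates and keep the one the RM scores highest."""
    candidates = [policy.generate(prompt, **gen_kwargs) for _ in range(n)]
    scores = torch.tensor([reward_model.score(prompt, c) for c in candidates])
    return candidates[scores.argmax().item()]
```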

4. Open‑Source Frameworks and Implementation Details

The OpenRLHF framework (https://github.com/OpenRLHF) is a Ray-based RLHF library that:

Allows fine‑grained GPU allocation for the actor, reference, reward, and critic models.

Integrates vLLM for high-throughput generation.

Supports offloading and LoRA‑based ensembles to increase batch size while keeping memory usage low.

Key implementation nuances observed in large‑scale experiments:

Preserve the full query when truncating inputs; drop only the trailing segment to avoid “garbage‑in, garbage‑out”.

Handle whitespace tokens explicitly to keep tokenization consistent.

Score only the EOS token in the reward model; intermediate tokens receive only the KL baseline (a sketch follows this list).

LoRA adapters improve parameter efficiency but may slightly degrade reward‑model learning.
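
The EOS-only scoring noted above is typically implemented as per-token reward shaping; the sketch below assumes left-aligned responses and illustrative tensor names, not the exact OpenRLHF implementation.

```python
import torch

def shape_token_rewards(log_probs, ref_log_probs, eos_reward, mask, kl_coef=0.05):
    """Per-token reward shaping used in many PPO-style RLHF pipelines.

    Every response token receives only the KL penalty toward the reference model;
    the reward-model score (`eos_reward`, shape (batch,)) is added at the last
    valid (EOS) position. `mask` marks valid response tokens, and responses are
    assumed to be left-aligned in the tensor.
    """
    kl = log_probs - ref_log_probs                  # approximate per-token KL
    rewards = -kl_coef * kl * mask                  # KL baseline on every token
    last_idx = mask.sum(dim=1).long() - 1           # index of the EOS token per row
    rewards[torch.arange(rewards.size(0)), last_idx] += eos_reward
    return rewards
```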

When GPU resources are limited, practitioners must trade off batch size against full‑parameter updates. Removing the critic (as in GRPO or MDLOO) reduces memory but may affect variance reduction.
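
When the critic is removed, GRPO-style methods replace the learned value baseline with group statistics over several responses sampled for the same prompt; a minimal sketch:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Critic-free, GRPO-style advantages.

    `rewards` has shape (num_prompts, group_size): each prompt has several sampled
    responses, and each response's advantage is its reward normalized within its group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```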

5. Practical Observations and Open Problems

• DPO often exhibits a larger gradient from negative examples, causing the chosen log‑probability to decline. Mitigations include adding an SFT‑style loss for chosen examples or scaling the loss when the policy’s log‑probability falls below the reference.
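
One concrete form of the SFT-style mitigation is to add a negative-log-likelihood term on the chosen responses to the DPO loss; the coefficient and the exact combination below are illustrative.

```python
import torch.nn.functional as F

def dpo_with_chosen_sft(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps,
                        beta=0.1, sft_coef=0.1):
    """DPO loss plus an auxiliary SFT (NLL) term on the chosen responses.

    The extra term pushes back against the tendency of the chosen log-probability
    to fall during DPO training; in practice it is often normalized by length.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    sft = -policy_chosen_logps.mean()  # NLL of the chosen responses (token-summed)
    return dpo + sft_coef * sft
```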

• Length bias: human preferences favor detailed answers, leading RMs to reward longer generations. Normalizing reward by response length or adding a length penalty in the loss can curb runaway verbosity.

• Over-generalization: strong alignment can cause the model to refuse benign queries. Incorporating boundary queries with a weighted SFT loss helps preserve useful capabilities.

• Reference‑model replacement: periodically updating the reference weights (e.g., every N steps) yields better performance than simply relaxing the KL coefficient.

• Reward‑model reliability: scoring only at EOS can make intermediate token rewards noisy. Some works train step‑level or token‑level RMs, or combine them with auxiliary SFT losses to retain semantic knowledge.

• Exploration vs. reward‑model quality: increasing sampling diversity improves coverage, but without a robust RM the correct answer may not be identified. Hence, stronger RMs and better ranking losses are essential for effective inference‑time exploration.

Overall, the survey highlights that while PPO remains a strong baseline for RLHF, offline‑RL approaches like DPO and advanced reward‑model designs are rapidly closing the gap, especially when combined with efficient infrastructure such as OpenRLHF, VLLM, and LoRA.

Tags: reinforcement learning, RLHF, PPO, DPO, LLM optimization, reward modeling, OpenRLHF
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
