Demystifying RLHF and PPO for Large Language Models: Theory and Practice
This article explains why Reinforcement Learning from Human Feedback (RLHF) is crucial for LLM intelligence, outlines the three-stage training pipeline, details InstructGPT's reward model and PPO optimization, and provides a practical guide to implementing RLHF with deep‑learning frameworks.
Large language models (LLMs) such as ChatGPT are trained in three sequential stages:
Pre‑training: unsupervised next‑token prediction on massive text corpora.
Supervised Fine‑Tuning (SFT): the model is fine‑tuned on a curated set of human‑written prompt‑response pairs. The data are diverse but limited because annotation is costly.
Reinforcement Learning from Human Feedback (RLHF): for each prompt the model generates multiple responses; humans rank them, producing preference labels. A Reward Model (RM) is trained on these rankings and then used as the reward function in Proximal Policy Optimization (PPO) to further improve the SFT model.
Pre‑training imposes the highest compute barrier, RLHF the highest algorithmic barrier, while SFT is comparatively lightweight. The first stage largely determines the model’s capability ceiling; RLHF steers the model toward human preferences while trying to keep the “alignment tax” (a drop in general ability caused by alignment) as small as possible.
RLHF in InstructGPT
InstructGPT’s RLHF consists of two sub‑steps:
Training a Reward Model (RM).
Combining the RM with the SFT model to form a reward function and optimizing the policy with PPO.
Datasets
The InstructGPT paper reports roughly 13k training prompts for SFT, 33k for the RM, and 31k for PPO. Each RM prompt is paired with 4–9 model responses that are ranked by annotators.
Training the Reward Model
Goal: assign a scalar score to a prompt‑response pair that matches human preference.
Model architecture: In the original InstructGPT experiments GPT‑3 variants of 1.3 B, 6 B and 175 B parameters were explored. The 6 B model was used as the RM because it saves compute and trains more stably than the 175 B variant. The final unembedding layer is replaced by a linear head that projects the hidden state of the last token to a single scalar.
Initialization: The 6 B GPT‑3 is first fine‑tuned on a mixture of public QA and reasoning datasets (ARC, BoolQ, CoQA, DROP, MultiNLI, OpenBookQA, QuAC, RACE, Winogrande). Starting from a pretrained checkpoint or an SFT checkpoint yields similar performance.
Training procedure: Treating every pairwise comparison as a separate training example overfits and is inefficient, because each response then takes part in up to K−1 separate gradient updates per epoch. Instead, all comparisons from one prompt are trained on as a single batch element: the RM scores each of the K responses once and the loss is computed over every pair. This reduces the number of forward passes per prompt from K(K−1)/2, i.e. O(K²), to K. The loss is the pairwise cross‑entropy averaged over all pairs, where y_w is the better‑ranked response of a pair, y_l the other, r_φ(x, y) the RM score and σ the logistic function:
$$\mathcal{L}_{RM}(\phi) = -\frac{1}{\binom{K}{2}} \, \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]$$
Minimizing it pushes the better‑ranked response toward a higher score.
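The pairwise ranking loss for a single prompt can be sketched in plain Python as follows. This is a minimal illustration, not the InstructGPT code: the function names and the example scores are hypothetical, and the scores are assumed to be already sorted from best‑ranked to worst‑ranked response.

```python
import math
from itertools import combinations
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranking_loss(scores_best_first: Sequence[float]) -> float:
    """Mean pairwise loss for one prompt.

    `scores_best_first` holds the K scalar RM scores, ordered from the
    response annotators ranked best to the one they ranked worst.
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the better-ranked ("winning") response.
    pairs = list(combinations(scores_best_first, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")
    return -sum(log_sigmoid(w - l) for w, l in pairs) / len(pairs)


# Hypothetical scores for K = 4 responses to one prompt.
good = ranking_loss([2.0, 1.0, 0.0, -1.0])   # scores agree with the ranking
bad = ranking_loss([-1.0, 0.0, 1.0, 2.0])    # scores contradict the ranking
tied = ranking_loss([0.0, 0.0, 0.0])         # uninformative scores
```

Each of the K scores is computed once and reused in every pair it belongs to, which is where the saving over scoring each pair separately comes from.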
After training, a bias term is subtracted from the RM outputs so that the average reward on human‑written demonstrations is zero (the exact calibration set is not disclosed).
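The calibration step amounts to a mean shift. A minimal sketch, assuming a hypothetical list of raw RM scores on human‑written demonstrations (the actual calibration set is not disclosed, as noted above):

```python
from statistics import fmean
from typing import Sequence


def fit_reward_bias(demo_scores: Sequence[float]) -> float:
    """Bias = mean raw RM score over the calibration demonstrations."""
    return fmean(demo_scores)


def calibrated_reward(raw_score: float, bias: float) -> float:
    """Shift a raw RM score so the calibration set averages to zero."""
    return raw_score - bias


# Hypothetical raw scores on four human-written demonstrations.
demo_scores = [1.5, 2.5, 3.5, 0.5]
bias = fit_reward_bias(demo_scores)
calibrated = [calibrated_reward(s, bias) for s in demo_scores]
```

The shift does not change the ranking the RM induces; it only fixes the zero point of the reward before RL starts.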
Reinforcement Learning Objective
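The RL stage maximizes the expected reward while penalizing deviation from the SFT policy:
$$J(\theta) = \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)} \right] - \lambda \, L_{ptx}(\theta)$$
where π_θ is the policy (the RL‑trained LLM), π_ref is the frozen SFT model, r_φ is the reward model and β sets the strength of the penalty. The first term rewards high RM scores; the log‑ratio inside it decomposes into a sum over tokens and equals KL(π_θ‖π_ref) in expectation, so it acts as a per‑token KL penalty that keeps the policy close to the SFT baseline (serving as an entropy bonus and stabilizer). The optional second term uses L_{ptx}(θ) = −E_{x∼D_pretrain}[log π_θ(x)], the standard next‑token prediction loss on the original pre‑training data, weighted by λ. Including it gives the PPO‑ptx variant; λ = 0 recovers plain PPO.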
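A Monte Carlo estimate of this objective from a handful of sampled responses might look like the sketch below. All names and numbers are hypothetical; in particular β = 0.1 and λ = 0.1 are placeholders chosen for easy arithmetic, not values taken from the source.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

# One sample = (RM score, per-token log-probs under the policy,
#               per-token log-probs under the frozen SFT reference).
Sample = Tuple[float, Sequence[float], Sequence[float]]


def penalized_reward(sample: Sample, beta: float) -> float:
    """RM score minus beta times the summed per-token log-ratio."""
    rm_score, logp_policy, logp_ref = sample
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have equal length")
    log_ratio = sum(p - q for p, q in zip(logp_policy, logp_ref))
    return rm_score - beta * log_ratio


def objective_estimate(samples: List[Sample],
                       pretrain_logps: Sequence[float],
                       beta: float, lam: float) -> float:
    """Sample estimate of the RL term minus lam times the ptx loss."""
    rl_term = fmean([penalized_reward(s, beta) for s in samples])
    ptx_loss = -fmean(pretrain_logps)  # negative log-likelihood
    return rl_term - lam * ptx_loss


samples = [
    (1.0, [-0.5, -1.0, -0.2], [-0.7, -1.5, -0.2]),  # policy drifted from SFT
    (0.0, [-1.0], [-1.0]),                          # identical to SFT
]
pretrain_logps = [-2.0, -4.0]  # log-likelihoods of two pre-training sequences

plain_ppo = objective_estimate(samples, pretrain_logps, beta=0.1, lam=0.0)
ppo_ptx = objective_estimate(samples, pretrain_logps, beta=0.1, lam=0.1)
```

In real training these quantities are tensors and the estimate is differentiated with respect to the policy parameters; the scalar version only shows how the terms combine.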
PPO Algorithm Overview for LLMs
PPO is an actor‑critic method that reuses each batch of sampled trajectories for several gradient steps. To keep that reuse safe, it clips the probability ratio between the new and old policies to the interval [1 − ε, 1 + ε], so the clipped surrogate objective gives no incentive for large policy updates that could destabilize training.
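For a single token, the clipped surrogate can be sketched as below. The function name is hypothetical, and the clip range ε = 0.2 is a commonly used default, not a value stated in this article.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token PPO surrogate (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Taking the minimum of the unclipped and clipped terms makes the
    objective a pessimistic bound on the unclipped one.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


up = math.log(1.5)    # new policy makes the token 1.5x more likely
down = math.log(0.5)  # new policy makes the token half as likely

gain_capped = clipped_surrogate(up, 0.0, advantage=1.0)     # capped at 1.2
loss_kept = clipped_surrogate(up, 0.0, advantage=-1.0)      # stays at -1.5
small_gain = clipped_surrogate(down, 0.0, advantage=1.0)    # stays at 0.5
loss_floor = clipped_surrogate(down, 0.0, advantage=-1.0)   # floored at -0.8
```

The four cases show the asymmetry: moves that would improve the objective stop paying off once the ratio leaves the clip range, while moves that hurt it are never hidden by the clip.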
LLM‑Specific PPO Setup
Four networks are involved in the RLHF loop:
Actor: the LLM initialized from the SFT checkpoint (π_θ).
Critic: a value‑function network initialized from the Reward Model; it estimates the expected return for each state.
Frozen SFT model: provides the KL‑penalty term by computing log π_ref for each token.
Frozen Reward Model: supplies the scalar reward for each generated response.
The actor is updated with PPO using the combined reward (RM output minus KL penalty) while the critic learns to predict the value function.
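One common way to turn the combined reward into per‑token training signals is sketched below. Two choices here are assumptions rather than details given in this article: the RM score is credited to the last generated token, and advantages are computed with Generalized Advantage Estimation (GAE). All names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def shaped_rewards(rm_score: float, logp_policy: Sequence[float],
                   logp_ref: Sequence[float], beta: float) -> List[float]:
    """Per-token reward: KL penalty everywhere, RM score on the last token."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need equal-length, non-empty log-prob sequences")
    rewards = [-beta * (p - q) for p, q in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float = 1.0, gae_lambda: float = 0.95
        ) -> Tuple[List[float], List[float]]:
    """Advantages for the actor and return targets for the critic."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # value after the final token is zero (episode ends)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = shaped_rewards(rm_score=1.0,
                         logp_policy=[-0.5, -1.0, -0.2],
                         logp_ref=[-0.7, -1.5, -0.2],
                         beta=0.1)
# With zero value estimates and gae_lambda = 1, advantages reduce to
# the plain reward-to-go at each token.
advantages, returns = gae(rewards, values=[0.0, 0.0, 0.0],
                          gamma=1.0, gae_lambda=1.0)
```

The advantages feed the actor's clipped PPO update, and the returns serve as regression targets for the critic.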
DeepSpeed‑Chat Integration
DeepSpeed‑Chat implements the above architecture by wiring the four networks together, enabling efficient large‑scale RLHF training. The schematic is shown below.
With this pipeline, practitioners can reproduce the RLHF process described in InstructGPT: collect human rankings, train a scalar reward model, calibrate its outputs, and then run PPO with KL regularization to obtain a policy that aligns with human preferences while preserving as much of the pre‑training capability as possible.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
