How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention
This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.
Reinforcement Learning with Human Feedback (RLHF)
RLHF fine‑tunes large language models (LLMs) to align their behavior with human preferences and instruction following.
Human Preference Data
Binary comparison data: annotators compare two responses per prompt, a protocol chosen to maximize the diversity of collected prompts.
Annotation process: annotators write a prompt, generate two responses (from different model variants or temperature settings), choose the preferred one, and rate the preference as "clearly better", "better", "slightly better", or "almost equal/unsure".
Metrics: the focus is on helpfulness (how well the response satisfies the request) and safety (absence of harmful content).
Data are collected weekly and incorporated into successive Llama 2‑Chat fine‑tuning cycles.
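To make the collection format concrete, a single comparison record might look like the sketch below; the schema, field names, and label strings are illustrative assumptions, not Meta's actual data format.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One binary human comparison (schema is an illustrative assumption)."""
    prompt: str      # annotator-written prompt
    chosen: str      # response the annotator preferred
    rejected: str    # the other sampled response
    strength: str    # "clearly_better" | "better" | "slightly_better" | "almost_equal"
    dimension: str   # "helpfulness" or "safety"
```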
Reward Model (RM)
The RM receives the prompt, response, and prior context and outputs a scalar score reflecting helpfulness and safety. Two separate RMs are trained (helpfulness and safety), both initialized from the pretrained Llama 2‑Chat checkpoint.
Reward Model Initialization
Start from the Llama 2‑Chat checkpoint to inherit pretrained knowledge.
Replace the next‑token classification head with a regression head that outputs a scalar reward.
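A minimal PyTorch/Transformers sketch of this initialization is below; the checkpoint name, the last-token pooling, and the overall structure are assumptions for illustration, not Meta's actual training code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Reward-model sketch: pretrained backbone + scalar regression head.
    The public HF checkpoint name is a placeholder; Meta initializes from
    its own Llama 2-Chat checkpoints."""
    def __init__(self, base_name: str = "meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)  # no LM head
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool the hidden state of each sequence's final non-padding token.
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.reward_head(pooled).squeeze(-1)  # one scalar per sequence
```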
Training Objective
Convert pairwise preference data into binary ranking labels: the chosen response must receive a higher score than the rejected one.
The loss is a standard pairwise ranking loss with a preference-dependent margin m(r) subtracted from the score difference; the margin is large for "clearly better" pairs and near zero for "almost equal" pairs (a minimal sketch follows).
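The margin-augmented ranking loss can be written in a few lines; this sketch assumes scalar scores from the reward model and precomputed per-pair margins, whose concrete values are an assumption.

```python
import torch
import torch.nn.functional as F

def ranking_loss(chosen: torch.Tensor, rejected: torch.Tensor,
                 margin: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss with a preference-strength margin m(r):
    L = -log(sigmoid(r_chosen - r_rejected - m(r))).
    `margin` is large for "clearly better" pairs and ~0 for
    "almost equal" pairs."""
    return -F.logsigmoid(chosen - rejected - margin).mean()
```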
Iterative RLHF Fine‑Tuning
Core Concepts
Agent: Llama 2-Chat after supervised fine-tuning (SFT).
Environment: the dialogue context (user prompt + previous turns).
State: the current prompt, or the concatenation of the prompt and previous answers.
Action: the model-generated response.
Reward: the scalar score from the RM.
Training Loop
Two algorithms are used across RLHF versions V1–V5:
PPO (Proximal Policy Optimization): an on-policy RL method that maximizes the RM reward while adding a KL-divergence penalty to keep the policy close to the original model.
Rejection Sampling: for each prompt, K candidate responses are sampled, scored by the RM, and the highest-scoring candidate is added to the training set as a new "gold" example (see the sketch after this list).
Up to V4, only rejection-sampling fine-tuning was used; from V5, the two are combined sequentially, with PPO applied on top of the rejection-sampling checkpoint.
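One rejection-sampling step reduces to "sample K, score, keep the best"; the sketch below assumes hypothetical `policy.generate` and `reward_model.score` interfaces, and the K and temperature values are illustrative.

```python
def rejection_sampling_step(policy, reward_model, prompt: str,
                            k: int = 4, temperature: float = 1.0):
    """Sample K candidates, keep the best-scoring one as a new SFT target."""
    candidates = [policy.generate(prompt, temperature=temperature)
                  for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(k), key=scores.__getitem__)
    return prompt, candidates[best]  # treated as a new "gold" example
```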
PPO Details
Objective: maximize the expected RM reward with a KL penalty toward the original (SFT) policy; a minimal sketch of this penalized reward follows the list.
Optimizer: AdamW (β1=0.9, β2=0.95), weight decay 0.1, gradient clipping 1.0.
Learning rate: constant 1e-6 for all model sizes; batch size 512 per PPO iteration; PPO clip threshold 0.2; mini-batch size 64 with one gradient step per mini-batch.
KL coefficient: 0.01 for 7B/13B models, 0.005 for 34B/70B models.
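Putting the objective together, the quantity PPO maximizes can be sketched as the RM score minus a scaled KL penalty; the per-sequence KL estimator below is a simplification of the paper's formulation, shown only to illustrate the shape of the reward.

```python
import torch

def penalized_reward(rm_score: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.01) -> torch.Tensor:
    """Reward optimized by PPO: RM score minus a KL penalty toward the
    frozen reference (SFT) model. beta = 0.01 matches the 7B/13B setting."""
    kl = (logp_policy - logp_ref).sum(dim=-1)  # sum over response tokens
    return rm_score - beta * kl
```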
Ghost Attention (GAtt) for Multi‑Turn Consistency
Early RLHF models tend to forget the initial system instruction after a few turns. GAtt, inspired by Context Distillation, synthetically concatenates the instruction to every user message when sampling multi-turn training data, then keeps it only in the first turn and zeroes the loss on all earlier turns during fine-tuning, forcing the model to keep attending to the instruction across turns.
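A sketch of how a GAtt training example might be assembled; the dialogue format and the helper function are assumptions for illustration, not the paper's actual pipeline code.

```python
def build_gatt_example(instruction: str, turns: list[tuple[str, str]]):
    """Build one GAtt fine-tuning example from sampled (user, assistant) turns.

    During data generation, `instruction` was concatenated to every user
    message; for fine-tuning it is kept only in the first turn, and the loss
    is zeroed on all earlier turns so the model must carry the instruction
    forward on its own."""
    example = []
    for i, (user, assistant) in enumerate(turns):
        user_text = f"{instruction}\n{user}" if i == 0 else user
        example.append({
            "user": user_text,
            "assistant": assistant,
            "compute_loss": i == len(turns) - 1,  # mask loss on earlier turns
        })
    return example
```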
Experimental Results
Model‑Based Evaluation
Internal safety and helpfulness reward models show that Llama 2-Chat surpasses ChatGPT after RLHF-V3, with win rates above 50% on both axes.
To reduce bias from the in-house reward models, GPT-4 was also used as a judge; it reports a >60% win rate for Llama 2-Chat against ChatGPT after the latest RLHF iterations.
Human Evaluation
Human judges also report higher harmlessness and helpfulness for the latest Llama 2‑Chat RLHF versions compared with baseline models.