How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention

This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.


Reinforcement Learning with Human Feedback (RLHF)

RLHF fine‑tunes large language models (LLMs) to align their behavior with human preferences and instruction following.

Human Preference Data

Binary comparison data: collected to maximize prompt diversity.

Annotation process: annotators write a prompt, generate two responses (from different model variants or temperature settings), choose the preferred one, and optionally rate the preference as “clearly better”, “better”, “slightly better”, or “almost equal/unsure”.

Metrics: annotations focus on helpfulness (how well the response satisfies the request) and safety (absence of harmful content).

Data are collected weekly and incorporated into successive Llama 2‑Chat fine‑tuning cycles.

Reward Model (RM)

The RM receives the prompt, response, and prior context and outputs a scalar score reflecting helpfulness and safety. Two separate RMs are trained (helpfulness and safety), both initialized from the pretrained Llama 2‑Chat checkpoint.

Reward Model Initialization

Start from the Llama 2‑Chat checkpoint to inherit pretrained knowledge.

Replace the next‑token classification head with a regression head that outputs a scalar reward.
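The two initialization steps can be sketched in PyTorch. The `RewardModel` wrapper, its `backbone` interface, and the last‑token pooling are illustrative assumptions, not the actual Llama 2 implementation:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch of a reward model: a pretrained transformer backbone
    whose next-token LM head is replaced by a scalar regression head.
    The backbone interface here is hypothetical."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                       # pretrained decoder
        self.reward_head = nn.Linear(hidden_size, 1)   # scalar reward head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)   # (B, T, H)
        # Pool the hidden state of the last non-padded token per sequence.
        last_idx = attention_mask.sum(dim=1) - 1            # (B,)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.reward_head(last_hidden).squeeze(-1)    # (B,) scores
```

The scalar output can then be trained directly with the ranking objective described next.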

Training Objective

Convert pairwise preference data into binary ranking labels; the chosen response must receive a higher score than the rejected one.

Loss combines a standard pairwise ranking term with a margin loss that reflects the degree of preference (e.g., “clearly better” vs. “almost equal”).
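A minimal PyTorch sketch of this objective, i.e. −log σ(r_chosen − r_rejected − m), where the margin m grows with the annotated preference strength (function name and tensor interface are assumptions; the exact margin values per label are not reproduced here):

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards, rejected_rewards, margins):
    """Pairwise ranking loss with a preference-strength margin:
    -log(sigmoid(r_chosen - r_rejected - m)), averaged over the batch.
    A larger margin (e.g. for a 'clearly better' pair) forces a wider
    gap between the chosen and rejected scores."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margins).mean()
```

For a fixed margin, the loss shrinks as the score gap widens, which is exactly the ranking behavior the RM is trained toward.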

Reward model loss diagram

Iterative RLHF Fine‑Tuning

Core Concepts

Agent: Llama 2‑Chat after supervised fine‑tuning (SFT).

Environment: the dialogue context (user prompt + previous turns).

State: the current prompt, or the concatenation of the prompt and previous answers.

Action: the model‑generated response.

Reward: the scalar score from the RM.

Training Loop

Two algorithms are used across RLHF versions V1–V5:

PPO (Proximal Policy Optimization) : an on‑policy RL method that maximizes the RM reward while adding a KL‑divergence penalty to keep the policy close to the original model.

Rejection Sampling : for each prompt, K candidate responses are sampled, scored by the RM, and the highest‑scoring candidate is added to the training set as a new “gold” example.

Up to V4, only rejection sampling is used; after that the two are combined sequentially, with PPO refining the policy on top of the rejection‑sampling checkpoint.
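Best‑of‑K rejection sampling can be sketched as follows; `policy.generate` and `reward_model.score` are hypothetical interfaces, not a specific library API:

```python
def rejection_sample(policy, reward_model, prompt, k=4, temperature=1.0):
    """Sketch of rejection sampling (best-of-K): draw K candidate
    responses from the current policy, score each with the reward
    model, and keep the highest-scoring one as a new 'gold' training
    example for the next fine-tuning round."""
    candidates = [policy.generate(prompt, temperature=temperature)
                  for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

Larger K explores more of the policy's output distribution at the cost of K forward generations per prompt.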

PPO Details

Objective: maximize expected RM reward with a KL penalty.

Optimizer: AdamW (β1=0.9, β2=0.95), weight decay 0.1, gradient clipping 1.0.

Learning rate: constant 1e‑6 for 7B/13B models, 5e‑6 for 70B; batch size 512; PPO clip 0.2; mini‑batch size 64.

KL coefficient: 0.01 for 7B/13B models, 0.005 for 34B/70B models.
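The shaped reward that PPO optimizes can be written as RM(x, y) − β·KL(π_θ ‖ π_ref). A minimal sketch, assuming the common per‑token log‑probability approximation of the KL term (the function signature is illustrative, not from the Llama 2 codebase):

```python
def shaped_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.01):
    """Sketch of the PPO reward used in RLHF: the reward-model score
    minus a KL penalty keeping the policy near the reference (initial)
    model. KL is approximated per token as
    log pi_theta(y_t) - log pi_ref(y_t), summed over the response.
    beta=0.01 matches the 7B/13B setting quoted above."""
    kl = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return rm_score - beta * kl
```

When the policy drifts from the reference model the KL sum grows, shrinking the effective reward and pulling the policy back.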

PPO hyperparameters

Ghost Attention (GAtt) for Multi‑Turn Consistency

Early RLHF models tend to forget the initial system instruction after a few dialogue turns. GAtt, inspired by Context Distillation, concatenates a synthetic system instruction to every user message during fine‑tuning and zeroes out the loss on assistant messages from earlier turns, forcing the model to keep attending to the instruction across the whole conversation.
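The data‑construction step can be sketched as below; the `(role, text)` dialogue format and function name are illustrative assumptions, not the Llama 2 implementation (and the loss‑zeroing on earlier assistant turns happens separately, at training time):

```python
def apply_gatt(dialogue, instruction):
    """Sketch of Ghost Attention data construction: concatenate the
    system instruction to every user turn of a fine-tuning dialogue,
    so the model learns to respect it throughout the conversation.
    At inference time the instruction appears only once, in the first
    turn. `dialogue` is a list of (role, text) pairs."""
    augmented = []
    for role, text in dialogue:
        if role == "user":
            augmented.append((role, instruction + "\n" + text))
        else:
            augmented.append((role, text))
    return augmented
```

Because every user turn now carries the instruction, the model cannot satisfy the training objective by attending to it only in the first turn.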

Ghost Attention diagram

Experimental Results

Model‑Based Evaluation

Internal safety and helpfulness reward models show that, from RLHF‑V3 onward, Llama 2‑Chat achieves win rates above 50% against ChatGPT on both metrics.

GPT‑4 based human evaluation reports a >60% win rate for Llama 2‑Chat against ChatGPT after the latest RLHF iterations.

Model evaluation results

Human Evaluation

Human judges also report higher harmlessness and helpfulness for the latest Llama 2‑Chat RLHF versions compared with baseline models.

Human evaluation results
Written by NewBeeNLP