How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention
This article provides a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback pipeline, covering human preference data collection, reward‑model design and training, iterative fine‑tuning with PPO and rejection sampling, the Ghost Attention technique for multi‑turn consistency, and the resulting experimental evaluations.
Reinforcement Learning with Human Feedback (RLHF)
RLHF fine‑tunes large language models (LLMs) to align their behavior with human preferences and instruction following.
Human Preference Data
Binary comparison data: annotators compare two responses per prompt, a protocol chosen to maximize the diversity of collected prompts.
Annotation process: annotators write a prompt, generate two responses (from different model variants or temperature settings), choose the preferred one, and rate the preference as "clearly better", "better", "slightly better", or "almost equal/unsure".
Metrics: the focus is on helpfulness (how well the response satisfies the request) and safety (absence of harmful content).
Data are collected weekly and incorporated into successive Llama 2‑Chat fine‑tuning cycles.
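To make the collection format concrete, a single comparison record might look like the sketch below; the schema, field names, and label strings are illustrative assumptions, not Meta's actual data format.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One binary human comparison (schema is an illustrative assumption)."""
    prompt: str      # annotator-written prompt
    chosen: str      # response the annotator preferred
    rejected: str    # the other sampled response
    strength: str    # "clearly_better" | "better" | "slightly_better" | "almost_equal"
    dimension: str   # "helpfulness" or "safety"
```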
Reward Model (RM)
The RM receives the prompt, response, and prior context and outputs a scalar score reflecting helpfulness and safety. Two separate RMs are trained (helpfulness and safety), both initialized from the pretrained Llama 2‑Chat checkpoint.
Reward Model Initialization
Start from the Llama 2‑Chat checkpoint to inherit pretrained knowledge.
Replace the next‑token classification head with a regression head that outputs a scalar reward.
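A minimal PyTorch/Transformers sketch of this initialization is below; the checkpoint name, the last-token pooling, and the overall structure are assumptions for illustration, not Meta's actual training code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Reward-model sketch: pretrained backbone + scalar regression head.
    The public HF checkpoint name is a placeholder; Meta initializes from
    its own Llama 2-Chat checkpoints."""
    def __init__(self, base_name: str = "meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)  # no LM head
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool the hidden state of each sequence's final non-padding token.
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.reward_head(pooled).squeeze(-1)  # one scalar per sequence
```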
Training Objective
Convert pairwise preference data into binary ranking labels: the chosen response must receive a higher score than the rejected one.
The loss is a standard pairwise ranking loss with a preference-dependent margin m(r) subtracted from the score difference; the margin is large for "clearly better" pairs and near zero for "almost equal" pairs (a minimal sketch follows).
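The margin-augmented ranking loss can be written in a few lines; this sketch assumes scalar scores from the reward model and precomputed per-pair margins, whose concrete values are an assumption.

```python
import torch
import torch.nn.functional as F

def ranking_loss(chosen: torch.Tensor, rejected: torch.Tensor,
                 margin: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss with a preference-strength margin m(r):
    L = -log(sigmoid(r_chosen - r_rejected - m(r))).
    `margin` is large for "clearly better" pairs and ~0 for
    "almost equal" pairs."""
    return -F.logsigmoid(chosen - rejected - margin).mean()
```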
Iterative RLHF Fine‑Tuning
Core Concepts
Agent: Llama 2-Chat after supervised fine-tuning (SFT).
Environment: the dialogue context (user prompt + previous turns).
State: the current prompt, or the concatenation of the prompt and previous answers.
Action: the model-generated response.
Reward: the scalar score from the RM.
Training Loop
Two algorithms are used across RLHF versions V1–V5:
PPO (Proximal Policy Optimization): an on-policy RL method that maximizes the RM reward while adding a KL-divergence penalty to keep the policy close to the original model.
Rejection Sampling: for each prompt, K candidate responses are sampled, scored by the RM, and the highest-scoring candidate is added to the training set as a new "gold" example (see the sketch after this list).
Up to V4, only rejection-sampling fine-tuning was used; from V5, the two are combined sequentially, with PPO applied on top of the rejection-sampling checkpoint.
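One rejection-sampling step reduces to "sample K, score, keep the best"; the sketch below assumes hypothetical `policy.generate` and `reward_model.score` interfaces, and the K and temperature values are illustrative.

```python
def rejection_sampling_step(policy, reward_model, prompt: str,
                            k: int = 4, temperature: float = 1.0):
    """Sample K candidates, keep the best-scoring one as a new SFT target."""
    candidates = [policy.generate(prompt, temperature=temperature)
                  for _ in range(k)]
    scores = [reward_model.score(prompt, c) for c in candidates]
    best = max(range(k), key=scores.__getitem__)
    return prompt, candidates[best]  # treated as a new "gold" example
```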
PPO Details
Objective: maximize the expected RM reward with a KL penalty toward the original (SFT) policy; a minimal sketch of this penalized reward follows the list.
Optimizer: AdamW (β1=0.9, β2=0.95), weight decay 0.1, gradient clipping 1.0.
Learning rate: constant 1e-6 for all model sizes; batch size 512 per PPO iteration; PPO clip threshold 0.2; mini-batch size 64 with one gradient step per mini-batch.
KL coefficient: 0.01 for 7B/13B models, 0.005 for 34B/70B models.
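Putting the objective together, the quantity PPO maximizes can be sketched as the RM score minus a scaled KL penalty; the per-sequence KL estimator below is a simplification of the paper's formulation, shown only to illustrate the shape of the reward.

```python
import torch

def penalized_reward(rm_score: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.01) -> torch.Tensor:
    """Reward optimized by PPO: RM score minus a KL penalty toward the
    frozen reference (SFT) model. beta = 0.01 matches the 7B/13B setting."""
    kl = (logp_policy - logp_ref).sum(dim=-1)  # sum over response tokens
    return rm_score - beta * kl
```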
Ghost Attention (GAtt) for Multi‑Turn Consistency
Early RLHF models tend to forget the initial system instruction after a few turns. GAtt, inspired by Context Distillation, synthetically concatenates the instruction to every user message when sampling multi-turn training data, then keeps it only in the first turn and zeroes the loss on all earlier turns during fine-tuning, forcing the model to keep attending to the instruction across turns.
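A sketch of how a GAtt training example might be assembled; the dialogue format and the helper function are assumptions for illustration, not the paper's actual pipeline code.

```python
def build_gatt_example(instruction: str, turns: list[tuple[str, str]]):
    """Build one GAtt fine-tuning example from sampled (user, assistant) turns.

    During data generation, `instruction` was concatenated to every user
    message; for fine-tuning it is kept only in the first turn, and the loss
    is zeroed on all earlier turns so the model must carry the instruction
    forward on its own."""
    example = []
    for i, (user, assistant) in enumerate(turns):
        user_text = f"{instruction}\n{user}" if i == 0 else user
        example.append({
            "user": user_text,
            "assistant": assistant,
            "compute_loss": i == len(turns) - 1,  # mask loss on earlier turns
        })
    return example
```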
Experimental Results
Model‑Based Evaluation
Internal safety and helpfulness reward models show that Llama 2-Chat surpasses ChatGPT after RLHF-V3, with win rates above 50% on both axes.
To reduce bias from the in-house reward models, GPT-4 was also used as a judge; it reports a >60% win rate for Llama 2-Chat against ChatGPT after the latest RLHF iterations.
Human Evaluation
Human judges also report higher harmlessness and helpfulness for the latest Llama 2‑Chat RLHF versions compared with baseline models.