How Llama 2 Uses RLHF, PPO, Rejection Sampling, and Ghost Attention
This article is a detailed technical walkthrough of Llama 2's Reinforcement Learning with Human Feedback (RLHF) pipeline: human preference data collection, reward-model design and training, iterative fine-tuning with PPO and rejection sampling, the Ghost Attention technique for multi-turn consistency, and the resulting experimental evaluations.
