Llama 2’s Breakthroughs: Architecture, Data, and Training Tricks Explained
Llama 2 advances open‑source large‑model research by expanding context length to 4096, adopting GQA attention, scaling training data to 2 trillion tokens, and introducing refined SFT and RLHF techniques such as Ghost Attention, margin‑based reward modeling, and iterative rejection sampling, all detailed in Meta’s 76‑page report.
Model Architecture
Attention: Models with more than 30 B parameters use Grouped‑Query Attention (GQA) instead of traditional MHA/MQA to improve scalability.
Normalization & activation: RMSNorm, SwiGLU, and Rotary Positional Embedding (RoPE) are retained.
Training Hyper‑parameters
Optimizer: AdamW with β1=0.9, β2=0.95, ε=1e‑5.
Learning‑rate schedule: cosine decay with 2 000 warm‑up steps; peak learning rate decayed to 10 % of its maximum at the end of training.
Weight decay: 0.1.
Gradient clipping: global norm = 1.0.
Training Data
Supervised Fine‑tuning (SFT) dataset: >100 k instruction‑response pairs.
Reinforcement Learning from Human Feedback (RLHF) dataset: >1 M preference comparisons.
Pre‑training corpus: ~2 trillion tokens (≈40 % larger than Llama 1).
SFT Procedure
Optimizer settings as above; cosine scheduler with initial learning rate 2×10⁻⁵, decay factor 0.1.
Batch size = 64, maximum sequence length = 4096 tokens.
Each training example concatenates the prompt and the answer, separated by special tokens (e.g., <|prompt|>, <|answer|>) to keep alignment.
Loss is masked so that only answer tokens contribute to the gradient; prompt token loss is set to zero.
Two epochs of fine‑tuning are performed on the SFT data.
Ghost Attention for Multi‑turn Dialogue
Ghost Attention (GAtt) introduces an auxiliary “ghost” token that carries instruction context across dialogue turns. During training, the instruction is prepended to every user turn (e.g., inst + u₁, a₁, …, inst + uₙ, aₙ). The ghost token allows the model to attend to the persistent instruction while still processing each turn independently, improving performance on long multi‑turn conversations.
Margin‑Based Reward Modeling
The reward model is trained with a margin loss inspired by metric learning. For a pair of responses (good, bad) the loss enforces a margin m such that score(good) − score(bad) ≥ m. This encourages intra‑class compactness and inter‑class separation, yielding a more discriminative reward signal.
Iterative Rejection Sampling in RLHF
RLHF training proceeds through multiple generations (V1 → V5). For each batch of K sampled prompts, only the highest‑scoring response (according to the reward model) is kept for PPO updates. This rejection‑sampling loop is applied to the 70 B‑parameter Llama variant, demonstrating that larger models benefit from more aggressive sample selection.
References
Model weights: https://huggingface.co/meta-llama
Technical report (76 pages): https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/
Official homepage: https://ai.meta.com/llama/
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
