Artificial Intelligence 8 min read

Llama 2’s Breakthroughs: Architecture, Data, and Training Tricks Explained

Llama 2 advances open‑source large‑model research by expanding context length to 4096, adopting GQA attention, scaling training data to 2 trillion tokens, and introducing refined SFT and RLHF techniques such as Ghost Attention, margin‑based reward modeling, and iterative rejection sampling, all detailed in Meta’s 76‑page report.

Baobao Algorithm Notes

Jul 19, 2023

Llama 2’s Breakthroughs: Architecture, Data, and Training Tricks Explained

Model Architecture

Attention: Models with more than 30 B parameters use Grouped‑Query Attention (GQA) instead of traditional MHA/MQA to improve scalability.

Normalization & activation: RMSNorm, SwiGLU, and Rotary Positional Embedding (RoPE) are retained.

Training Hyper‑parameters

Optimizer: AdamW with β1=0.9, β2=0.95, ε=1e‑5.

Learning‑rate schedule: cosine decay with 2 000 warm‑up steps; peak learning rate decayed to 10 % of its maximum at the end of training.

Weight decay: 0.1.

Gradient clipping: global norm = 1.0.

Training Data

Supervised Fine‑tuning (SFT) dataset: >100 k instruction‑response pairs.

Reinforcement Learning from Human Feedback (RLHF) dataset: >1 M preference comparisons.

Pre‑training corpus: ~2 trillion tokens (≈40 % larger than Llama 1).

SFT Procedure

Optimizer settings as above; cosine scheduler with initial learning rate 2×10⁻⁵, decay factor 0.1.

Batch size = 64, maximum sequence length = 4096 tokens.

Each training example concatenates the prompt and the answer, separated by special tokens (e.g., <|prompt|>, <|answer|>) to keep alignment.

Loss is masked so that only answer tokens contribute to the gradient; prompt token loss is set to zero.

Two epochs of fine‑tuning are performed on the SFT data.

Ghost Attention for Multi‑turn Dialogue

Ghost Attention (GAtt) introduces an auxiliary “ghost” token that carries instruction context across dialogue turns. During training, the instruction is prepended to every user turn (e.g., inst + u₁, a₁, …, inst + uₙ, aₙ). The ghost token allows the model to attend to the persistent instruction while still processing each turn independently, improving performance on long multi‑turn conversations.

Margin‑Based Reward Modeling

The reward model is trained with a margin loss inspired by metric learning. For a pair of responses (good, bad) the loss enforces a margin m such that score(good) − score(bad) ≥ m. This encourages intra‑class compactness and inter‑class separation, yielding a more discriminative reward signal.

Iterative Rejection Sampling in RLHF

RLHF training proceeds through multiple generations (V1 → V5). For each batch of K sampled prompts, only the highest‑scoring response (according to the reward model) is kept for PPO updates. This rejection‑sampling loop is applied to the 70 B‑parameter Llama variant, demonstrating that larger models benefit from more aggressive sample selection.

References

Model weights: https://huggingface.co/meta-llama

Technical report (76 pages): https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

Official homepage: https://ai.meta.com/llama/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

architecture Open-source AI SFT RLHF Llama 2

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.