Llama 2: Open Foundation and Fine‑Tuned Chat Models – Overview and Technical Details
This article surveys Meta's Llama 2 series: model sizes, pre‑training data, architectural changes, supervised fine‑tuning, RLHF, reward‑model training, safety evaluation, and iterative improvement, along with the open release of the models and their comparative performance.
Introduction
Llama 2 is a family of autoregressive transformer‑based large language models (LLMs) released by Meta, ranging from 7 B to 70 B parameters. Llama 2‑Chat is a version fine‑tuned specifically for dialogue; it outperforms many open‑source chat models on benchmark tests and underwent dedicated helpfulness and safety evaluations.
Model Versions
Llama 2: three publicly released sizes (7 B, 13 B, 70 B) trained on a larger pre‑training corpus, with grouped‑query attention in the larger variants.
Llama 2‑Chat: dialogue‑optimized variants of the same three sizes.
Pre‑training
Data
The pre‑training corpus (~2 trillion tokens) is assembled from publicly available sources such as CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, and StackExchange, with personal‑information‑rich sites removed.
Training Details
The architecture mirrors Llama 1: RMSNorm pre‑normalization, SwiGLU activation, and rotary positional embeddings (RoPE). The main changes are a context length doubled to 4,096 tokens and grouped‑query attention in the larger variants. Training uses the AdamW optimizer, a cosine learning‑rate schedule with a 2,000‑step warm‑up, and gradient clipping, and runs on Meta's Research Super Cluster (RSC) and an internal production cluster, both equipped with NVIDIA A100 GPUs.
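The schedule described above (linear warm‑up followed by cosine decay) can be sketched as follows. The total step count and the decay floor of 10 % of the peak learning rate are taken from the paper's description of pre‑training; the exact `total_steps` value here is illustrative.

```python
import math

def lr_at_step(step, max_lr, warmup_steps=2000, total_steps=500_000, min_ratio=0.1):
    """Cosine learning-rate schedule with linear warm-up.

    Linearly ramps from 0 to max_lr over warmup_steps, then follows a
    cosine curve down to min_ratio * max_lr at total_steps.
    total_steps is illustrative, not a value from the paper.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

At the end of warm‑up the schedule returns the peak rate; at `total_steps` it has decayed to one tenth of it.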
Tokenizer
The same BPE tokenizer as Llama 1 is employed, with a vocabulary size of 32 k.
Fine‑tuning
Supervised Fine‑Tuning (SFT)
High‑quality SFT data (27,540 examples) were curated, emphasizing helpful and safe instruction‑response pairs. Training uses a cosine learning‑rate schedule (peak 2×10⁻⁵), weight decay 0.1, batch size 64, and a sequence length of 4,096 tokens; prompts and responses are concatenated, separated by a special token, with the loss zeroed out on prompt tokens.
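The prompt/response packing for SFT can be sketched as follows. Token ids and the separator/end‑of‑sequence ids are illustrative; the key point is that prompt positions receive an ignore label so the loss only covers the response.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_sft_example(prompt_ids, response_ids, sep_id, eos_id):
    """Concatenate a prompt and response, separated by a special token,
    and build labels that mask the prompt so only response tokens
    contribute to the training loss. A sketch with illustrative ids."""
    input_ids = prompt_ids + [sep_id] + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + response_ids + [eos_id]
    return input_ids, labels
```

With a 3‑token prompt and 2‑token response, the first four label positions (prompt plus separator) carry `IGNORE_INDEX`, while the response and end token keep their ids.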
Reinforcement Learning with Human Feedback (RLHF)
Human preference data were collected as binary comparisons of model outputs, annotated separately for helpfulness and safety. Two reward models (one for helpfulness, one for safety) were trained using a binary ranking loss with a margin term reflecting how strongly one response was preferred. The newly collected preference data were mixed with existing open‑source preference datasets.
Training of Reward Models
Reward models are fine‑tuned for one epoch using AdamW, cosine learning‑rate decay, and a batch size of 512 preference pairs (1,024 rows) per step. Separate models handle helpfulness and safety to avoid trade‑off conflicts between the two objectives.
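The binary ranking loss with a margin can be written out directly. The reward model scores the chosen and rejected responses; the margin grows with the annotated preference strength (a clearly better response gets a larger margin). The specific scores below are illustrative.

```python
import math

def ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Binary ranking loss with a margin term:
        L = -log(sigmoid(r_chosen - r_rejected - margin)).
    The preferred response must beat the rejected one by at least
    `margin` before the loss becomes small; margin is scaled by how
    strongly annotators preferred the chosen response."""
    logit = score_chosen - score_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-logit)))
```

When the two scores are equal the loss is log 2, and increasing the margin raises the loss for the same score gap, pushing clearly‑preferred responses further apart.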
Iterative RLHF
Multiple RLHF iterations (v1–v5) were performed, alternating between rejection‑sampling fine‑tuning and Proximal Policy Optimization (PPO). The largest (70 B) model generated candidate responses, which were scored, filtered, and used to fine‑tune the smaller models.
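The rejection‑sampling step reduces to best‑of‑K selection: draw K candidates and keep the one the reward model scores highest, then use the kept samples as fine‑tuning targets. In this sketch, `generate` and `reward` are hypothetical stand‑ins for the policy model and the trained reward model.

```python
def rejection_sample(prompt, generate, reward, k=4):
    """Best-of-K rejection sampling: sample k candidate responses for a
    prompt and keep the one with the highest reward-model score.
    `generate` and `reward` are placeholders for the policy model and
    reward model, not real APIs."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))
```

In the paper's setup K varies across iterations; higher K raises the expected best reward at the cost of more generation compute.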
Proximal Policy Optimization (PPO)
PPO updates follow the standard formulation, incorporating a KL‑penalty term to stabilize training and mitigate reward hacking.
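The KL‑penalized reward can be sketched as follows: the reward‑model score is reduced by a penalty proportional to how far the current policy has drifted from the initial (reference) model on the generated tokens. The `beta` value is illustrative, and the log‑probabilities are assumed to be summed over the generation.

```python
def ppo_reward(rm_score, logprob_policy, logprob_ref, beta=0.01):
    """Final reward used during PPO: the reward-model score minus a
    KL penalty that keeps the policy close to the reference model and
    discourages reward hacking. beta is an illustrative coefficient;
    log-probs are sums over the generated tokens."""
    kl_estimate = logprob_policy - logprob_ref  # per-sample KL estimate
    return rm_score - beta * kl_estimate
```

When the policy matches the reference model the penalty vanishes; as the policy concentrates on high‑reward outputs the penalty grows, trading reward against drift.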
Evaluation and Results
Extensive benchmark tables compare Llama 2 and Llama 2‑Chat against open‑source and closed‑source models, showing competitive or superior performance. The reward models outperform baselines, including GPT‑4, on internal helpfulness and safety test sets.
Appendix
Model Download Links
Llama‑2‑7b‑hf: https://huggingface.co/meta-llama/Llama-2-7b-hf
Llama‑2‑13b‑hf: https://huggingface.co/meta-llama/Llama-2-13b-hf
Llama‑2‑70b‑hf: https://huggingface.co/meta-llama/Llama-2-70b-hf
… (additional variants and chat models)
Glossary
Red Teaming – adversarial testing of model vulnerabilities.
PPO – Proximal Policy Optimization, a policy‑gradient RL algorithm.
RMSNorm – Root‑Mean‑Square Layer Normalization.
Cosine Learning Rate Decay – schedule that gradually reduces the learning rate following a cosine curve.
Ghost Attention (GAtt) – technique to preserve instruction adherence over multiple dialogue turns.