Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization
This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.
Introduction
The Llama 3 report serves as a foundation for understanding modern large language model (LLM) techniques, covering pre‑training, post‑training, inference, and a range of specific technologies such as Reward Modeling (RM), Direct Preference Optimization (DPO), KV‑Cache, Grouped Query Attention (GQA), and PagedAttention.
1. Training Stages
1.1 Pre‑training
Pre‑training predicts the next token on massive corpora. Llama 3 expands the data volume to roughly 15 trillion multilingual tokens, emphasizing data quality over sheer quantity.
1.2 Key Levers
Meta identifies three key levers for high‑quality foundation models: data, scale, and managing complexity.
Data: Emphasis on cleaning, de-duplication, and filtering of PII and adult content.
Scale: Model sizes of 8 B, 70 B, and 405 B are evaluated on benchmarks.
Complexity management: Llama 3 retains a dense Transformer architecture, avoiding Mixture-of-Experts (MoE), and uses a simplified post-training pipeline (SFT → Rejection Sampling → DPO) for stability.
1.3 Benchmark Performance
Llama 3’s models are evaluated on MMLU, IFEval, and other vertical benchmarks (code, math, reasoning, tool use, long‑context, multilingual). Notable observations include:
Significant gaps between 8 B and 70 B across most tasks.
Marginal differences between 70 B and 405 B, suggesting diminishing returns at the largest scale.
IFEval results show 8 B achieving 80.4 % accuracy, 70 B 87.5 %, and 405 B 88.6 %.
2. Pre‑training Details
2.1 Data Curation
Data processing includes:
PII and safety filtering: Removal of personally identifiable information and adult content.
Text extraction and cleaning: Parsing raw HTML and discarding Markdown markers.
De-duplication: URL-level, document-level (global MinHash; a minimal sketch appears at the end of this subsection), and line-level (removing lines that appear more than 6 times within each bucket of 30 M documents).
Heuristic filtering: n-gram repetition removal, toxic-word filtering, and KL-divergence-based outlier detection.
Model-based quality filtering: FastText and RoBERTa-based classifiers to label high-quality samples.
Specialized data: Code, reasoning, and multilingual subsets.
The final mix is roughly 50 % general knowledge, 25 % math/reasoning, 17 % code, and 8 % multilingual.
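To make the global MinHash step concrete, here is a minimal, self-contained sketch of near-duplicate detection; the shingle size, number of hash functions, and the idea of comparing the estimate against a threshold are illustrative choices rather than the exact values used for Llama 3.

import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Summarize a document by the minimum hash of its character shingles."""
    shingles = {text[i:i + shingle_size] for i in range(max(1, len(text) - shingle_size + 1))}
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        )
        signature.append(min_val)
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
sim = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated Jaccard similarity: {sim:.2f}")  # a pipeline would compare this to a threshold

Production pipelines pair such signatures with locality-sensitive hashing so that candidate duplicates can be found without comparing every document pair.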
2.2 Model Architecture
Llama 3 retains a dense Transformer but introduces:
Grouped Query Attention (GQA) with 8 key-value heads to shrink the KV-Cache during decoding.
Extended context window up to 128 K tokens.
Rotary Position Embedding (RoPE) adjustments, including a larger base frequency to support the longer context.
The inference flow consists of tokenization, passing through L transformer blocks, linear projection (lm_head), softmax, and autoregressive sampling.
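This flow can be sketched with Hugging Face Transformers; the checkpoint name and greedy decoding are illustrative assumptions, and the loop below naively re-encodes the whole sequence every step, which is exactly the redundancy the KV-Cache in the next subsection removes.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is illustrative; any causal LM follows the same flow.
name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(16):                                   # generate 16 new tokens
        logits = model(input_ids).logits                  # L transformer blocks + lm_head
        probs = torch.softmax(logits[0, -1], dim=-1)      # distribution over the next token
        next_id = torch.argmax(probs)                     # greedy sampling for simplicity
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(input_ids[0]))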
2.3 KV‑Cache and GQA
During generation, intermediate key/value states are cached (KV‑Cache) to avoid recomputation. Prefill computes KV‑Cache for the entire prompt; decode reuses it for each new token. GQA groups queries so each group shares a KV pair, reducing memory usage.
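A minimal tensor-shape sketch of GQA: with 8 query heads sharing 2 cached key/value heads, each KV head is repeated across its query group before the usual attention product. The head counts and dimensions are illustrative, not Llama 3's actual configuration.

import torch

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 64, 128
group = n_q_heads // n_kv_heads                       # 4 query heads per KV head

q = torch.randn(1, n_q_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)         # cached: 4x smaller than full MHA
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Expand KV heads so each query group attends to its shared key/value pair.
k = k.repeat_interleave(group, dim=1)                 # (1, 8, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v               # (1, 8, seq, head_dim)

Only the 2 KV heads are stored in the cache, so memory drops by the group factor relative to full multi-head attention.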
2.4 Scaling Laws
OpenAI’s scaling law relates compute (FLOPs), model parameters, and data tokens. Llama 3 refines this by predicting downstream task loss (NLL) from compute, then mapping loss to accuracy for specific benchmarks, achieving accurate predictions for tasks like ARC‑Challenge.
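A rough sketch of the two-stage procedure under assumed functional forms: a power law maps compute to downstream NLL, and a sigmoid maps NLL to accuracy. All numbers and the exact forms below are illustrative, not Meta's fitted coefficients.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements from small training runs.
flops = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
nll   = np.array([1.30, 1.18, 1.05, 0.96, 0.88])      # downstream negative log-likelihood
acc   = np.array([0.35, 0.42, 0.52, 0.60, 0.68])      # benchmark accuracy

# Stage 1: power law in compute predicts task NLL.
power_law = lambda c, a, b: a * c ** (-b)
(a, b), _ = curve_fit(power_law, flops, nll, p0=[10.0, 0.05])

# Stage 2: sigmoid maps NLL to accuracy.
sigmoid = lambda x, k, x0: 1.0 / (1.0 + np.exp(k * (x - x0)))
(k, x0), _ = curve_fit(sigmoid, nll, acc, p0=[5.0, 1.0])

target = 3.8e25                                        # e.g. a 405 B scale compute budget
print("predicted accuracy:", sigmoid(power_law(target, a, b), k, x0))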
2.5 Training Recipe
The three‑step pre‑training recipe:
Initial pre-training: AdamW optimizer, linear warm-up over 8,000 steps to the peak learning rate, then cosine decay over roughly 1.2 M steps (a sketch of this schedule follows this list). Batch size grows from 4 M tokens (sequence length 4,096) to 16 M tokens.
Long-context pre-training: Gradual increase of the context length from 8 K to 128 K tokens, using roughly 0.8 T tokens.
Annealing: A final 40 M tokens with the learning rate decayed linearly to zero, improving performance on GSM8K and MATH for the smaller models.
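A minimal sketch of the warm-up-plus-cosine schedule from the initial pre-training step; the peak learning rate is a placeholder, since only the warm-up and decay horizons are quoted above.

import math

def lr_at(step, peak_lr=1e-4, warmup=8_000, total=1_200_000):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

for s in (0, 8_000, 600_000, 1_200_000):
    print(s, f"{lr_at(s):.2e}")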
3. Post‑training (Alignment)
3.1 Reward Model (RM)
RM is trained on human preference data (A ≫ B ≫ C = D) to assign scalar scores reflecting safety, helpfulness, or other criteria. Preference data is collected by sampling responses from multiple models, ranking them, and optionally editing the chosen response.
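The RM objective can be sketched as a pairwise, Bradley-Terry style loss that pushes the chosen response's scalar score above the rejected one's; reward_model here is a stand-in for any network that maps a (prompt, response) pair to a scalar.

import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt, chosen, rejected):
    """Maximize the margin between the chosen and rejected scalar scores."""
    r_chosen = reward_model(prompt, chosen)       # scalar score per example, shape (batch,)
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The preference pairs themselves are stored in a simple format: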
# DPO data format
{
    'prompt': '',
    'chosen': '',
    'rejected': ''
}
3.2 Supervised Fine-tuning (SFT)
SFT uses standard cross‑entropy loss with prompt masking. Data sources include rejection‑sampled responses, synthetic task‑specific data, and a small set of human‑annotated examples.
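A minimal sketch of prompt masking in the Hugging Face convention, where the label -100 excludes prompt tokens from the cross-entropy; the tokenizer object is assumed to be loaded as in the earlier decoding sketch.

import torch

def build_sft_example(tokenizer, prompt, response):
    """Concatenate prompt + response; mask prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response, add_special_tokens=False).input_ids
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids   # -100 is ignored by cross-entropy
    return torch.tensor([input_ids]), torch.tensor([labels])

# loss = model(input_ids=input_ids, labels=labels).loss   # CE computed on response tokens only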
3.3 Rejection Sampling & PagedAttention
For each prompt, K (10‑30) responses are generated; the RM scores them, and the highest‑scoring response becomes SFT data. PagedAttention reduces memory waste by storing KV‑Cache in paged blocks, allowing shared computation for identical prompts across multiple responses.
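Rejection sampling itself reduces to a short best-of-K loop; generate and reward_model are stand-ins for the sampling and scoring calls, and K = 10 is one value from the 10-30 range above.

def rejection_sample(prompt, generate, reward_model, k=10):
    """Sample K candidate responses and keep the one the RM scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return {"prompt": prompt, "response": candidates[best]}   # becomes an SFT example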
3.4 Direct Preference Optimization (DPO)
DPO merges RM and SFT losses, bypassing separate RM training. It optimizes a policy model directly on chosen‑vs‑rejected pairs. Llama 3 masks formatting tokens in the loss to avoid conflicting gradients and adds an NLL term for the chosen response to preserve its probability.
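A condensed sketch of that combined objective; the inputs are summed log-probabilities of each response under the policy and the frozen reference model (with formatting tokens already masked out), and beta and the NLL weight are illustrative hyperparameters.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, nll_weight=0.2):
    """DPO pairwise term plus an NLL term that keeps the chosen response likely."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    pairwise = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
    nll = -policy_chosen_logp.mean()          # NLL term on the chosen response, as described above
    return pairwise + nll_weight * nll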
3.5 Data Processing & Quality Control
Quality pipelines include:
Topic classification using a fine‑tuned classifier.
Quality scoring via RM or Llama‑based prompts (accuracy, instruction‑following, tone).
Difficulty scoring (Instag or Llama‑based intent counts).
Semantic deduplication using RoBERTa embeddings and cosine-similarity thresholds (see the sketch after this list).
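A minimal sketch of the semantic-deduplication step with a greedy cosine threshold; the sentence-transformers encoder and the 0.95 threshold are illustrative stand-ins for the RoBERTa-based setup described above.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder choice

def semantic_dedup(examples, threshold=0.95):
    """Greedily keep an example only if it is not too similar to anything kept so far."""
    embs = encoder.encode(examples, normalize_embeddings=True)
    kept, kept_embs = [], []
    for text, emb in zip(examples, embs):
        if all(float(np.dot(emb, e)) < threshold for e in kept_embs):
            kept.append(text)
            kept_embs.append(emb)
    return kept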
4. Inference Optimizations
4.1 Parallelism
Both data parallelism (replicating the model across devices, synchronizing gradients) and model parallelism (tensor and pipeline parallelism) are employed. Llama 3 405 B uses two nodes with 8 × NVIDIA H100 GPUs each, leveraging NVLink for tensor parallelism and GPipe‑style pipeline parallelism across nodes.
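The core move in tensor parallelism, splitting a linear layer's output columns across devices, can be sketched without a distributed runtime; in practice each shard lives on a different GPU and the final concatenation is an all-gather over NVLink.

import numpy as np

def column_parallel_linear(x, weight, n_devices=8):
    """Split the weight's output columns across devices and gather the partial results."""
    shards = np.array_split(weight, n_devices, axis=1)      # one column block per device
    partial_outputs = [x @ w_shard for w_shard in shards]   # computed independently per device
    return np.concatenate(partial_outputs, axis=-1)         # the "all-gather" of outputs

x = np.random.randn(4, 1024)            # (batch, hidden)
w = np.random.randn(1024, 4096)         # full weight; never materialized on one device in practice
assert np.allclose(column_parallel_linear(x, w), x @ w)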
4.2 Quantization
FP8 quantization, supported natively on H100, is applied to most matrix multiplications (excluding self‑attention parameters and the first/last transformer layers). Row‑wise scaling factors are used. Experiments show up to 50 % throughput increase for 4 K input / 256 output token sequences, with negligible impact on reward‑model scores compared to BF16.
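Row-wise scaling can be sketched as choosing one scale per row so the row's largest magnitude lands at the edge of the FP8 (E4M3) dynamic range, roughly ±448; the sketch below only shows the bookkeeping and omits the actual 8-bit rounding performed by the fused H100 kernels.

import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite magnitude representable in E4M3

def quantize_rowwise(w):
    """One scale per row: the scaled row fits the FP8 range, the scale stays in higher precision."""
    scales = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)                           # avoid division by zero
    w_fp8 = np.clip(w / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)     # values now in FP8 range
    return w_fp8, scales                                          # matmul uses w_fp8, then rescales

w = np.random.randn(4, 8).astype(np.float32)
w_fp8, scales = quantize_rowwise(w)
print(np.max(np.abs(w - w_fp8 * scales)))   # dequantization error (zero here, before FP8 rounding)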
5. Conclusion
The author spent three weekends compiling ~20 k words, acknowledging gaps and promising future refinements. Readers are invited to point out errors or suggest additions.
References
IFEval Dataset – https://paperswithcode.com/dataset/ifeval
LiveBench – https://livebench.ai/
KV‑Cache Optimization notes – https://zhuanlan.zhihu.com/p/697311739
Deep Learning with PyTorch notes – https://zhuanlan.zhihu.com/p/664880302