Inside Llama 3: A Complete Guide to Modern LLM Training, Architecture, and Optimization
This article provides a thorough, yet concise, overview of Llama 3’s training pipeline, data handling, model architecture, scaling laws, post‑training techniques like SFT and DPO, and inference optimizations such as KV‑Cache, GQA, PagedAttention, and FP8 quantization, highlighting practical insights and benchmark results.
Introduction
The Llama 3 report serves as a foundation for understanding modern large language model (LLM) techniques, covering pre‑training, post‑training, inference, and a range of specific technologies such as Reward Modeling (RM), Direct Preference Optimization (DPO), KV‑Cache, Grouped Query Attention (GQA), and PagedAttention.
1. Training Stages
1.1 Pre‑training
Pre‑training predicts the next token on massive corpora. Llama 3 expands the data volume to roughly 15 trillion multilingual tokens, emphasizing data quality over sheer quantity.
1.2 Key Levers
Meta identifies three key levers for high‑quality foundation models: data, scale, and managing complexity.
Data: Emphasis on cleaning, de-duplication, and filtering of PII and adult content.
Scale: Model sizes of 8 B, 70 B, and 405 B are evaluated on benchmarks.
Complexity management: Llama 3 retains a dense Transformer architecture, avoiding Mixture-of-Experts (MoE), and uses a simplified post-training pipeline (SFT → Rejection Sampling → DPO) for stability.
1.3 Benchmark Performance
Llama 3’s models are evaluated on MMLU, IFEval, and other vertical benchmarks (code, math, reasoning, tool use, long‑context, multilingual). Notable observations include:
Significant gaps between 8 B and 70 B across most tasks.
Marginal differences between 70 B and 405 B, suggesting diminishing returns at the largest scale.
IFEval results show 8 B achieving 80.4 % accuracy, 70 B 87.5 %, and 405 B 88.6 %.
2. Pre‑training Details
2.1 Data Curation
Data processing includes:
PII and safety filtering: Removal of personally identifiable information and adult content.
Text extraction and cleaning: Parsing raw HTML and discarding Markdown markers.
De-duplication: URL-level, document-level (global MinHash; a minimal sketch appears at the end of this subsection), and line-level (removing lines that appear more than 6 times within each bucket of 30 M documents).
Heuristic filtering: n-gram repetition removal, toxic-word filtering, and KL-divergence-based outlier detection.
Model-based quality filtering: FastText and RoBERTa-based classifiers to label high-quality samples.
Specialized data: Code, reasoning, and multilingual subsets.
The final mix is roughly 50 % general knowledge, 25 % math/reasoning, 17 % code, and 8 % multilingual.
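To make the global MinHash step concrete, here is a minimal, self-contained sketch of near-duplicate detection; the shingle size, number of hash functions, and the idea of comparing the estimate against a threshold are illustrative choices rather than the exact values used for Llama 3.

import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Summarize a document by the minimum hash of its character shingles."""
    shingles = {text[i:i + shingle_size] for i in range(max(1, len(text) - shingle_size + 1))}
    signature = []
    for seed in range(num_hashes):
        min_val = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        )
        signature.append(min_val)
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
sim = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated Jaccard similarity: {sim:.2f}")  # a pipeline would compare this to a threshold

Production pipelines pair such signatures with locality-sensitive hashing so that candidate duplicates can be found without comparing every document pair.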
2.2 Model Architecture
Llama 3 retains a dense Transformer but introduces:
Grouped Query Attention (GQA) with 8 key-value heads to shrink the KV-Cache during decoding.
Extended context window up to 128 K tokens.
Rotary Position Embedding (RoPE) adjustments, including a larger base frequency to support the longer context.
The inference flow consists of tokenization, passing through L transformer blocks, linear projection (lm_head), softmax, and autoregressive sampling.
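This flow can be sketched with Hugging Face Transformers; the checkpoint name and greedy decoding are illustrative assumptions, and the loop below naively re-encodes the whole sequence every step, which is exactly the redundancy the KV-Cache in the next subsection removes.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is illustrative; any causal LM follows the same flow.
name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(16):                                   # generate 16 new tokens
        logits = model(input_ids).logits                  # L transformer blocks + lm_head
        probs = torch.softmax(logits[0, -1], dim=-1)      # distribution over the next token
        next_id = torch.argmax(probs)                     # greedy sampling for simplicity
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(input_ids[0]))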
2.3 KV‑Cache and GQA
During generation, intermediate key/value states are cached (KV‑Cache) to avoid recomputation. Prefill computes KV‑Cache for the entire prompt; decode reuses it for each new token. GQA groups queries so each group shares a KV pair, reducing memory usage.
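A minimal tensor-shape sketch of GQA: with 8 query heads sharing 2 cached key/value heads, each KV head is repeated across its query group before the usual attention product. The head counts and dimensions are illustrative, not Llama 3's actual configuration.

import torch

n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 64, 128
group = n_q_heads // n_kv_heads                       # 4 query heads per KV head

q = torch.randn(1, n_q_heads, seq, head_dim)
k = torch.randn(1, n_kv_heads, seq, head_dim)         # cached: 4x smaller than full MHA
v = torch.randn(1, n_kv_heads, seq, head_dim)

# Expand KV heads so each query group attends to its shared key/value pair.
k = k.repeat_interleave(group, dim=1)                 # (1, 8, seq, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v               # (1, 8, seq, head_dim)

Only the 2 KV heads are stored in the cache, so memory drops by the group factor relative to full multi-head attention.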
2.4 Scaling Laws
OpenAI’s scaling law relates compute (FLOPs), model parameters, and data tokens. Llama 3 refines this by predicting downstream task loss (NLL) from compute, then mapping loss to accuracy for specific benchmarks, achieving accurate predictions for tasks like ARC‑Challenge.
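A rough sketch of the two-stage procedure under assumed functional forms: a power law maps compute to downstream NLL, and a sigmoid maps NLL to accuracy. All numbers and the exact forms below are illustrative, not Meta's fitted coefficients.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements from small training runs.
flops = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
nll   = np.array([1.30, 1.18, 1.05, 0.96, 0.88])      # downstream negative log-likelihood
acc   = np.array([0.35, 0.42, 0.52, 0.60, 0.68])      # benchmark accuracy

# Stage 1: power law in compute predicts task NLL.
power_law = lambda c, a, b: a * c ** (-b)
(a, b), _ = curve_fit(power_law, flops, nll, p0=[10.0, 0.05])

# Stage 2: sigmoid maps NLL to accuracy.
sigmoid = lambda x, k, x0: 1.0 / (1.0 + np.exp(k * (x - x0)))
(k, x0), _ = curve_fit(sigmoid, nll, acc, p0=[5.0, 1.0])

target = 3.8e25                                        # e.g. a 405 B scale compute budget
print("predicted accuracy:", sigmoid(power_law(target, a, b), k, x0))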
2.5 Training Recipe
The three‑step pre‑training recipe:
Initial pre-training: AdamW optimizer, linear warm-up over 8,000 steps to the peak learning rate, then cosine decay over roughly 1.2 M steps (a sketch of this schedule follows this list). Batch size grows from 4 M tokens (sequence length 4,096) to 16 M tokens.
Long-context pre-training: Gradual increase of the context length from 8 K to 128 K tokens, using roughly 0.8 T tokens.
Annealing: A final 40 M tokens with the learning rate decayed linearly to zero, improving performance on GSM8K and MATH for the smaller models.
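A minimal sketch of the warm-up-plus-cosine schedule from the initial pre-training step; the peak learning rate is a placeholder, since only the warm-up and decay horizons are quoted above.

import math

def lr_at(step, peak_lr=1e-4, warmup=8_000, total=1_200_000):
    """Linear warm-up to peak_lr, then cosine decay toward zero."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

for s in (0, 8_000, 600_000, 1_200_000):
    print(s, f"{lr_at(s):.2e}")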
3. Post‑training (Alignment)
3.1 Reward Model (RM)
RM is trained on human preference data (A ≫ B ≫ C = D) to assign scalar scores reflecting safety, helpfulness, or other criteria. Preference data is collected by sampling responses from multiple models, ranking them, and optionally editing the chosen response.
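The RM objective can be sketched as a pairwise, Bradley-Terry style loss that pushes the chosen response's scalar score above the rejected one's; reward_model here is a stand-in for any network that maps a (prompt, response) pair to a scalar.

import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt, chosen, rejected):
    """Maximize the margin between the chosen and rejected scalar scores."""
    r_chosen = reward_model(prompt, chosen)       # scalar score per example, shape (batch,)
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The preference pairs themselves are stored in a simple format: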
# DPO data format
{
    'prompt': '',
    'chosen': '',
    'rejected': ''
}
3.2 Supervised Fine-tuning (SFT)
SFT uses standard cross‑entropy loss with prompt masking. Data sources include rejection‑sampled responses, synthetic task‑specific data, and a small set of human‑annotated examples.
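A minimal sketch of prompt masking in the Hugging Face convention, where the label -100 excludes prompt tokens from the cross-entropy; the tokenizer object is assumed to be loaded as in the earlier decoding sketch.

import torch

def build_sft_example(tokenizer, prompt, response):
    """Concatenate prompt + response; mask prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    response_ids = tokenizer(response, add_special_tokens=False).input_ids
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids   # -100 is ignored by cross-entropy
    return torch.tensor([input_ids]), torch.tensor([labels])

# loss = model(input_ids=input_ids, labels=labels).loss   # CE computed on response tokens only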
3.3 Rejection Sampling & PagedAttention
For each prompt, K (10‑30) responses are generated; the RM scores them, and the highest‑scoring response becomes SFT data. PagedAttention reduces memory waste by storing KV‑Cache in paged blocks, allowing shared computation for identical prompts across multiple responses.
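Rejection sampling itself reduces to a short best-of-K loop; generate and reward_model are stand-ins for the sampling and scoring calls, and K = 10 is one value from the 10-30 range above.

def rejection_sample(prompt, generate, reward_model, k=10):
    """Sample K candidate responses and keep the one the RM scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return {"prompt": prompt, "response": candidates[best]}   # becomes an SFT example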
3.4 Direct Preference Optimization (DPO)
DPO merges RM and SFT losses, bypassing separate RM training. It optimizes a policy model directly on chosen‑vs‑rejected pairs. Llama 3 masks formatting tokens in the loss to avoid conflicting gradients and adds an NLL term for the chosen response to preserve its probability.
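A condensed sketch of that combined objective; the inputs are summed log-probabilities of each response under the policy and the frozen reference model (with formatting tokens already masked out), and beta and the NLL weight are illustrative hyperparameters.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, nll_weight=0.2):
    """DPO pairwise term plus an NLL term that keeps the chosen response likely."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    pairwise = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
    nll = -policy_chosen_logp.mean()          # NLL term on the chosen response, as described above
    return pairwise + nll_weight * nll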
3.5 Data Processing & Quality Control
Quality pipelines include:
Topic classification using a fine‑tuned classifier.
Quality scoring via RM or Llama‑based prompts (accuracy, instruction‑following, tone).
Difficulty scoring (Instag or Llama‑based intent counts).
Semantic deduplication using RoBERTa embeddings and cosine-similarity thresholds (see the sketch after this list).
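A minimal sketch of the semantic-deduplication step with a greedy cosine threshold; the sentence-transformers encoder and the 0.95 threshold are illustrative stand-ins for the RoBERTa-based setup described above.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder choice

def semantic_dedup(examples, threshold=0.95):
    """Greedily keep an example only if it is not too similar to anything kept so far."""
    embs = encoder.encode(examples, normalize_embeddings=True)
    kept, kept_embs = [], []
    for text, emb in zip(examples, embs):
        if all(float(np.dot(emb, e)) < threshold for e in kept_embs):
            kept.append(text)
            kept_embs.append(emb)
    return kept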
4. Inference Optimizations
4.1 Parallelism
Both data parallelism (replicating the model across devices, synchronizing gradients) and model parallelism (tensor and pipeline parallelism) are employed. Llama 3 405 B uses two nodes with 8 × NVIDIA H100 GPUs each, leveraging NVLink for tensor parallelism and GPipe‑style pipeline parallelism across nodes.
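The core move in tensor parallelism, splitting a linear layer's output columns across devices, can be sketched without a distributed runtime; in practice each shard lives on a different GPU and the final concatenation is an all-gather over NVLink.

import numpy as np

def column_parallel_linear(x, weight, n_devices=8):
    """Split the weight's output columns across devices and gather the partial results."""
    shards = np.array_split(weight, n_devices, axis=1)      # one column block per device
    partial_outputs = [x @ w_shard for w_shard in shards]   # computed independently per device
    return np.concatenate(partial_outputs, axis=-1)         # the "all-gather" of outputs

x = np.random.randn(4, 1024)            # (batch, hidden)
w = np.random.randn(1024, 4096)         # full weight; never materialized on one device in practice
assert np.allclose(column_parallel_linear(x, w), x @ w)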
4.2 Quantization
FP8 quantization, supported natively on H100, is applied to most matrix multiplications (excluding self‑attention parameters and the first/last transformer layers). Row‑wise scaling factors are used. Experiments show up to 50 % throughput increase for 4 K input / 256 output token sequences, with negligible impact on reward‑model scores compared to BF16.
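Row-wise scaling can be sketched as choosing one scale per row so the row's largest magnitude lands at the edge of the FP8 (E4M3) dynamic range, roughly ±448; the sketch below only shows the bookkeeping and omits the actual 8-bit rounding performed by the fused H100 kernels.

import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite magnitude representable in E4M3

def quantize_rowwise(w):
    """One scale per row: the scaled row fits the FP8 range, the scale stays in higher precision."""
    scales = np.abs(w).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)                           # avoid division by zero
    w_fp8 = np.clip(w / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)     # values now in FP8 range
    return w_fp8, scales                                          # matmul uses w_fp8, then rescales

w = np.random.randn(4, 8).astype(np.float32)
w_fp8, scales = quantize_rowwise(w)
print(np.max(np.abs(w - w_fp8 * scales)))   # dequantization error (zero here, before FP8 rounding)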
5. Conclusion
The author spent three weekends compiling ~20 k words, acknowledging gaps and promising future refinements. Readers are invited to point out errors or suggest additions.
References
IFEval Dataset – https://paperswithcode.com/dataset/ifeval
LiveBench – https://livebench.ai/
KV‑Cache Optimization notes – https://zhuanlan.zhihu.com/p/697311739
Deep Learning with PyTorch notes – https://zhuanlan.zhihu.com/p/664880302