What Powers Meta’s Llama 3 405B? Inside the Architecture, Scaling Laws, and Massive Training Infrastructure

This article dissects Meta’s Llama 3 405‑billion‑parameter model, covering its dense Transformer design, data‑mixing strategy, two‑stage scaling‑law prediction, 4‑D parallelism, custom hardware clusters, training schedules, post‑training alignment methods, and the extensive evaluation results that benchmark it against leading LLMs.

Baobao Algorithm Notes

Introduction

Meta released a technical report on Llama 3, its newest family of large language models. The flagship model is a dense Transformer with 405 billion parameters, a 128 K‑token context window, and a roughly 128 K‑token vocabulary. The report provides a thorough analysis of the model’s architecture, data pipeline, scaling‑law methodology, training infrastructure, and evaluation results.

1. Pre‑training Data and Mixing

The pre‑training corpus was curated from web data up to 2023, with aggressive cleaning and de‑duplication:

PII removal, adult‑content filtering, and language‑identification for 176 languages.

Multi‑level deduplication (URL, document, line) using MinHash and line‑level heuristics.

Heuristic and model‑based quality filters, including fastText language models, RoBERTa‑based classifiers, and custom token‑distribution checks.

Data was sampled with fine‑grained tags, resulting in an approximate mix of 50 % general knowledge, 25 % mathematical and reasoning data, 17 % code, and 8 % multilingual content. Additional pipelines extracted high‑quality code and reasoning data, and a multilingual tokenizer added 28 K extra tokens to improve non‑English performance.
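As a rough illustration, the reported mix can be reproduced by weighted sampling over tagged sources. This is a minimal sketch under assumptions: the source names and the simple weighted sampler are illustrative, not Meta’s actual pipeline.

```python
import random

# Approximate pre-training mix reported for Llama 3 (fractions of tokens).
MIX = {"general": 0.50, "math_reasoning": 0.25, "code": 0.17, "multilingual": 0.08}

def sample_source(rng: random.Random) -> str:
    """Pick a data source in proportion to the target mix."""
    return rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]

rng = random.Random(0)
counts = {k: 0 for k in MIX}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
# Empirical fractions land close to the target mix, e.g. ~0.50 for "general".
```

In practice the report describes tuning these proportions empirically via scaling-law experiments on small models, not fixing them a priori.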

2. Model Architecture

Llama 3 uses a standard dense Transformer with several modest modifications:

Grouped‑Query Attention (GQA) with 8 kv‑heads per layer.

Attention masks to prevent cross‑document interference in long‑context training.

128 K context length supported by RoPE theta = 500 000.
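The GQA arrangement can be illustrated with the 405 B configuration’s head counts: each key/value head serves a fixed group of query heads, shrinking the KV cache 16× versus full multi‑head attention. This sketch shows only the head‑index mapping, not the attention kernel itself.

```python
N_QUERY_HEADS = 128   # 405B configuration
N_KV_HEADS = 8        # GQA: 8 shared key/value heads per layer

def kv_head_for(query_head: int) -> int:
    """Map a query head to the K/V head it attends with under GQA."""
    group_size = N_QUERY_HEADS // N_KV_HEADS  # 16 query heads per K/V head
    return query_head // group_size

# Query heads 0-15 share K/V head 0, heads 16-31 share K/V head 1, and so on.
```

Because only 8 of the 128 heads need cached keys and values, the inference‑time KV cache is one sixteenth the size of the multi‑head equivalent — the main motivation for GQA at this scale.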

Key hyper‑parameters:

Layers: 126 (405 B), 80 (70 B), 32 (8 B).

Model dimension: 16 384 (405 B).

FFN dimension: 53 248 (405 B).

Attention heads: 128 (405 B), 32 (8 B/70 B).

Peak learning rate: 8 × 10⁻⁵ (405 B).

Activation: SwiGLU.
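A back‑of‑the‑envelope check confirms these hyper‑parameters land on roughly 405 B parameters. The sketch assumes a ~128 K‑entry vocabulary, untied input and output embeddings, the 53 248 SwiGLU hidden dimension reported for the 405 B model, and ignores the (negligible) normalization weights.

```python
# Back-of-the-envelope parameter count from the published 405B hyper-parameters.
VOCAB = 128_256          # ~128K-token vocabulary
D_MODEL = 16_384
D_FFN = 53_248           # SwiGLU hidden dimension
N_LAYERS = 126
N_HEADS, N_KV_HEADS = 128, 8
HEAD_DIM = D_MODEL // N_HEADS                    # 128

attn = (D_MODEL * D_MODEL) * 2                   # Wq and Wo
attn += (D_MODEL * N_KV_HEADS * HEAD_DIM) * 2    # Wk and Wv (GQA: only 8 heads)
ffn = 3 * D_MODEL * D_FFN                        # gate, up, down projections in SwiGLU
per_layer = attn + ffn
total = N_LAYERS * per_layer + 2 * VOCAB * D_MODEL  # + embeddings and output head

print(f"{total / 1e9:.1f}B parameters")          # → 405.8B parameters
```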

3. Scaling‑Law Methodology

Traditional scaling laws predict next‑token loss but not downstream benchmark performance. Meta introduced a two‑step approach:

Correlate negative log‑likelihood on downstream tasks with training FLOPs to obtain a task‑specific scaling curve.

Use the curve to map FLOPs‑optimal loss to benchmark accuracy, leveraging older Llama 2 models as anchors.

This method was validated on the ARC Challenge benchmark, showing accurate extrapolation across four orders of magnitude of compute.

For compute‑optimal training, the fitted power law for the optimal number of training tokens is N*(C) = A·C^α with α = 0.53 and A = 0.29. Extrapolating to a budget of 3.8 × 10²⁵ FLOPs suggests training a 402 B‑parameter model on ~16.55 T tokens, justifying the 405 B flagship.
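Plugging the rounded coefficients into the power law shows how the token budget falls out. Note the quoted α and A are rounded, so this quick evaluation only lands in the same order of magnitude as the report’s unrounded fit (16.55 T tokens).

```python
# Evaluate the compute-optimal token count N*(C) = A * C**alpha using the
# rounded coefficients quoted above.  The unrounded fit in the report
# yields 16.55T tokens; the rounded one lands in the same order of magnitude.
A, ALPHA = 0.29, 0.53
C = 3.8e25  # training budget in FLOPs

optimal_tokens = A * C ** ALPHA      # on the order of 10**13 tokens

# Cross-checking with the common C ≈ 6·N·D approximation and the reported
# 16.55T tokens recovers a model size close to the 402B quoted above.
params_est = C / (6 * 16.55e12)      # ≈ 3.8e11 parameters
```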

4. Training Infrastructure and Efficiency

Training was performed on Meta’s production clusters rather than the AI‑Research SuperCluster:

GPU resources: up to 16 K H100 GPUs (80 GB HBM3, 700 W TDP) connected via NVLink.

Storage: 240 PB distributed file system (Tectonic) across 7 500 SSD‑equipped servers, delivering 2 TB/s sustained and up to 7 TB/s peak throughput.

Network: RoCE over Arista 7800 switches, 400 Gb/s per port, three‑layer Clos topology (24 K GPUs total).

Network optimizations included:

Creating 16 parallel streams per GPU pair for better load balancing.

Enhanced ECMP (E‑ECMP), which hashes additional packet fields to distribute traffic more evenly across equal‑cost paths.

Deep‑buffer switches on the spine to absorb transient congestion from collective communication.
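The effect of hashing extra fields can be sketched with a toy hash‑based path picker. This is purely illustrative: the flow fields, the hash function, and the uplink count are assumptions, not the switch’s actual forwarding pipeline.

```python
import hashlib

def pick_uplink(flow: tuple, n_uplinks: int) -> int:
    """Deterministically hash a flow descriptor onto one of n equal-cost uplinks."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_uplinks

# Plain ECMP hashes only the 5-tuple, so one elephant flow between a GPU
# pair pins all of its traffic to a single uplink.  Hashing an extra
# per-stream field (here a hypothetical queue-pair id, one per the 16
# parallel streams) spreads the same GPU pair's traffic across uplinks.
five_tuple = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")
links_used = {pick_uplink(five_tuple + (qp,), 8) for qp in range(16)}
```

With a single 5‑tuple hash, `links_used` would contain exactly one uplink; adding the per‑stream field makes several uplinks carry the pair’s traffic, which is the load‑balancing win E‑ECMP targets.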

Parallelism combined four techniques (4‑D parallelism): Tensor Parallelism (TP), Context Parallelism (CP), Pipeline Parallelism (PP), and Data Parallelism (DP). Specific improvements:

Removed one Transformer layer from each of the first and last pipeline stages, offsetting the extra cost of the embedding on the first stage and the output projection plus loss on the last.

Made the number of micro‑batches per batch adjustable, rather than requiring the batch size to be divisible by the number of pipeline stages.

Implemented asynchronous point‑to‑point communication and disabled unnecessary streams to lower latency.
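How the four dimensions compose can be sketched as a rank decomposition, with TP innermost (mapped onto the highest‑bandwidth NVLink domain) and DP outermost. The group sizes below are illustrative choices that multiply out to 16 K GPUs, not the exact production layout.

```python
# Decompose a flat GPU rank into 4-D parallel coordinates.
# Sizes are illustrative: TP=8, CP=2, PP=16, DP=64 -> 16,384 GPUs.
TP, CP, PP, DP = 8, 2, 16, 64

def coords(rank: int) -> dict:
    """Return the (tp, cp, pp, dp) coordinates of a global rank."""
    return {
        "tp": rank % TP,
        "cp": (rank // TP) % CP,
        "pp": (rank // (TP * CP)) % PP,
        "dp": rank // (TP * CP * PP),
    }

assert TP * CP * PP * DP == 16_384
```

Placing TP innermost means the most communication‑hungry dimension stays inside a single NVLink‑connected host, while DP gradient traffic, which tolerates latency best, crosses the RoCE fabric.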

Memory‑efficient strategies:

Used Fully‑Sharded Data Parallel (FSDP) for optimizer state and gradient sharding, but avoided re‑sharding model weights during the forward pass.

Released tensors that would not be used in upcoming steps to reduce peak memory.
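A rough per‑parameter accounting shows why sharding gradients and optimizer state matters at this scale. The numbers are illustrative: bf16 compute with fp32 Adam state, a hypothetical DP width of 128, and activations ignored.

```python
# Bytes per parameter in mixed-precision Adam training:
BF16_WEIGHT, BF16_GRAD = 2, 2
FP32_MASTER, FP32_M1, FP32_M2 = 4, 4, 4
PER_PARAM = BF16_WEIGHT + BF16_GRAD + FP32_MASTER + FP32_M1 + FP32_M2  # 16 bytes

N, DP = 405e9, 128  # parameter count; DP width is illustrative

naive_tib = N * PER_PARAM / 2**40                 # everything replicated per GPU
sharded_tib = N * (BF16_WEIGHT + (PER_PARAM - BF16_WEIGHT) / DP) / 2**40

# Sharding gradients and optimizer state drops the per-GPU bill from
# ~5.9 TiB to ~0.8 TiB; tensor and pipeline parallelism then split the
# remaining weights further to fit within 80 GB of HBM.
```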

5. Training Scheme

The 405 B model was trained in three phases:

Initial pre‑training : Cosine schedule with peak LR = 8 × 10⁻⁵, 8 K warm‑up steps, then decay to 8 × 10⁻⁷ over 1.2 M steps. Batch size grew from 4 M tokens (4 096‑token sequences) to 8 M and finally 16 M tokens (8 192‑token sequences), with the last increase coming after ~2.87 T tokens.

Long‑context pre‑training : Incrementally increased context from 8 K to 128 K tokens over six stages, training ~800 B tokens at the longest length.

Annealing phase : Final 40 M tokens trained with a linear LR decay to zero while keeping the 128 K context.

Data mix was continuously adjusted, increasing non‑English and mathematical data in later stages to boost multilingual and reasoning capabilities.
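The first‑phase schedule above can be sketched as a simple function of the step count. The peak, floor, warm‑up length, and total steps are from the report; the linear warm‑up shape is an assumption.

```python
import math

PEAK, FLOOR = 8e-5, 8e-7       # peak and final learning rates
WARMUP, TOTAL = 8_000, 1_200_000

def lr_at(step: int) -> float:
    """Linear warm-up to the peak, then cosine decay to the floor."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return FLOOR + 0.5 * (PEAK - FLOOR) * (1 + math.cos(math.pi * progress))

# lr_at(8_000) hits the 8e-5 peak; lr_at(1_200_000) lands on the 8e-7 floor.
```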

6. Post‑training (Alignment)

Alignment consisted of several stages:

Reward modeling (RM) : Trained on human‑annotated preference data, including a third “edited” response tier (edited > chosen > rejected).

Supervised fine‑tuning (SFT) : Used rejection‑sampled responses and synthetic data with a standard cross‑entropy loss (learning rate 10⁻⁵, 8.5 K–9 K steps).

Direct Preference Optimization (DPO) : Applied a contrastive preference loss with β = 0.1 and masked out formatting and special tokens in the loss to curb tail repetition. DPO was preferred over PPO for its efficiency and better performance on instruction‑following benchmarks.

Model averaging : Combined checkpoints from RM, SFT, and DPO stages to improve stability.

Iterative rounds : Six alignment cycles, each collecting new preference data and synthetic samples.
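The DPO objective used in these rounds can be sketched on a single preference pair. This is a minimal scalar version under assumptions: real training sums token log‑probabilities over (masked) sequences and averages over a batch.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on the summed log-probs of one preference pair.

    pi_* / ref_* are log-probabilities under the policy being tuned and
    the frozen reference (SFT) model; beta = 0.1 matches the report.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Loss falls as the policy raises the chosen response's likelihood relative
# to the rejected one (both measured against the reference model).
better = dpo_loss(-10.0, -14.0, -12.0, -12.0)  # policy separates the pair
worse = dpo_loss(-12.0, -12.0, -12.0, -12.0)   # policy identical to reference
assert better < worse
```

At a zero margin the loss is exactly log 2, so any movement of the policy toward the chosen response (relative to the reference) pushes the loss below that baseline.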

7. Evaluation Results

Llama 3 was evaluated on a wide range of benchmarks:

Standard tasks: commonsense reasoning, knowledge, reading comprehension, math, code, long‑context, and adversarial tests.

Robustness: MMLU multiple‑choice variations (order, format, label changes).

Adversarial suites: Adversarial SQuAD, Dynabench SQuAD, GSM‑Plus, PAWS.

Contamination analysis following Singh et al. (2024) to estimate benchmark leakage.

Across most metrics, Llama 3 matched or approached GPT‑4‑level performance, demonstrating the effectiveness of the scaling‑law‑guided model size and the extensive data‑mixing strategy.

8. Reliability and Operational Challenges

During a 54‑day training run, the job experienced 466 interruptions while still achieving over 90 % effective training time. The majority (≈59 %) were GPU‑related failures; the rest were network or host issues. Meta mitigated downtime with automated firmware upgrades, fast diagnostics using the PyTorch NCCL flight recorder, and custom tools to isolate slow‑running GPUs.

Environmental factors such as data‑center temperature caused 1–2 % throughput fluctuations, highlighting the need for power‑aware scheduling in future larger‑scale runs.

Conclusion

The Llama 3 405 B model showcases how careful data curation, a two‑step scaling‑law prediction, and a highly optimized 4‑D parallel training stack can produce a dense LLM that rivals the strongest proprietary systems while remaining openly available. The report also provides a valuable reference for building future trillion‑parameter models.

Tags: Distributed Training · AI Infrastructure · Model Architecture · Llama 3
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
