Why LLaMA 3 405B Matches GPT‑4o: Architecture, Training, and Industry Impact

This article provides an in‑depth analysis of LLaMA 3 405B, covering its dense Transformer architecture, three‑stage pre‑training (initial, long‑context, annealing), iterative post‑training with RM‑guided rejection sampling, the decision against MoE, and the broader implications for both large and small model development.

Baobao Algorithm Notes

LLaMA 3 Model Architecture

LLaMA 3 follows the now‑standard dense Transformer layout, with a feed‑forward network (FFN) based on a single SwiGLU module. Most contemporary LLMs share this structure; Mixture‑of‑Experts (MoE) variants simply replicate the SwiGLU block K times and add a routing sub‑network, making MoE a Transformer variant rather than a fundamentally new architecture.
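To make the relationship concrete, here is a minimal NumPy sketch of a SwiGLU FFN and the MoE variant that replicates it K times behind a router. This is an illustration of the structure, not Meta's implementation; all function names, shapes, and the top‑k routing details are assumptions.

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: the SiLU-activated gate branch is multiplied
    # elementwise with the "up" branch, then projected back down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_ffn(x, experts, w_router, top_k=2):
    # MoE variant: K copies of the SwiGLU block plus a routing
    # sub-network that picks top-k experts per token and mixes
    # their outputs by the (renormalized) router probabilities.
    logits = x @ w_router                      # (tokens, K)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]    # chosen expert indices
        weight = probs[t, top] / probs[t, top].sum()
        for w, k in zip(weight, top):
            out[t] += w * swiglu_ffn(x[t:t+1], *experts[k])[0]
    return out
```

The only structural difference between the two functions is the router and the loop over experts, which is the article's point: MoE is a Transformer variant, not a new architecture.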

LLaMA 3 Pre‑training Process

The pre‑training consists of three phases:

Initial pre‑training: conventional language‑model training.

Long‑context pre‑training: extends the context window to up to 128K tokens, using roughly 800B tokens of long‑text data.

Annealing: during the final 40M tokens, the learning rate linearly decays to zero while the 128K context length is maintained; the data mixture is adjusted to up‑weight high‑quality math, code, and logic examples, and the average of several checkpoints is taken as the final model.
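The two mechanical pieces of annealing, linear learning‑rate decay and checkpoint averaging, can be sketched as follows. This is a toy illustration, not Meta's training code; the function names and the plain elementwise mean are assumptions.

```python
import numpy as np

def annealed_lr(step, anneal_steps, peak_lr):
    # Learning rate decays linearly from peak_lr to zero
    # over the annealing window.
    return peak_lr * max(0.0, 1.0 - step / anneal_steps)

def average_checkpoints(checkpoints):
    # Final model = elementwise mean of parameters across several
    # checkpoints saved during annealing (each a name -> array dict).
    return {name: np.mean([c[name] for c in checkpoints], axis=0)
            for name in checkpoints[0]}
```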

LLaMA 3 Post‑Training Pipeline

Post‑training (see Figure 2) proceeds in iterative rounds:

Train a reward model (RM) on human‑annotated <Prompt, Answer> pairs.

Use the RM for rejection sampling: generate multiple answers for each prompt, score them with the RM, and keep the highest‑scoring answer as supervised fine‑tuning (SFT) data.

Combine the filtered SFT data with additional data that emphasises code, math, and logic, then fine‑tune the model.

Apply Direct Preference Optimisation (DPO) on triples <Prompt, Good Answer, Bad Answer> to further align the model toward good answers.
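The per‑example DPO objective on such triples can be written in a simplified form as below. This is a sketch of the standard DPO loss, not LLaMA 3's exact code; the beta value and the log‑probability bookkeeping are assumptions.

```python
import math

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    # DPO pushes the policy's log-prob margin between the good and
    # bad answer above the frozen reference model's margin.
    # Each logp is the sum of token log-probs over the answer.
    margin = (logp_good - ref_logp_good) - (logp_bad - ref_logp_bad)
    # Loss = -log sigmoid(beta * margin): near zero when the policy
    # clearly prefers the good answer, log(2) when indifferent.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```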

Each round repeats the same steps, but the LLM used for generating candidates is the best DPO model from the previous round, creating a positive feedback loop that steadily improves answer quality.
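Steps 1–2 of each round amount to a best‑of‑n loop; a minimal sketch follows, where `generate` and `reward` stand in for the candidate‑generating LLM and the reward model (both names are placeholders, not real APIs).

```python
def rejection_sample(prompts, generate, reward, n_samples=8):
    # For each prompt, sample several candidate answers, score them
    # with the reward model, and keep only the best one as SFT data.
    sft_data = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda ans: reward(prompt, ans))
        sft_data.append({"prompt": prompt, "answer": best})
    return sft_data
```

In the iterative scheme the article describes, `generate` would be backed by the best DPO model from the previous round, which is what closes the feedback loop.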

Why LLaMA 3 405B Does Not Use MoE

Empirical studies before the ChatGPT era showed that MoE does not inherently improve model quality; its main benefit is lower training and inference cost, at the price of less stable training and higher memory usage. For a 405B dense model, stability and performance were deemed more important, so Meta chose a dense architecture.

Impact of LLaMA 3 405B

The open‑source release narrows the gap between open and closed models, as illustrated by a performance‑vs‑time curve where the two lines intersect after LLaMA 3 405B. This forces closed‑source providers to justify pricing and pushes open‑source developers to differentiate through specialised features or localisation.

Three Drivers for Small‑Model Advancement

Recent progress in sub‑billion‑parameter models relies on three key techniques:

More and higher‑quality pre‑training data: scaling data quantity while preserving quality improves performance, as seen in models like Pythia and LLaMA 1.

Model distillation: a large “teacher” model provides full token‑level probability distributions to a smaller “student” model, enabling the student to learn richer information than simple next‑token prediction.

Annealing data: up‑sampling high‑quality math, logic, and code data in the final training stage (both pre‑training and post‑training) yields disproportionate gains for smaller models.
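Token‑level distillation can be sketched as a temperature‑softened KL divergence between the teacher's and the student's vocabulary distributions. This is an illustrative NumPy version of the standard technique; the temperature value is an assumption, not a documented LLaMA 3 setting.

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over the full vocabulary, softened by a
    # temperature. Unlike one-hot next-token targets, every token's
    # probability carries signal, which is the "richer information"
    # the student learns from.
    def log_softmax(z):
        z = z / temperature
        z = z - z.max(-1, keepdims=True)
        return z - np.log(np.exp(z).sum(-1, keepdims=True))
    t = np.exp(log_softmax(teacher_logits))
    return float((t * (np.log(t) - log_softmax(student_logits))).sum(-1).mean())
```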

Synthetic Data in the Practical Era

During post‑training, synthetic data now dominates SFT creation; a large portion of LLaMA 3’s SFT data is model‑generated. Similar trends appear in other models (e.g., Gemma 2). Synthetic data is often “half‑synthetic,” combining human prompts with model‑generated answers, effectively acting as a form of knowledge distillation.

Key Factors Driving Large‑Model Performance Gains

Beyond scaling model and data size, two trends dominate:

Increasing emphasis on data quality, with sophisticated filtering pipelines.

Rising proportion of mathematics, logic, and code data, both in the later stages of pre‑training (via up‑sampling) and in post‑training, which substantially boosts reasoning abilities.
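Up‑sampling a domain at a given training stage is, mechanically, just re‑weighting the sampling mixture across corpora. A toy sketch, where the dataset names and weights are made up for illustration:

```python
import random

def sample_mixture(datasets, weights, n_examples, seed=0):
    # Draw a training stream from several corpora with explicit mixing
    # weights -- e.g. up-weighting math/logic/code late in training.
    # `datasets` maps corpus name -> list of examples; `weights` maps
    # corpus name -> relative sampling weight.
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    return [rng.choice(datasets[rng.choices(names, probs)[0]])
            for _ in range(n_examples)]
```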

As generic data becomes saturated, the next performance leap will likely come from synthetic generation of high‑quality math, logic, and code data during post‑training.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: pre‑training, model distillation, model architecture, synthetic data, 405B, post‑training
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.