Evolution, Architecture, Training Data, Methods, and Performance of Meta's Llama Series (Llama 1, 2, 3)
Meta's Llama series has progressed from the 7B–65B Llama‑1 models of early 2023 to the 8B and 70B Llama‑3 models of 2024, scaling pre‑training from roughly 1 T to over 15 T tokens. Along the way it has standardized on a decoder‑only Transformer with RMSNorm, SwiGLU, RoPE and GQA, and layered on supervised fine‑tuning, RLHF and DPO, yielding state‑of‑the‑art benchmark results among open models and a vibrant open‑source ecosystem.
Introduction
Large language models are advancing at an unprecedented pace. On April 19, 2024 Meta announced Llama‑3, the third generation of its open Llama family, which surpasses existing state‑of‑the‑art open models on many benchmarks.
1. Evolution History
Llama‑1 (Feb 2023) : released in 7B, 13B, 33B and 65B parameter variants, trained on 1–1.4 T tokens of publicly available corpora; the 65B model took ~21 days on 2,048 A100 80 GB GPUs. It quickly became a cornerstone of the open‑source LLM community.
Llama‑2 (Jul 2023) : a free‑for‑commercial‑use series trained at 7B, 13B, 34B and 70B (the 34B model was not publicly released). Context length doubled to 4 K tokens, Grouped‑Query Attention (GQA) was introduced for the larger variants, and a chat fine‑tuned line‑up (Llama‑2‑Chat) was added. The code‑generation model Code Llama followed shortly after.
Llama‑3 (Apr 2024) : launched with 8B and 70B models; a 400B version is still in training. Supports 8 K context, uses a 128 K‑token vocabulary (tiktoken), and is trained on >15 T tokens—roughly seven times the data used for Llama‑2.
2. Model Architecture
All Llama models adopt a decoder‑only Transformer architecture similar to GPT. Key architectural tweaks include:
RMSNorm for layer normalization.
SwiGLU activation function.
Rotary Position Embedding (RoPE) for positional encoding.
Grouped‑Query Attention (GQA) in the larger Llama‑2 variants and all Llama‑3 models, trading a small quality cost for much cheaper inference.
Llama‑1/2 use a SentencePiece BPE tokenizer with a 32 K vocabulary; Llama‑3 switches to a tiktoken‑based tokenizer with a 128 K vocabulary.
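As a minimal sketch of three of the components above (toy dimensions, pure‑Python lists rather than tensors; weight matrices are passed as lists of columns purely for illustration):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the reciprocal root-mean-square; unlike LayerNorm
    # there is no mean subtraction or bias term, which saves compute.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def silu(v):
    # SiLU (a.k.a. swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def swiglu(x, W, V):
    # SwiGLU feed-forward gate: SiLU(x @ W) elementwise-multiplied by (x @ V).
    xW = [sum(xi * wij for xi, wij in zip(x, col)) for col in W]
    xV = [sum(xi * vij for xi, vij in zip(x, col)) for col in V]
    return [silu(a) * b for a, b in zip(xW, xV)]

def rope(x, pos, base=10000.0):
    # RoPE: rotate each consecutive pair (x[2i], x[2i+1]) by an angle that
    # grows with the token position, so relative offsets fall out of dot products.
    d, out = len(x), []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out
```

In the real models these operate on large tensors per attention head; the arithmetic, however, is exactly this.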
3. Training Data
Llama‑1 : ~1.4 T tokens drawn from CommonCrawl, C4, GitHub (Apache/BSD/MIT‑licensed repos), Wikipedia (20 languages), Gutenberg, ThePile‑Books3, ArXiv LaTeX sources, and Stack‑Exchange. Data were heavily filtered, deduplicated, and quality‑checked.
Llama‑2 : ~2 T tokens from publicly available sources (exact mix undisclosed). Additional supervised fine‑tuning used 27,540 instruction‑response pairs, and RLHF used ~1.4 M human preference comparisons. Safety‑oriented filtering and privacy reviews were applied.
Llama‑3 : ~15 T tokens, with a four‑fold increase in code data and >5 % non‑English tokens covering 30+ languages. A multi‑stage filtering pipeline (heuristics, NSFW filter, semantic deduplication, quality classifier) ensures high‑quality data.
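Meta has not published the Llama‑3 pipeline's internals, but the heuristic‑filter → dedup → quality‑classifier flow described above can be illustrated as follows (function names, thresholds, and the use of an exact hash instead of semantic deduplication are all simplifying assumptions):

```python
import hashlib

def filter_corpus(docs, min_len=50, quality_fn=None):
    # Stage 1: cheap heuristics (here just a length floor).
    # Stage 2: duplicate removal (exact SHA-256 here; the real pipeline
    #          also does semantic dedup over embeddings).
    # Stage 3: an optional model-based quality classifier.
    seen, kept = set(), []
    for doc in docs:
        if len(doc) < min_len:
            continue  # heuristic filter
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h in seen:
            continue  # duplicate
        seen.add(h)
        if quality_fn is not None and quality_fn(doc) < 0.5:
            continue  # low quality score
        kept.append(doc)
    return kept
```

The Llama papers also note that earlier Llama generations were used to generate training data for the quality classifiers, which `quality_fn` stands in for here.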
4. Training Methods
Llama‑1 : pure self‑supervised pre‑training using AdamW, cosine learning‑rate schedule, 0.1 weight decay, gradient clipping, and extensive system‑level optimizations (xformers causal attention, checkpointing, model/sequence parallelism).
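The Llama‑1 paper describes a cosine schedule with linear warmup (2,000 steps) decaying to 10 % of the peak learning rate; a minimal sketch, with everything beyond those two published numbers being a generic implementation:

```python
import math

def cosine_lr(step, max_steps, peak_lr, warmup_steps=2000, min_ratio=0.1):
    # Linear warmup to peak_lr, then cosine decay down to min_ratio * peak_lr.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

Paired with AdamW, weight decay 0.1, and gradient clipping, this is the full optimizer recipe the paper reports.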
Llama‑2 : same pre‑training backbone plus supervised fine‑tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) using PPO and rejection sampling.
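The rejection‑sampling step in Llama‑2's fine‑tuning is best‑of‑k selection: sample several candidate responses and keep the one the reward model scores highest. A sketch, where `generate` and `reward` are hypothetical stand‑ins for the policy and reward models:

```python
def rejection_sample(prompt, generate, reward, k=4):
    # Best-of-k: draw k candidates, keep the highest-reward one.
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=reward)

# Toy stand-ins (purely illustrative): a deterministic "generator" cycling
# through canned answers, and a "reward model" that just measures length.
options = ["ok", "good", "great"]
calls = iter(range(100))
toy_generate = lambda p: p + " " + options[next(calls) % 3]
toy_reward = len
best = rejection_sample("hello", toy_generate, toy_reward, k=4)
```

In Llama‑2's actual pipeline the selected responses are then used as new SFT targets, alongside PPO updates against the reward model.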
Llama‑3 : follows scaling‑law‑driven data selection, trains on up to 15 T tokens, and employs data, model, and pipeline parallelism across 16 K GPUs (≈400 TFLOPS per GPU). The training stack includes automatic error detection, robust checkpointing, and a high‑throughput storage system, achieving more than 95 % effective training time. Fine‑tuning combines SFT, RLHF (PPO), and Direct Preference Optimization (DPO).
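DPO replaces the separate reward model of RLHF with a direct loss over preference pairs: it pushes the policy's log‑probability margin between a preferred and a rejected response beyond the reference model's margin. The per‑pair loss, given scalar sequence log‑probabilities, is small enough to write out exactly:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO per-pair loss: -log sigmoid(beta * implicit reward margin),
    # where the margin compares the policy's preferred-vs-rejected
    # log-ratio against the frozen reference model's.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When policy and reference agree (margin 0) the loss is log 2; as the policy widens the margin on the preferred response, the loss falls toward zero.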
5. Performance Comparison
Official benchmarks show Llama‑2 consistently outperforms Llama‑1 and other open‑source LLMs. Llama‑3 8B surpasses Mistral‑7B and Gemma‑7B; Llama‑3 70B exceeds Claude‑3 Sonnet and is comparable to Gemini Pro 1.5. The upcoming 400B model is reported to approach GPT‑4‑Turbo/Claude‑3‑Opus performance.
6. Community Impact
The open‑source nature of Llama has acted as an “Android” for LLMs, lowering entry barriers and fostering a vibrant ecosystem of derivatives, tools, and research. It has enabled startups, academia, and large enterprises to build AI solutions without relying on proprietary APIs, improving data‑security and cost‑control. Adoption spans cloud platforms (AWS, GCP), edge devices, and numerous vertical applications.
7. Conclusion
The Llama series demonstrates how open‑source large language models can drive rapid technical progress, broaden AI accessibility, and shape future research directions. Continued improvements in model scale, multilingual capability, and safety mechanisms suggest that Llama will remain a central pillar of the global AI landscape.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.