Deep Dive into Llama 2: Architecture, Pre‑training, SFT, and Safety Insights

This article provides a comprehensive technical overview of Meta's Llama 2 series, covering its architectural upgrades such as Grouped-Query Attention, the pre-training dataset and hyper-parameters, loss behavior, benchmark comparisons, and the supervised fine-tuning pipeline with safety considerations.


1. Introduction

Llama 2 is Meta's second-generation family of open-source large language models, trained at four sizes (7 B, 13 B, 34 B, and 70 B; the 34 B variant was not publicly released). Across public leaderboards it consistently outperforms the original LLaMA models.

2. Model Architecture

2.1 Core Components

Tokenizer: The same SentencePiece BPE tokenizer as LLaMA 1, with a 32 k vocabulary; numbers are split into individual digits, and characters outside the vocabulary fall back to raw UTF-8 byte tokens.
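As a quick illustration (a sketch assuming access to the gated Hugging Face checkpoint `meta-llama/Llama-2-7b-hf`; the exact token strings may differ slightly):

```python
from transformers import AutoTokenizer

# Requires accepting the Llama 2 license on the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Digits are split individually: each character of "2023" appears as its own token.
print(tok.tokenize("Llama 2 was released in 2023"))
```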

Pre‑normalization: RMSNorm is applied to the input of each Transformer sub‑layer (pre‑norm) to improve training stability.
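A minimal PyTorch sketch of RMSNorm (rescaling by the root mean square of the activations, with a learned scale and no mean subtraction or bias); this follows the published formula rather than Meta's exact code:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale activations by their root mean square; no centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)
```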

SwiGLU activation: Replaces ReLU in the feed-forward network; the hidden dimension is reduced from 4d to (2/3)·4d so the gated variant keeps roughly the same parameter count as a standard MLP, and in practice it gives smoother optimization and faster convergence.
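A sketch of the gated feed-forward block this describes (layer names and dimension rounding are assumptions, not the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = W2( SiLU(x W1) * (x W3) ), with hidden size ~ (2/3)*4d."""
    def __init__(self, dim: int):
        super().__init__()
        hidden = int(2 * (4 * dim) / 3)               # (2/3)*4d keeps parameters comparable to a 4d MLP
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```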

Rotary Position Embeddings (RoPE): Encodes position by rotating query/key vectors rather than adding absolute position embeddings, injecting relative-position information directly into attention and generalizing better to longer sequences.
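A compact sketch of applying rotary embeddings to a query or key tensor (channel pairing and broadcasting conventions vary across implementations; this is one reasonable choice, not Meta's exact code):

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each pair of channels by a position-dependent angle.
    x: (batch, seq_len, n_heads, head_dim) with an even head_dim."""
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]          # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]           # adjacent channel pairs
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                    # back to (batch, seq, heads, head_dim)
```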

2.2 Grouped-Query Attention (GQA)

Compared with LLaMA 1, Llama 2 doubles the context window to 4 096 tokens; in addition, the larger models (34 B and 70 B) adopt GQA, which shares key/value heads (and therefore KV-cache entries) among groups of query heads. This reduces inference-time memory consumption while preserving quality close to full multi-head attention.

GQA sits between standard Multi‑Head Attention (MHA) and Multi‑Query Attention (MQA); it offers MQA‑level speed with minimal quality loss.
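A minimal sketch of the grouping (causal masking omitted for brevity; setting `n_kv_heads` equal to the number of query heads recovers MHA, and `n_kv_heads = 1` recovers MQA):

```python
import torch

def grouped_query_attention(q, k, v, n_kv_heads: int):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    _, n_q_heads, _, head_dim = q.shape
    group = n_q_heads // n_kv_heads
    # Each K/V head serves `group` query heads, so the KV cache is `group` times
    # smaller than with full multi-head attention.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    return scores @ v
```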

3. Pre‑training

3.1 Data

The corpus mixes publicly available sources and explicitly excludes data from Meta's products and services; an effort was also made to remove personally identifiable information.

Training is performed on roughly 2 trillion tokens, a trade‑off between performance and cost.

Fact‑heavy sources are oversampled to improve knowledge retention and reduce hallucinations.
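As a rough illustration of up-sampling at the data-mixing level (the source names and weights below are hypothetical placeholders, not the actual Llama 2 mixture, which is not published):

```python
import random

# Hypothetical per-source weights: "factual" sources get a larger weight than their
# natural share of the corpus, so their documents are sampled more often.
SOURCE_WEIGHTS = {"web": 0.55, "encyclopedic": 0.25, "books": 0.12, "code": 0.08}

def next_source(rng: random.Random) -> str:
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([next_source(rng) for _ in range(5)])
```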

3.2 Hyper‑parameters and Training

Optimizer: AdamW with β₁ = 0.9, β₂ = 0.95.

Learning‑rate schedule: cosine decay, 2 000 warm‑up steps, final decay to 10 % of the peak learning rate.

Weight decay: 0.1.

Gradient clipping: 1.0.
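Putting the schedule together as code (a sketch; `total_steps` is an assumed placeholder, since the actual step count depends on model size and is not restated here):

```python
import math

def learning_rate(step: int, peak_lr: float, warmup_steps: int = 2000,
                  total_steps: int = 500_000, min_ratio: float = 0.1) -> float:
    """Linear warm-up for `warmup_steps`, then cosine decay to 10% of the peak."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```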

The training loss curve shows no saturation after processing the full 2 T tokens, indicating continued learning capacity.

3.3 Evaluation

Compared with other open-source models, Llama 2-7B and Llama 2-34B outperform the similarly sized MPT-7B and MPT-30B on all benchmark categories except code, and they beat Falcon-7B and Falcon-40B across all categories; the 70 B variant surpasses every other open-source model. Against closed-source baselines, Llama 2-70B is competitive on many benchmarks, although a gap to the strongest proprietary models remains.

4. Supervised Fine‑tuning (SFT)

4.1 Data

Initial SFT used publicly available instruction‑tuning data, but quality and diversity were insufficient for conversational LLMs.

High-quality SFT data were therefore collected through vendor annotation; tens of thousands of clean samples yielded better results than far larger but lower-quality public datasets.

Annotation quality varied noticeably across vendors and platforms, so a held-out sample of 180 examples was reviewed manually to compare model outputs against the human annotations.

4.2 Training Details

Hyper-parameters: cosine learning-rate schedule starting at 2e-5, weight decay 0.1, batch size 64, sequence length 4 096.

To fill the sequence length, prompts and answers from the training set are concatenated, with a special token separating each prompt from its answer:

prompt <sep> answer <eos> prompt <sep> answer

Loss is computed only on answer tokens; the loss on prompt tokens is zeroed out.
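A sketch of this masking with a standard next-token cross-entropy (labels on prompt positions are set to the ignore index so they contribute zero loss; tensor shapes and the helper name are assumptions):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, tokens: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab); tokens, prompt_mask: (batch, seq).
    prompt_mask is True on prompt positions, whose loss is zeroed out."""
    labels = tokens.masked_fill(prompt_mask, -100)      # -100 is ignored by cross_entropy
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),    # position t predicts token t+1
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```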

The model is fine‑tuned for 2 epochs.

5. Llama 2‑Chat Training Flow

Llama 2‑Chat is derived from the base Llama 2 model via supervised fine‑tuning followed by reinforcement learning from human feedback (RLHF). RLHF uses rejection sampling and proximal policy optimization (PPO) to iteratively improve the model.
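A toy sketch of the rejection-sampling step (the `generate` and `reward_model` callables are placeholders standing in for the chat model and the learned reward model):

```python
def rejection_sample(prompt: str, generate, reward_model, k: int = 4) -> str:
    """Draw k candidate answers, score them with the reward model, keep the best.
    The winning answer can then serve as a target for further fine-tuning."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda answer: reward_model(prompt, answer))
```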
