Llama 2: Open Foundation and Fine‑Tuned Chat Models – Overview and Technical Details
This article surveys Meta's Llama 2 series: model sizes, pre‑training data, architectural changes, supervised fine‑tuning, RLHF, reward‑model training, safety evaluation, and iterative improvement, along with the open release of the models and their comparative performance.
Introduction
Llama 2 is a family of autoregressive transformer‑based large language models (LLMs) released by Meta, ranging from 7 B to 70 B parameters. Llama 2‑Chat is a version fine‑tuned specifically for dialogue; it outperforms many open‑source chat models on benchmark tests and underwent dedicated helpfulness and safety evaluations.
Model Versions
Llama 2: three publicly released sizes (7 B, 13 B, 70 B) trained on a larger pre‑training corpus, with grouped‑query attention in the larger variants.
Llama 2‑Chat: dialogue‑optimized variants of the same three sizes.
Pre‑training
Data
The pre‑training corpus (~2 trillion tokens) is assembled from publicly available sources such as CommonCrawl, C4, GitHub, Wikipedia, Books, arXiv, and StackExchange, with personal‑information‑rich sites removed.
Training Details
The architecture mirrors Llama 1: RMSNorm pre‑normalization, SwiGLU activation, and rotary positional embeddings (RoPE). The main changes are a context length doubled to 4,096 tokens and grouped‑query attention in the larger variants. Training uses the AdamW optimizer, a cosine learning‑rate schedule with a 2,000‑step warm‑up, and gradient clipping, and runs on Meta's Research Super Cluster (RSC) and an internal production cluster, both equipped with NVIDIA A100 GPUs.
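The schedule described above (linear warm‑up followed by cosine decay) can be sketched as follows. The total step count and the decay floor of 10 % of the peak learning rate are taken from the paper's description of pre‑training; the exact `total_steps` value here is illustrative.

```python
import math

def lr_at_step(step, max_lr, warmup_steps=2000, total_steps=500_000, min_ratio=0.1):
    """Cosine learning-rate schedule with linear warm-up.

    Linearly ramps from 0 to max_lr over warmup_steps, then follows a
    cosine curve down to min_ratio * max_lr at total_steps.
    total_steps is illustrative, not a value from the paper.
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

At the end of warm‑up the schedule returns the peak rate; at `total_steps` it has decayed to one tenth of it.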
Tokenizer
The same BPE tokenizer as Llama 1 is employed, with a vocabulary size of 32 k.
Fine‑tuning
Supervised Fine‑Tuning (SFT)
High‑quality SFT data (27,540 examples) were curated, emphasizing helpful and safe instruction‑response pairs. Training uses a cosine learning‑rate schedule (peak 2×10⁻⁵), weight decay 0.1, batch size 64, and a sequence length of 4,096 tokens; prompts and responses are concatenated, separated by a special token, with the loss zeroed out on prompt tokens.
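The prompt/response packing for SFT can be sketched as follows. Token ids and the separator/end‑of‑sequence ids are illustrative; the key point is that prompt positions receive an ignore label so the loss only covers the response.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_sft_example(prompt_ids, response_ids, sep_id, eos_id):
    """Concatenate a prompt and response, separated by a special token,
    and build labels that mask the prompt so only response tokens
    contribute to the training loss. A sketch with illustrative ids."""
    input_ids = prompt_ids + [sep_id] + response_ids + [eos_id]
    labels = [IGNORE_INDEX] * (len(prompt_ids) + 1) + response_ids + [eos_id]
    return input_ids, labels
```

With a 3‑token prompt and 2‑token response, the first four label positions (prompt plus separator) carry `IGNORE_INDEX`, while the response and end token keep their ids.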
Reinforcement Learning with Human Feedback (RLHF)
Human preference data were collected as binary comparisons of model outputs, annotated separately for helpfulness and safety. Two reward models (one for helpfulness, one for safety) were trained using a binary ranking loss with a margin term reflecting how strongly one response was preferred. The newly collected preference data were mixed with existing open‑source preference datasets.
Training of Reward Models
Reward models are fine‑tuned for one epoch using AdamW, cosine learning‑rate decay, and a batch size of 512 preference pairs (1,024 rows) per step. Separate models handle helpfulness and safety to avoid trade‑off conflicts between the two objectives.
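The binary ranking loss with a margin can be written out directly. The reward model scores the chosen and rejected responses; the margin grows with the annotated preference strength (a clearly better response gets a larger margin). The specific scores below are illustrative.

```python
import math

def ranking_loss(score_chosen, score_rejected, margin=0.0):
    """Binary ranking loss with a margin term:
        L = -log(sigmoid(r_chosen - r_rejected - margin)).
    The preferred response must beat the rejected one by at least
    `margin` before the loss becomes small; margin is scaled by how
    strongly annotators preferred the chosen response."""
    logit = score_chosen - score_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-logit)))
```

When the two scores are equal the loss is log 2, and increasing the margin raises the loss for the same score gap, pushing clearly‑preferred responses further apart.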
Iterative RLHF
Multiple RLHF iterations (v1–v5) were performed, alternating between rejection‑sampling fine‑tuning and Proximal Policy Optimization (PPO). The largest (70 B) model generated candidate responses, which were scored, filtered, and used to fine‑tune the smaller models.
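The rejection‑sampling step reduces to best‑of‑K selection: draw K candidates and keep the one the reward model scores highest, then use the kept samples as fine‑tuning targets. In this sketch, `generate` and `reward` are hypothetical stand‑ins for the policy model and the trained reward model.

```python
def rejection_sample(prompt, generate, reward, k=4):
    """Best-of-K rejection sampling: sample k candidate responses for a
    prompt and keep the one with the highest reward-model score.
    `generate` and `reward` are placeholders for the policy model and
    reward model, not real APIs."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: reward(prompt, response))
```

In the paper's setup K varies across iterations; higher K raises the expected best reward at the cost of more generation compute.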
Proximal Policy Optimization (PPO)
PPO updates follow the standard formulation, incorporating a KL‑penalty term to stabilize training and mitigate reward hacking.
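The KL‑penalized reward can be sketched as follows: the reward‑model score is reduced by a penalty proportional to how far the current policy has drifted from the initial (reference) model on the generated tokens. The `beta` value is illustrative, and the log‑probabilities are assumed to be summed over the generation.

```python
def ppo_reward(rm_score, logprob_policy, logprob_ref, beta=0.01):
    """Final reward used during PPO: the reward-model score minus a
    KL penalty that keeps the policy close to the reference model and
    discourages reward hacking. beta is an illustrative coefficient;
    log-probs are sums over the generated tokens."""
    kl_estimate = logprob_policy - logprob_ref  # per-sample KL estimate
    return rm_score - beta * kl_estimate
```

When the policy matches the reference model the penalty vanishes; as the policy concentrates on high‑reward outputs the penalty grows, trading reward against drift.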
Evaluation and Results
Extensive benchmark tables compare Llama 2 and Llama 2‑Chat against open‑source and closed‑source models, showing competitive or superior performance. The reward models outperform baselines, including GPT‑4, on internal helpfulness and safety test sets.
Appendix
Model Download Links
Llama‑2‑7b‑hf: https://huggingface.co/meta-llama/Llama-2-7b-hf
Llama‑2‑13b‑hf: https://huggingface.co/meta-llama/Llama-2-13b-hf
Llama‑2‑70b‑hf: https://huggingface.co/meta-llama/Llama-2-70b-hf
… (additional variants and chat models)
Glossary
Red Teaming – adversarial testing of model vulnerabilities.
PPO – Proximal Policy Optimization, a policy‑gradient RL algorithm.
RMSNorm – Root‑Mean‑Square Layer Normalization.
Cosine Learning Rate Decay – schedule that gradually reduces the learning rate following a cosine curve.
Ghost Attention (GAtt) – technique to preserve instruction adherence over multiple dialogue turns.