What Makes Gemma 2 a Competitive Open‑Source LLM? Architecture, Training, and Evaluation Insights
The article provides a detailed technical overview of Gemma 2, covering its decoder-only transformer design, interleaved local and global attention, logit soft-capping, RMSNorm, knowledge-distillation training on trillions of tokens, extensive pre-training infrastructure, and benchmark evaluations that demonstrate its competitiveness against larger proprietary models.
Introduction
Large language models (LLMs) have demonstrated strong capabilities in language understanding, generation, and reasoning. Gemma 2 investigates richer training objectives—particularly knowledge distillation—to improve the performance of small‑scale models without simply increasing token count.
Model Architecture
Gemma 2 follows a decoder‑only Transformer architecture with the following design choices:
Context length: 8192 tokens with RoPE positional embeddings.
Activation: GeGLU.
Attention pattern: layers alternate between local sliding-window attention (4096-token window) and global attention over the full 8192-token context (see the mask sketch after this list).
Logit soft-capping: attention logits in every layer are soft-capped at 50.0 and the final output logits at 30.0 using a tanh-based cap rather than hard clipping, which improves training stability (see the soft-capping sketch below).
Normalization: RMSNorm applied both as pre‑norm and post‑norm.
Grouped Query Attention (GQA) with num_groups=2 for faster inference.
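To make the alternating attention pattern concrete, here is a minimal sketch (not the official implementation) of the two causal mask types that alternate between layers; the 4096-token window and 8192-token context come from the report, while the helper names and the shrunken sizes are illustrative.

```python
# A minimal sketch of the two causal masks that alternate between layers:
# a local sliding-window mask and a full global causal mask. Sizes are shrunk
# for readability; True means "this query may attend to this key".
import jax.numpy as jnp

def causal_mask(seq_len):
    # Standard causal mask: each position attends to itself and earlier positions.
    return jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window):
    # Local variant: position i attends only to positions in [i - window + 1, i].
    pos = jnp.arange(seq_len)
    dist = pos[:, None] - pos[None, :]
    return (dist >= 0) & (dist < window)

seq_len, window = 8, 4                               # stand-ins for 8192 and 4096
global_mask = causal_mask(seq_len)                   # used in the "global" layers
local_mask = sliding_window_mask(seq_len, window)    # used in the "local" layers
print(global_mask.astype(int))
print(local_mask.astype(int))
```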
Released model sizes are 2.6 B, 9 B, and 27 B parameters. The 27 B model uses 46 layers, a hidden size of 4608, and 32 attention heads (128‑dimensional head size). All variants share a 256 k token vocabulary and tied embeddings.
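The soft-capping mentioned above is a single tanh transform. Below is a minimal sketch; the 50.0 and 30.0 cap values are taken from the report, while the function name and toy logits are purely illustrative.

```python
# A minimal sketch of tanh-based logit soft-capping: values are smoothly bounded
# to (-cap, cap) instead of being hard-clipped.
import jax.numpy as jnp

def soft_cap(logits, cap):
    # Smoothly squashes logits into the open interval (-cap, cap).
    return cap * jnp.tanh(logits / cap)

attn_logits = jnp.array([-120.0, -10.0, 0.0, 10.0, 120.0])
print(soft_cap(attn_logits, 50.0))   # attention logits, capped at +/-50
print(soft_cap(attn_logits, 30.0))   # final output logits, capped at +/-30
```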
Pre‑training
Training data consists of primarily English corpora totaling 13 trillion tokens for the 27 B model, 8 trillion for the 9 B model, and 2 trillion for the 2.6 B model. Sources include web documents, source code, and scientific articles. The same SentencePiece tokenizer as Gemma 1 is used (256 k vocabulary).
Data filtering follows the Gemma 1 pipeline to remove unsafe content, personally identifiable information, and evaluation data to reduce memorization.
Knowledge distillation is employed for the smaller models: a large teacher model provides token-level probability distributions, and the student is trained to minimize the negative log-likelihood of its predictions under the teacher's distribution, i.e., the cross-entropy between teacher and student outputs. To keep storage feasible, only a sampled subset of the 256 k-vocabulary probabilities is stored.
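A minimal sketch of that objective, assuming a toy four-token vocabulary; in practice the distributions range over the full 256 k-token vocabulary described above.

```python
# A minimal sketch of the distillation objective: the student minimizes the
# cross-entropy (negative log-likelihood) of its predictions against the
# teacher's token-level distribution. Shapes and logits are toy values.
import jax
import jax.numpy as jnp

def distillation_loss(teacher_logits, student_logits):
    # loss = - sum_x P_teacher(x | context) * log P_student(x | context)
    teacher_probs = jax.nn.softmax(teacher_logits, axis=-1)
    student_log_probs = jax.nn.log_softmax(student_logits, axis=-1)
    return -jnp.sum(teacher_probs * student_log_probs, axis=-1).mean()

teacher_logits = jnp.array([[2.0, 0.5, -1.0, 0.0]])   # one position, 4-token toy vocab
student_logits = jnp.array([[1.0, 1.0, -0.5, 0.2]])
print(distillation_loss(teacher_logits, student_logits))
```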
Training infrastructure uses TPUv4, TPUv5e, and TPUv5p pods. The 2.6 B model is trained on 512 TPUv5e chips (2 × 16 × 16 topology), the 9 B model on 4096 TPUv4 chips (8 × 16 × 32), and the 27 B model on 6144 TPUv5p chips (8 × 24 × 32). Optimizer state is sharded with ZeRO-3-like techniques, and data-parallel scaling beyond a single pod uses the Pathways approach.
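As a rough illustration of ZeRO-3-style sharding, the JAX sketch below places a toy parameter and its Adam-style optimizer moments on a named device mesh so that each chip holds only a slice; the mesh axis name, parameter shape, and optimizer state are assumptions for illustration, not DeepMind's actual training stack.

```python
# A rough ZeRO-3-style sketch in JAX: parameters and optimizer state are sharded
# across a device mesh so no single chip holds a full copy. Axis names, shapes,
# and the Adam-style moments are illustrative, not the actual Gemma 2 setup.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())                # e.g. thousands of TPU chips in practice; CPU devices locally
mesh = Mesh(devices.reshape(-1), axis_names=("data",))

shard = NamedSharding(mesh, P("data", None))     # split rows across the "data" axis
param = jax.device_put(jnp.zeros((4096, 4608)), shard)    # toy weight matrix
adam_m = jax.device_put(jnp.zeros_like(param), shard)     # first moment, sharded identically
adam_v = jax.device_put(jnp.zeros_like(param), shard)     # second moment, sharded identically
print(param.sharding)
```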
Evaluation
Gemma 2 is evaluated on a suite of automated benchmarks and human‑rated tasks covering question answering, commonsense reasoning, mathematics, science, and programming. The 9 B and 27 B models achieve state‑of‑the‑art performance among open models of comparable size and remain competitive with proprietary models up to twice as large.
The 27 B model, trained on 13 T tokens without distillation, outperforms similarly sized open models and approaches the performance of larger models such as LLaMA-3 70 B.
Distilled 2.6 B and 9 B models improve by up to 10 % on certain benchmarks compared with equivalent models trained from scratch, confirming the effectiveness of distillation even with equal token budgets.
Human evaluations on the LMSYS Chatbot Arena show that the Gemma 2 9 B and 27 B models set new state-of-the-art scores among open-weight models.
Safety and Deployment
Although extensively tested, the models may exhibit unforeseen behavior in specific applications. Users are advised to conduct rigorous safety testing tailored to their use cases before deployment.
References
Technical report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf
Blog post: https://blog.google/technology/developers/google-gemma-2/