Inside Llama 4: How Meta’s New Multimodal MoE Models Achieve 10M‑Token Contexts

Meta unveils Llama 4 Scout, Maverick, and the upcoming Behemoth, detailing their Mixture‑of‑Experts architecture, a context window of up to 10 million tokens, efficient FP8 training, safety mechanisms, and competitive benchmark results against leading multimodal models.

Baobao Algorithm Notes

Model Overview

Llama 4 introduces two open‑weight multimodal Mixture‑of‑Experts (MoE) models; Scout fits on a single NVIDIA H100 GPU, while Maverick runs on a single H100 host:

Scout : 17 billion active parameters, 16 experts, 109 billion total parameters, 10 million token context window.

Maverick : 17 billion active parameters, 128 routed experts plus a shared expert, 400 billion total parameters, 1 million token context window.

Both models use alternating dense and MoE layers, activating only a subset of parameters per token, which reduces inference latency and cost while preserving quality.
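To make the routing idea concrete, here is a minimal PyTorch sketch of a Maverick‑style MoE layer: a shared expert that sees every token plus a router that sends each token to a single routed expert. The class names, dimensions, and top‑1 routing choice are illustrative assumptions, not Meta's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Shared expert + top-1 routed experts (all sizes are illustrative)."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=128):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = ffn()                        # always active for every token
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                          # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        top_w, top_idx = gate.max(dim=-1)          # route each token to one expert
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():                         # only the chosen experts do any work
                routed[mask] = top_w[mask, None] * expert(x[mask])
        return self.shared(x) + routed
```

Because only one routed expert (plus the shared one) fires per token, the compute per token tracks the active‑parameter count rather than the much larger total.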

Pre‑training Architecture and Data

The backbone is a unified multimodal transformer that fuses text, image and video tokens early in the network. The vision encoder is based on MetaCLIP and is trained separately in conjunction with a frozen Llama model, adapting the encoder to the LLM and improving cross‑modal alignment.
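Early fusion simply means image (and video) patch embeddings are projected into the same embedding space as text tokens and fed through one transformer. The sketch below assumes hypothetical dimensions and helper names; it only illustrates the wiring.

```python
import torch
import torch.nn as nn

d_model = 1024
text_embed = nn.Embedding(32_000, d_model)       # text token embedding table (illustrative size)
vision_proj = nn.Linear(768, d_model)            # projects patch features into the text space

def fuse(text_ids, patch_features):
    # text_ids: (seq_len,) token ids; patch_features: (n_patches, 768) from the vision encoder
    text_tokens = text_embed(text_ids)            # (seq_len, d_model)
    image_tokens = vision_proj(patch_features)    # (n_patches, d_model)
    # One interleaved sequence: the transformer backbone attends over both modalities jointly.
    return torch.cat([image_tokens, text_tokens], dim=0)
```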

Training employed the new MetaP technique to scale learning‑rate and initialization hyper‑parameters reliably across batch sizes, model widths, depths and token counts. The dataset comprised over 30 trillion tokens in 200 languages (more than 100 languages with >1 billion tokens each), mixing text, image and video data.

Precision: FP8 was used throughout pre‑training without quality loss, achieving a per‑GPU throughput of 390 TFLOPs on a 32K‑GPU run. Mid‑training extended the context length to 10 million tokens and introduced interleaved attention layers without positional embeddings, an architecture called iRoPE, plus temperature‑scaled attention at inference time to improve length generalisation.
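The exact iRoPE recipe is not spelled out in the article, but the inference‑time temperature idea can be sketched: attention logits for positions far beyond the training context are scaled up so the softmax does not flatten out at extreme lengths. The log‑based schedule, the 8K training context, and the alpha value below are assumptions for illustration only.

```python
import math
import torch
import torch.nn.functional as F

def temp_scaled_attention(q, k, v, train_ctx=8192, alpha=0.1):
    # q, k, v: (seq_len, d_head). Positions past train_ctx get sharper (scaled-up) logits,
    # counteracting the attention-entropy growth that hurts very long contexts.
    seq_len, d_head = q.shape
    pos = torch.arange(1, seq_len + 1, dtype=q.dtype)
    temp = 1.0 + alpha * torch.log(torch.clamp(pos / train_ctx, min=1.0))  # stays 1.0 inside train_ctx
    scores = (q * temp[:, None]) @ k.T / math.sqrt(d_head)
    causal = torch.ones(seq_len, seq_len).triu(1).bool()                   # standard causal mask
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```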

Post‑training Pipeline

Fine‑tuning follows a three‑stage pipeline:

Lightweight supervised fine‑tuning (SFT) : only the hardest 50 % of data are kept after filtering out easy examples.

Online reinforcement learning (RL) : a dynamic data filter retains medium‑to‑hard prompts, and an asynchronous RL framework iteratively trains and re‑filters prompts.

Lightweight direct preference optimisation (DPO) : applied after RL to address edge‑case response quality.

This strategy avoids over‑constraining the model and yields strong reasoning, coding and multimodal performance.
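A rough sketch of the difficulty‑driven curation behind the three stages above: drop easy examples before SFT, then keep re‑scoring the RL prompt pool so training only sees medium‑to‑hard prompts. The helper functions and the difficulty thresholds are hypothetical; the article only states the filtering principle.

```python
def curate_sft(examples, difficulty_fn, keep_fraction=0.5):
    # Keep roughly the hardest half of the SFT set; easy examples are dropped.
    ranked = sorted(examples, key=difficulty_fn, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

def online_rl_loop(model, prompt_pool, score_difficulty, rl_step, rounds=10):
    for _ in range(rounds):
        # Re-filter each round: prompts that have become easy for the current model are dropped,
        # keeping the pool in a medium-to-hard band (the 0.3-0.9 thresholds are illustrative).
        active = [p for p in prompt_pool if 0.3 <= score_difficulty(model, p) <= 0.9]
        model = rl_step(model, active)            # in practice an asynchronous RL update
    return model
```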

2‑Trillion‑Parameter Behemoth Teacher

Llama 4 Behemoth is an MoE model with 288 billion active parameters, 16 experts, and roughly 2 trillion total parameters. It serves as a teacher for distilling the smaller models. A novel distillation loss combines dynamically weighted soft and hard targets, and the cost of the teacher's forward passes is amortised by precomputing distillation targets for most of the training data.
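A minimal sketch of such a mixed distillation loss: a temperature‑smoothed KL term against the teacher's soft targets plus a cross‑entropy term against the hard labels. The specific dynamic‑weighting rule shown (lean on the teacher where it is confident) is an assumption; the article only says the weights are dynamic.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, tau=2.0):
    # student_logits, teacher_logits: (n_tokens, vocab); labels: (n_tokens,)
    # Soft-target term: KL between temperature-smoothed teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Dynamic weight (illustrative): trust the teacher more where it is confident.
    w = F.softmax(teacher_logits, dim=-1).max(dim=-1).values.mean().detach()
    return w * soft + (1.0 - w) * hard
```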

Training the Behemoth required pruning 95 % of SFT data and a custom RL infrastructure that parallelises MoE across GPUs, delivering roughly a ten‑fold efficiency gain over previous generations.

Safety and Protection Mechanisms

System‑level safeguards are integrated at every stage:

Llama Guard : input/output safety model based on the MLCommons harm taxonomy (a usage sketch follows this list).

Prompt Guard : classifier trained on large attack corpora to detect jailbreak and prompt‑injection attempts.

CyberSecEval : evaluation suite for generative AI cybersecurity risks.

GOAT (Generative Offensive Attack Tester) : automated red‑team framework that simulates multi‑turn adversarial interactions to surface vulnerabilities.
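As a concrete illustration of the Llama Guard step, the sketch below runs a user prompt through a publicly released Llama Guard checkpoint with Hugging Face transformers. The model id and the expectation that the checkpoint's chat template renders a moderation prompt are assumptions; consult the model card for exact usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"          # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

chat = [{"role": "user", "content": "How do I make a dangerous chemical at home?"}]
inputs = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=32)
# The guard model replies with a verdict such as "safe" or "unsafe" plus harm-category codes.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```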

Bias mitigation results show that Llama 4 declines controversial political/social queries in less than 2 % of cases (down from 7 % for Llama 3.3) and exhibits a more balanced refusal distribution (<1 % imbalance), approaching the behaviour of competing models such as Grok.

Availability

The Scout and Maverick checkpoints are downloadable from https://llama.com and the Hugging Face model hub. They can also be accessed via Meta AI integrations on WhatsApp, Messenger, Instagram Direct and the Meta.AI website.
