Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities

Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models. All three use Mixture‑of‑Experts architectures, Scout offers a 10‑million‑token context window, and the family reports strong results across vision, language, and multimodal benchmarks, with an emphasis on efficient training, openly downloadable weights, and safety safeguards.

21CTO

Apr 7, 2025

Background

The Llama team announced the first batch of Llama 4 models, designed to enable highly personalized multimodal experiences.

Llama 4 Scout has 17 billion active parameters and 16 experts (109 billion total parameters). It fits on a single H100 GPU, offers a 10‑million‑token context window, and outperforms Gemma 3, Gemini 2.0 Flash‑Lite, and Mistral 3.1 on a broad range of benchmarks.

Llama 4 Maverick also has 17 billion active parameters but uses 128 experts. It beats GPT‑4o and Gemini 2.0 Flash on a broad range of benchmarks and is comparable to DeepSeek v3 on reasoning and coding with less than half the active parameters, while offering a best‑in‑class performance‑to‑cost ratio. Its experimental chat version scored an ELO of 1417 on LMArena.

Both models are open‑source and can be downloaded from llama.com or Hugging Face.

Pre‑training

Llama 4 adopts a Mixture‑of‑Experts (MoE) architecture, activating only a subset of parameters per token, which improves training and inference efficiency under a fixed FLOPs budget.
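
The routing idea can be illustrated with a toy sketch in plain Python. This is not Meta's implementation: the softmax gate, the `top_k` value, the scalar "experts", and every function name here are assumptions chosen only to show how a router activates a subset of experts per token while a shared expert always runs.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=1):
    """Return (expert_index, gate_weight) pairs for the top_k experts.

    Gate weights are renormalized over the chosen experts (an assumption).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, experts, shared_expert, router_logits, top_k=1):
    """Shared expert always runs; only the top_k routed experts are evaluated."""
    out = shared_expert(x)
    for idx, gate in route_token(router_logits, top_k):
        out += gate * experts[idx](x)
    return out

# Toy setup: 4 routed "experts" that just scale their input (hypothetical).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: x
logits = [0.1, 2.0, -1.0, 0.5]  # router scores for one token

y = moe_layer(3.0, experts, shared, logits, top_k=1)
print(y)  # shared(3.0) + 1.0 * experts[1](3.0)
```

Only one of the four routed experts is evaluated for this token, which is where the compute saving comes from: the parameter count grows with the number of experts, but per-token FLOPs depend only on the experts actually selected.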

Llama 4 Maverick has 17 billion active parameters and 400 billion total parameters. It alternates dense and MoE layers, with 128 routed experts plus a shared expert, and can be deployed on a single H100 host or served with distributed inference.
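
The deployment claims can be sanity‑checked with back‑of‑the‑envelope arithmetic on weight memory. The sketch below assumes 80 GB of memory per H100, 8 GPUs per host, 4‑bit weights for Scout, and 8‑bit weights for Maverick. These precisions and hardware figures are assumptions for illustration, and the estimate ignores activations and the KV cache.

```python
def weight_memory_gb(total_params_billion, bits_per_param):
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = total_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

H100_MEMORY_GB = 80   # assumed per-GPU memory
GPUS_PER_HOST = 8     # assumed GPUs in one host

scout_gb = weight_memory_gb(109, bits_per_param=4)      # Scout, 109B total params
maverick_gb = weight_memory_gb(400, bits_per_param=8)   # Maverick, 400B total params

print(f"Scout weights:    {scout_gb:.1f} GB (one GPU has {H100_MEMORY_GB} GB)")
print(f"Maverick weights: {maverick_gb:.1f} GB "
      f"(one host has {H100_MEMORY_GB * GPUS_PER_HOST} GB)")

scout_fits_one_gpu = scout_gb < H100_MEMORY_GB
maverick_fits_one_host = maverick_gb < H100_MEMORY_GB * GPUS_PER_HOST
```

Note that memory scales with total parameters, not active parameters: every expert must be resident even though only a few are used per token.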

The models use early fusion to integrate text and vision tokens into a single model backbone. The vision encoder is based on MetaCLIP but was trained separately, in conjunction with a frozen Llama model, to better adapt it to the LLM.

A new training technique, MetaP, reliably sets key hyper‑parameters (per‑layer learning rates, initialization scales) that transfer well across batch sizes, model widths, depths, and token counts. Llama 4 is pre‑trained on data covering 200 languages, with an overall training mixture of more than 30 trillion tokens. Training uses FP8 precision, and pre‑training Behemoth reached 390 TFLOPs per GPU.

During mid‑training, the team extends the context length to 256K tokens. A “teacher” model, Llama 4 Behemoth (288 billion active parameters, about 2 trillion total parameters), is used to distill knowledge into the smaller models.

Post‑training

Llama 4 offers multiple model sizes to suit diverse use‑cases. Llama 4 Maverick excels in image‑text understanding, long‑context summarization, and code reasoning, while Llama 4 Scout provides industry‑leading 10‑million‑token context for multi‑document tasks.

The post‑training pipeline runs lightweight supervised fine‑tuning (SFT) → online reinforcement learning (RL) → lightweight direct preference optimization (DPO). Because SFT and DPO can over‑constrain the model and limit exploration during RL, the team removes more than 50% of the data tagged as easy and focuses RL on harder prompts, which improves reasoning and coding performance.
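
The data‑pruning step can be sketched as a simple filter. The article does not say how prompts are scored, so the `difficulty` field, the 0.5 threshold, and the function name below are all hypothetical stand‑ins for whatever tagging the team actually used.

```python
def prune_easy_prompts(prompts, easy_threshold=0.5):
    """Keep only prompts whose (hypothetical) difficulty score meets the threshold."""
    return [p for p in prompts if p["difficulty"] >= easy_threshold]

# Toy dataset with made-up difficulty scores in [0, 1].
prompts = [
    {"text": "What is 2 + 2?", "difficulty": 0.05},
    {"text": "Reverse a string in Python.", "difficulty": 0.20},
    {"text": "Capital of France?", "difficulty": 0.02},
    {"text": "Prove the AM-GM inequality for n terms.", "difficulty": 0.80},
    {"text": "Find the race condition in this lock-free queue.", "difficulty": 0.90},
]

hard_prompts = prune_easy_prompts(prompts)
removed_fraction = 1 - len(hard_prompts) / len(prompts)
print(f"kept {len(hard_prompts)} of {len(prompts)} prompts "
      f"({removed_fraction:.0%} removed)")
```

In this toy set three of five prompts are dropped, so the remaining SFT data is dominated by examples the model is more likely to get wrong.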

Dynamically filtering out prompts with zero advantage, and building training batches that mix prompts from multiple capabilities, were key to the gains in math, reasoning, and coding.
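
A zero‑advantage prompt is one where every sampled response earns the same reward, so the policy gradient carries no signal. The sketch below assumes a group‑mean baseline (advantage = reward minus the mean reward of the responses sampled for that prompt). The article does not specify how advantages are computed, so this baseline and the function names are assumptions.

```python
from statistics import mean

def advantages(rewards):
    """Advantage of each response relative to the group-mean baseline (assumed)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

def filter_zero_advantage(batch, eps=1e-8):
    """Drop prompts whose sampled responses all have (near-)zero advantage.

    batch maps each prompt to the list of rewards of its sampled responses.
    """
    kept = {}
    for prompt, rewards in batch.items():
        adv = advantages(rewards)
        if any(abs(a) > eps for a in adv):
            kept[prompt] = adv
    return kept

# Toy rewards (1.0 = correct, 0.0 = incorrect) for four samples per prompt.
batch = {
    "always solved": [1.0, 1.0, 1.0, 1.0],
    "never solved": [0.0, 0.0, 0.0, 0.0],
    "sometimes solved": [1.0, 0.0, 0.0, 1.0],
}

kept = filter_zero_advantage(batch)
print(kept)
```

Both the prompt the model always solves and the one it never solves are dropped. Only the prompt with mixed outcomes contributes a learning signal under this baseline.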

Scaling to the 2T‑parameter Behemoth

Llama 4 Behemoth serves as a high‑intelligence teacher model with 288 billion active parameters and about 2 trillion total parameters. It achieves state‑of‑the‑art results on STEM benchmarks and guides the distillation of the smaller models.

The team introduced a new distillation loss that combines dynamically weighted soft targets with hard targets, allowing efficient knowledge transfer.
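
One way to read "dynamically weighted soft and hard targets" is a per‑token loss that blends cross‑entropy against the teacher's distribution with cross‑entropy against the ground‑truth token, with a weight that changes over training. The exact loss is not given in the article, so the linear schedule, its endpoints, and the function names below are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def distillation_loss(student_logits, teacher_logits, label, soft_weight):
    """soft_weight * CE(teacher probs, student) + (1 - soft_weight) * CE(label, student)."""
    s_logp = log_softmax(student_logits)
    t_prob = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    soft = -sum(tp * sl for tp, sl in zip(t_prob, s_logp))  # soft-target term
    hard = -s_logp[label]                                   # hard-target term
    return soft_weight * soft + (1.0 - soft_weight) * hard

def linear_schedule(step, total_steps, start=0.9, end=0.1):
    """Hypothetical schedule: lean on the teacher early, on the labels late."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

# Toy 3-token vocabulary; the ground-truth token is index 0.
student = [1.0, 0.5, -0.5]
teacher = [2.0, 0.0, -1.0]
for step in (0, 50, 100):
    w = linear_schedule(step, total_steps=100)
    print(step, round(w, 2), round(distillation_loss(student, teacher, 0, w), 4))
```

Setting `soft_weight` to 0 recovers ordinary cross‑entropy training, and setting it to 1 trains purely on the teacher's distribution.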

Training such a massive model required a revamped RL infrastructure, asynchronous online RL, and optimized MoE parallelism, yielding roughly a ten‑fold speedup over previous generations.

Safety and Protection

Llama 4 applies the AI‑safety best practices outlined in the Developer Use Guide: AI Protections at every stage of development, from pre‑training to post‑training, and offers adjustable system‑level mitigations to help protect developers from adversarial users.

For more details, see the original blog post: https://ai.meta.com/blog/llama-4-multimodal-intelligence

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
