Llama 4 Unveiled: Breakthrough Multimodal Models Redefine AI Capabilities
Meta's Llama 4 series introduces the Scout, Maverick, and Behemoth models—featuring Mixture‑of‑Experts architectures, unprecedented 10‑million‑token context windows, and state‑of‑the‑art performance across vision, language, and multimodal benchmarks—while emphasizing efficient training, open‑source availability, and robust safety safeguards.
Background
The Llama team announced the first batch of Llama 4 models, designed to enable highly personalized multimodal experiences.
Llama 4 Scout is a 17‑billion‑active‑parameter model with 16 experts (109 billion total parameters) that fits on a single H100 GPU (with Int4 quantization) and offers a 10‑million‑token context window, outperforming Gemma 3, Gemini 2.0 Flash‑Lite, and Mistral 3.1 in benchmarks.
Llama 4 Maverick also has 17 billion active parameters but with 128 experts. It surpasses GPT‑4o and Gemini 2.0 Flash on widely reported benchmarks and achieves results comparable to DeepSeek v3 on reasoning and coding tasks with fewer than half the active parameters, while delivering the best performance‑cost ratio; its experimental chat version scored 1417 ELO on LMArena.
Both models are open‑source and can be downloaded from llama.com or Hugging Face.
Pre‑training
Llama 4 adopts a Mixture‑of‑Experts (MoE) architecture, activating only a subset of parameters per token, which improves training and inference efficiency under a fixed FLOPs budget.
Llama 4 Maverick contains 17 billion active parameters and 400 billion total parameters, using alternating dense and MoE layers with 128 routed experts plus a shared expert, enabling deployment on a single H100 host or distributed inference.
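The shared‑plus‑routed expert layout described above can be sketched in a few lines. This is a toy illustration, not Meta's implementation: the class name ToyMoELayer, the linear "experts", the tiny sizes, and the choice of top‑1 routing with a softmax gate are all assumptions made for the example.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical MoE layer: one shared expert plus one routed expert per token."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.router = random_matrix(num_experts, dim, rng)   # gating weights
        self.shared = random_matrix(dim, dim, rng)           # always runs
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def forward(self, token):
        # Assumption: softmax gate with top-1 selection.
        probs = softmax(matvec(self.router, token))
        chosen = max(range(len(probs)), key=probs.__getitem__)
        shared_out = matvec(self.shared, token)
        routed_out = matvec(self.experts[chosen], token)
        # Only two expert blocks execute, however many experts exist in total.
        out = [s + probs[chosen] * r for s, r in zip(shared_out, routed_out)]
        return out, chosen


layer = ToyMoELayer(dim=8, num_experts=4)
output, expert_id = layer.forward([0.5] * 8)
total_blocks = 1 + len(layer.experts)
print(f"token routed to expert {expert_id}; 2 of {total_blocks} expert blocks ran")
```

The point of the sketch is the ratio in the last line: compute per token depends on the experts that actually run, while total parameter count grows with the number of experts stored.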
The models employ early‑fusion multimodal integration and a refined visual encoder based on MetaCLIP, trained separately alongside a frozen Llama model so that the encoder adapts to the LLM.
A new training technique, MetaP, reliably sets key hyper‑parameters (layer‑wise learning rates, initialization scales) that transfer well across batch sizes, model widths, depths, and token counts. Llama 4 is pretrained on 200 languages, with an overall data mixture of more than 30 trillion tokens, using FP8 precision and achieving 390 TFLOPs/GPU.
During mid‑training, the team extends context length to 256 K tokens and uses a “teacher” model, Llama 4 Behemoth (288 billion active parameters, ~2 trillion total parameters), to distill knowledge into smaller models.
Post‑training
Llama 4 offers multiple model sizes to suit diverse use‑cases. Llama 4 Maverick excels in image‑text understanding, long‑context summarization, and code reasoning, while Llama 4 Scout provides industry‑leading 10‑million‑token context for multi‑document tasks.
The post‑training pipeline follows lightweight supervised fine‑tuning (SFT) → online reinforcement learning (RL) → lightweight direct preference optimization (DPO). Over‑constraining with SFT/DPO can limit RL exploration, so the team removes >50 % of “simple” prompts and focuses RL on harder examples, dramatically improving reasoning and coding performance.
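As a rough illustration of the pruning step, the sketch below drops prompts that a judge has tagged as easy and reports how much of the set was removed. The Prompt class, the difficulty labels, and the sample data are hypothetical; the source does not describe how the tags are stored or what the judge's labels look like.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    text: str
    difficulty: str  # hypothetical judge tag: "easy", "medium", or "hard"


def drop_easy(prompts):
    """Keep only prompts that were not tagged as easy."""
    return [p for p in prompts if p.difficulty != "easy"]


# Made-up sample pool for illustration only.
pool = [
    Prompt("What is 2 + 2?", "easy"),
    Prompt("Name the capital of France.", "easy"),
    Prompt("Reverse a string in Python.", "easy"),
    Prompt("Prove that sqrt(2) is irrational.", "hard"),
    Prompt("Fix the race condition in this queue.", "medium"),
]

kept = drop_easy(pool)
removed_fraction = 1 - len(kept) / len(pool)
print(f"kept {len(kept)} prompts, removed {removed_fraction:.0%}")
```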
Dynamic filtering of zero‑advantage prompts and mixed‑ability prompt batches are crucial for mathematical and reasoning gains.
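To make "zero‑advantage" concrete, here is a minimal sketch that assumes a group‑relative advantage: each sampled response's reward minus the mean reward for the same prompt. The source does not say which advantage estimator is used, so this choice, the function names, and the sample rewards are assumptions. When every response to a prompt earns the same reward, all advantages are zero and the prompt contributes no learning signal, so it is filtered out.

```python
from statistics import mean


def group_advantages(rewards):
    """Advantage of each response relative to the group's mean reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def filter_zero_advantage(batch, eps=1e-8):
    """Drop prompts whose sampled responses all have (near-)zero advantage."""
    kept = {}
    for prompt, rewards in batch.items():
        advantages = group_advantages(rewards)
        if any(abs(a) > eps for a in advantages):
            kept[prompt] = advantages
    return kept


# Hypothetical rewards for four sampled responses per prompt.
batch = {
    "easy sum": [1.0, 1.0, 1.0, 1.0],         # always solved: no signal
    "hard proof": [0.0, 0.0, 0.0, 0.0],       # never solved: no signal
    "medium integral": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: useful
}

kept = filter_zero_advantage(batch)
print(kept)
```

Because the filter depends on the current policy's sampled rewards, the set of prompts that survive changes as training progresses, which is what makes the filtering dynamic.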
Scaling to 2 T‑parameter Behemoth
Llama 4 Behemoth serves as a high‑intelligence teacher model with 288 billion active parameters and ~2 trillion total parameters, achieving state‑of‑the‑art results on STEM benchmarks and guiding the distillation of smaller models.
The team introduced a new distillation loss that combines dynamically weighted soft targets with hard targets, allowing efficient knowledge transfer.
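One common way to combine soft and hard targets is a weighted sum of two cross‑entropy terms, one against the teacher's distribution and one against the ground‑truth label. The sketch below follows that pattern; the exact loss and weighting rule used for Llama 4 are not given in the source, so the linear schedule, the function names, and the example logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


def distillation_loss(student_logits, teacher_logits, label, soft_weight):
    """Weighted sum of soft-target and hard-target cross-entropy."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    # Soft target: cross-entropy between teacher and student distributions.
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    # Hard target: cross-entropy against the ground-truth token.
    hard = -math.log(p_student[label])
    return soft_weight * soft + (1.0 - soft_weight) * hard


def soft_weight_schedule(step, total_steps, start=0.9, end=0.1):
    """Hypothetical dynamic weighting: linear decay from start to end."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac


student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
for step in (0, 5, 9):
    w = soft_weight_schedule(step, total_steps=10)
    loss = distillation_loss(student, teacher, label=0, soft_weight=w)
    print(f"step {step}: soft_weight={w:.2f} loss={loss:.4f}")
```

A schedule like this leans on the teacher early and on ground‑truth labels later; other rules, such as weighting by teacher confidence, would fit the same interface.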
Training such a massive model required a revamped RL infrastructure, asynchronous online RL, and optimized MoE parallelism, yielding roughly a ten‑fold speedup over previous generations.
Safety and Protection
Llama 4 incorporates AI‑safety best practices from the Developer Use Guide: AI Protections at every stage—from pre‑training to post‑training—offering adjustable system‑level mitigations to protect developers from adversarial misuse.
For more details, see the original blog post at https://ai.meta.com/blog/llama-4-multimodal-intelligence .
21CTO