Meta Llama 4 Scout, Maverick, and Behemoth: Architecture, NoPE Innovation, and Training Advances
This article introduces Meta's newly open‑sourced Llama 4 series—Scout with a 10 million‑token context window, Maverick with 400 billion parameters, and the upcoming Behemoth teacher model—detailing their mixture‑of‑experts architecture, the NoPE positional‑encoding innovation, training pipelines, benchmark results, and infrastructure improvements for large‑scale AI research.
In early 2025 Google released Gemini 2.0 Pro with a 2‑million‑token context window, prompting Meta to launch Llama 4 Scout, which expands the context to 10 million tokens—enough to read an entire novel like *War and Peace* in a single pass.
Llama 4 Scout is a mixture‑of‑experts (MoE) model with 109 billion total parameters, 17 billion active parameters, and 16 experts, running on a single H100 GPU with native multimodal support for up to eight images. Its key architectural innovation is the NoPE (No Position Encoding) layer, which removes explicit positional embeddings; the model instead recovers absolute positions in the first layer and learns relative positions in deeper layers, improving length generalization and computational efficiency.
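The core NoPE idea can be illustrated with a minimal sketch: a causal self‑attention head in which no rotary or absolute position term is ever added to the queries and keys—the causal mask is the only positional signal. This is an illustrative NumPy toy, not Meta's implementation; all weight matrices here are random stand‑ins.

```python
import numpy as np

def causal_attention_nope(x, Wq, Wk, Wv):
    """One head of causal self-attention with *no* positional encoding (NoPE).

    No rotary or absolute position term is applied to q/k: the only positional
    signal is the causal mask, from which (per the NoPE argument) the first
    layer can recover absolute positions and deeper layers relative ones.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: token i may only attend to tokens j <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -1e9
    # Row-wise softmax over the unmasked scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, D = 5, 8  # toy sequence length and head dimension
x = rng.normal(size=(T, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
out = causal_attention_nope(x, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

Because nothing length‑dependent is baked into the encoding, the same function handles any sequence length unchanged—which is the intuition behind NoPE's stronger length generalization.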
Benchmarks show NoPE outperforms explicit positional encodings on length‑generalization tasks, achieving 0.69 accuracy on a 40‑token addition task versus <0.55 for other methods, and delivering higher efficiency on long‑sequence workloads.
Meta also open‑sourced Llama 4 Maverick, a 400 billion‑parameter mixture‑of‑experts model with 17 billion active parameters and 128 experts, supporting a 1 million‑token context window. Maverick uses a training pipeline of lightweight Supervised Fine‑Tuning (SFT), online Reinforcement Learning (RL), and Direct Preference Optimization (DPO). To avoid over‑constraining the model, Meta pruned 50 % of "easy" data from SFT, focusing the remaining data on more challenging prompts, and introduced a continuous online RL strategy that alternates training and selective prompt filtering.
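The data‑pruning step can be sketched as follows. This is a hypothetical illustration, not Meta's code: a difficulty scorer (here a trivial stand‑in) ranks prompts, and only the harder half survives into SFT.

```python
# Sketch of the "prune easy SFT data" idea: rank examples by a difficulty
# score and keep only the hardest fraction. The scorer is a placeholder;
# in practice a judge model would assign difficulty.

def prune_easy_examples(dataset, difficulty_fn, keep_fraction=0.5):
    """Keep the hardest `keep_fraction` of examples by difficulty score."""
    ranked = sorted(dataset, key=difficulty_fn, reverse=True)
    cutoff = int(len(ranked) * keep_fraction)
    return ranked[:cutoff]

# Toy stand-in: treat longer prompts as "harder".
toy_data = [
    "2+2?",
    "Prove that sqrt(2) is irrational.",
    "Hi",
    "Summarize the key obligations in this 90-page contract.",
]
hard_set = prune_easy_examples(toy_data, difficulty_fn=len)
print(len(hard_set))  # 2 -- the two longest prompts remain
```

The same filter‑then‑train loop generalizes to the online RL stage: after each training round, prompts the model already solves reliably are dropped, keeping compute focused on the frontier of difficulty.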
Meta is currently training a 2 trillion‑parameter teacher model, Llama 4 Behemoth, with 288 billion active parameters and 16 experts, intended for distillation and fine‑tuning of smaller models such as Maverick. Training such a massive model required a new asynchronous online RL framework that distributes models across multiple GPUs, improving training efficiency by roughly tenfold.
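The asynchronous pattern behind such a framework can be sketched in a few lines: generation workers push rollouts into a queue while a separate trainer thread consumes them, so the hardware doing generation never idles waiting for an optimizer step. This is an illustrative threading toy under assumed names (`generator`, `trainer`), not Meta's actual framework.

```python
import queue
import threading

# Minimal sketch of asynchronous online RL: decouple rollout generation
# from training via a bounded queue, the same producer/consumer shape a
# multi-GPU framework would use across devices.

rollouts = queue.Queue(maxsize=8)
trained = []

def generator(n):
    """Produce fake rollouts; in practice this is model inference on its own GPUs."""
    for i in range(n):
        rollouts.put({"prompt_id": i, "reward": i % 3})
    rollouts.put(None)  # sentinel: generation finished

def trainer():
    """Consume rollouts as they arrive; stands in for gradient updates."""
    while True:
        item = rollouts.get()
        if item is None:
            break
        trained.append(item["prompt_id"])

g = threading.Thread(target=generator, args=(12,))
t = threading.Thread(target=trainer)
g.start(); t.start()
g.join(); t.join()
print(len(trained))  # 12
```

The efficiency gain comes from overlap: while the trainer digests one batch of rollouts, the generators are already producing the next, instead of the two phases alternating in lockstep.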
Experimental results indicate that Llama 4 Scout, Maverick, and Behemoth outperform contemporaries like Gemini 2.0 Pro, GPT‑4o, and DeepSeek V3 on benchmarks such as MMLU‑Pro, GPQA, MathVista, and MATH‑500, demonstrating the effectiveness of the NoPE design, extensive multilingual data (30 trillion tokens across 200 languages), and the refined training pipeline.
All models are available on HuggingFace (https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164). The article concludes with a disclaimer that the material originates from Meta.