Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

Architect

Feb 24, 2025

On February 18, Moonshot AI released a paper on MoBA, a sparse‑attention framework that draws on Mixture‑of‑Experts ideas to extend large‑language‑model context length up to 10 million tokens while allowing a seamless switch between full and sparse attention for compatibility with existing pretrained models.

Early Design – MoBA v0.5

Development began in late May 2023 shortly after the team was formed. The initial goal was to pre‑train a 16‑billion‑parameter model with a 16 K token context, which quickly expanded to a 128 K requirement. The first architecture featured a two‑layer serial cross‑attention mechanism with a parameter‑free gate and added cross‑attention at every Transformer layer. Context Parallel concepts were incorporated by treating each data‑parallel node as a Mixture‑of‑Experts expert and integrating the early fmoe library into Nvidia’s Megatron‑LM. This version is referred to as MoBA v0.5.

First Revision – MoBA v1

Tim introduced a redesign that replaced the serial two‑layer scheme with parallel single‑layer attention, eliminating the extra parameters and allowing training to continue from the original model weights. MoBA v1 combined Sparse Attention with Context Parallel and delivered strong end‑to‑end speedups on 3 B and 7 B models. Training at larger scales, however, produced severe loss spikes, and the initial weighted‑sum aggregation proved unstable, which made problems such as the “Attention Sink” imbalance hard to debug.

To make debugging tractable, the team adopted an Online Softmax gate. With sparsity set to zero, the sparse model becomes mathematically equivalent to full attention, so its outputs can be compared directly against a full‑attention baseline.
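Why an online‑softmax combination makes this comparison possible can be shown with a small sketch. It is not the MoBA implementation: the helper names (`attend`, `merge`, `blockwise_attention`) and the toy sizes are assumptions made for illustration. Each block returns its own attention output plus the log‑sum‑exp of its scores, and the merge step rescales the partial outputs so that, when every block is selected, the result is the same as one softmax over all keys.

```python
import math
import random

def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z
           for d in range(len(values[0]))]
    return out, m + math.log(z)

def merge(partials):
    """Combine per-block (output, lse) pairs by rescaling with exp(lse - max)."""
    m = max(lse for _, lse in partials)
    scales = [math.exp(lse - m) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim)]

def blockwise_attention(q, keys, values, block_size, selected=None):
    """Attend inside each selected block, then merge; None selects every block."""
    n_blocks = math.ceil(len(keys) / block_size)
    if selected is None:
        selected = range(n_blocks)  # "sparsity zero": all blocks participate
    partials = []
    for b in selected:
        lo, hi = b * block_size, (b + 1) * block_size
        partials.append(attend(q, keys[lo:hi], values[lo:hi]))
    return merge(partials)

# Toy data (sizes are arbitrary assumptions).
random.seed(0)
dim, n = 4, 16
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

full, _ = attend(q, keys, values)
blocked = blockwise_attention(q, keys, values, block_size=4)
max_diff = max(abs(a - b) for a, b in zip(full, blocked))
print("max difference vs. full attention:", max_diff)
```

Passing a subset such as `selected=[0, 3]` gives the sparse case. Because the merge is exact, any remaining gap to the full‑attention baseline comes from the dropped blocks rather than from the aggregation itself.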

Stabilization – MoBA v2

After extensive discussion, the team separated Context Parallel from MoBA, reverting MoBA to a pure Sparse Attention design that can run on a single machine when memory permits. The resulting MoBA v2 matches full‑attention outputs on short sequences, follows a reliable scaling law, and scales smoothly to larger models without the previous loss spikes.
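The article does not spell out how blocks are chosen, so the following is a minimal sketch under assumptions taken from the linked technical report: keys are split into fixed‑size blocks, each earlier block is scored by the dot product between the query and the block's mean key, the highest‑scoring blocks are kept, and the query's own block is always included with a causal mask. The function names and the choice to count the current block toward `top_k` are illustrative, not the released API.

```python
def select_blocks(q, keys, q_pos, block_size, top_k):
    """Choose key blocks for the query at position q_pos (hypothetical helper)."""
    cur = q_pos // block_size  # block that contains the query itself
    scored = []
    for b in range(cur):  # only blocks strictly before the current one
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(len(q))]
        score = sum(a * c for a, c in zip(q, mean_key))
        scored.append((score, b))
    scored.sort(reverse=True)  # highest affinity first
    chosen = [b for _, b in scored[:max(top_k - 1, 0)]]
    return sorted(chosen + [cur])  # the current block is always attended

def visible_key_positions(q_pos, block_size, blocks):
    """Expand blocks to key indices, staying causal inside the current block."""
    positions = []
    for b in blocks:
        lo, hi = b * block_size, (b + 1) * block_size
        positions.extend(p for p in range(lo, hi) if p <= q_pos)
    return positions

# Toy example: 8 keys in 4 blocks of 2; the query sits at position 6 (block 3).
keys = ([[0.0, 1.0]] * 2) + ([[2.0, 0.0]] * 2) + ([[1.0, 0.0]] * 2) + ([[0.5, 0.5]] * 2)
q = [1.0, 0.0]

sparse_blocks = select_blocks(q, keys, q_pos=6, block_size=2, top_k=2)
dense_blocks = select_blocks(q, keys, q_pos=6, block_size=2, top_k=4)
sparse_positions = visible_key_positions(6, 2, sparse_blocks)
print(sparse_blocks, dense_blocks, sparse_positions)
```

In this sketch, whenever a sequence has no more blocks than `top_k`, every block is selected and the visible keys are exactly the causal prefix, which is one way to see why a design like this can match full attention on short sequences.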

MoBA v2 proved stable in pre‑training, passed extensive debugging, and was eventually deployed to production after confirming that its activation‑only version achieved all‑green results on downstream tests.

Challenges in SFT and Final Adjustments

During the SFT stage, a very sparse loss mask (often fewer than 1% of tokens receiving gradients) reduced learning efficiency on long‑document summarization tasks. Removing the loss mask dramatically improved performance, and the team later modified the final layers to use full attention, which increased gradient‑token density and restored learning efficiency. Experiments showed that this hybrid approach retained sparse‑attention benefits while matching full‑attention metrics at 1 M token length.
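The two quantities in this paragraph can be made concrete with a short sketch. Everything here is an assumption for illustration: the function names, the `"moba"`/`"full"` labels, and the layer and token counts are hypothetical, and the article does not say how many final layers were switched.

```python
def gradient_token_fraction(loss_mask):
    """Share of tokens that contribute to the loss (1 = trained on, 0 = masked)."""
    if not loss_mask:
        raise ValueError("loss_mask must not be empty")
    return sum(loss_mask) / len(loss_mask)

def layer_attention_plan(num_layers, num_full_last):
    """Per-layer attention type: sparse everywhere except the last few layers."""
    if not 0 <= num_full_last <= num_layers:
        raise ValueError("num_full_last must be between 0 and num_layers")
    return ["moba"] * (num_layers - num_full_last) + ["full"] * num_full_last

# A long prompt with a short supervised answer: 8 of 1000 tokens get gradients.
loss_mask = [0] * 992 + [1] * 8
fraction = gradient_token_fraction(loss_mask)

# Hypothetical 6-layer model whose last 2 layers use full attention.
plan = layer_attention_plan(num_layers=6, num_full_last=2)
print(fraction, plan)
```

A fraction this small means only a handful of positions send gradients back through the sparse layers, which is the situation the full‑attention final layers are meant to ease.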

Release and FAQ

The final, minimal MoBA implementation is open‑source (https://github.com/MoonshotAI/MoBA) and accompanied by a technical report (https://arxiv.org/abs/2502.13189). The FAQ clarifies that MoBA works for decoding, is most effective with multi‑head attention, less so with GQA/MQA, and that a Triton implementation existed but was discontinued due to maintenance cost.

Report: https://arxiv.org/abs/2502.13189

Code: https://github.com/MoonshotAI/MoBA
