Inside MoBA: A Sparse Attention Framework for 10‑Million‑Token Contexts

The article details the development, architectural evolution, and practical challenges of MoBA—a sparse attention framework inspired by Mixture‑of‑Experts that scales LLM context length to 10 M tokens, supports seamless switching between full and sparse attention, and is now released as a minimal open‑source solution.

Architect

On February 18, 2025, Moonshot AI released a paper on MoBA (Mixture of Block Attention), a sparse‑attention framework that borrows ideas from Mixture‑of‑Experts to extend large‑language‑model context length up to 10 million tokens. It can switch seamlessly between full and sparse attention, which keeps it compatible with existing pretrained models.

Early Design – MoBA v0.5

Development began in late May 2023, shortly after the team was formed. The initial goal was to pre‑train a 16‑billion‑parameter model with a 16 K token context, a requirement that quickly grew to 128 K. The first architecture used a two‑layer serial cross‑attention mechanism with a parameter‑free gate and added cross‑attention at every Transformer layer. It borrowed Context Parallel concepts by treating each data‑parallel node as a Mixture‑of‑Experts expert, and integrated the early fmoe library into NVIDIA’s Megatron‑LM. This version is referred to as MoBA v0.5.

Simple diagram of MoBA v0.5

First Revision – MoBA v1

Tim introduced a redesign that replaced the serial two‑layer scheme with a single layer of parallel attention, eliminating the extra parameters and allowing training to continue from the original model weights. MoBA v1 combined Sparse Attention with Context Parallel and delivered strong end‑to‑end speedups on 3 B and 7 B models. At larger scales, however, training produced severe loss spikes. The initial weighted‑sum aggregation proved unstable, and problems such as the “Attention Sink” imbalance were hard to debug.

To make debugging tractable, the team adopted an Online Softmax gate. With sparsity set to zero, the sparse model becomes mathematically equivalent to full attention, so its outputs can be compared directly against a full‑attention baseline.
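The equivalence can be illustrated with a small sketch. The code below is not taken from the MoBA repository; it is a minimal pure‑Python illustration, with hypothetical function names (attend, merge, blockwise_attention), of how per‑block attention results can be merged with online‑softmax rescaling. When every block is selected (zero sparsity), the merged result should agree with ordinary softmax attention up to floating‑point error.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query over a set of keys.

    Returns (output, score_max, exp_sum) so partial results can be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    score_max = max(scores)
    weights = [math.exp(s - score_max) for s in scores]
    exp_sum = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / exp_sum
           for d in range(dim)]
    return out, score_max, exp_sum


def merge(partials):
    """Combine per-block (output, score_max, exp_sum) triples.

    Each block is rescaled to a shared maximum, as in online softmax.
    """
    global_max = max(m for _, m, _ in partials)
    total = sum(s * math.exp(m - global_max) for _, m, s in partials)
    dim = len(partials[0][0])
    merged = [0.0] * dim
    for out, m, s in partials:
        block_weight = s * math.exp(m - global_max) / total
        for d in range(dim):
            merged[d] += block_weight * out[d]
    return merged


def blockwise_attention(q, keys, values, block_size, selected=None):
    """Attend to the selected key blocks only; None means all blocks."""
    n_blocks = (len(keys) + block_size - 1) // block_size
    if selected is None:
        selected = range(n_blocks)
    partials = []
    for b in selected:
        lo, hi = b * block_size, min((b + 1) * block_size, len(keys))
        partials.append(attend(q, keys[lo:hi], values[lo:hi]))
    return merge(partials)


if __name__ == "__main__":
    q = [0.3, -1.2]
    keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 1.0], [2.0, -1.0]]
    values = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    full, _, _ = attend(q, keys, values)
    blocked = blockwise_attention(q, keys, values, block_size=2)
    print(full, blocked)  # the two results should agree closely
```

Passing a subset of block indices as selected gives the sparse case, which is the same computation restricted to fewer keys. That is what makes a zero‑sparsity run usable as a debugging baseline.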

Simple diagram of MoBA v1

Stabilization – MoBA v2

After extensive discussion, the team separated Context Parallel from MoBA, reverting MoBA to a pure Sparse Attention design that can run on a single machine when memory permits. The resulting MoBA v2 matches full‑attention outputs on short sequences, follows a reliable scaling law, and scales smoothly to larger models without the previous loss spikes.
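One way to see why a block‑sparse design matches full attention on short sequences is to look at block selection. The sketch below is illustrative only and rests on stated assumptions rather than on the released code: each past block is scored by the dot product between the query and the mean of that block's keys, the query's own block is always kept and counts toward top_k, and future blocks are never eligible. The function name select_blocks and its parameters are hypothetical.

```python
def select_blocks(q, keys, q_pos, block_size, top_k):
    """Return the sorted indices of key blocks that one query attends to.

    Assumptions (illustrative, not the official implementation):
    - a block's score is dot(q, mean of the block's keys);
    - the block containing q_pos is always selected and counts toward top_k;
    - blocks after the query's own block are never selected (causality).
    """
    current = q_pos // block_size
    scored = []
    for b in range(current):  # only fully past blocks compete for a slot
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        score = sum(a * c for a, c in zip(q, mean_key))
        scored.append((score, b))
    scored.sort(reverse=True)
    chosen = [b for _, b in scored[:max(top_k - 1, 0)]]
    return sorted(chosen + [current])


if __name__ == "__main__":
    keys = [[0, 1], [0, 1], [5, 0], [3, 0], [1, 0], [1, 0], [2, 2], [2, 2]]
    q = [1, 0]
    # The last query position (7) lives in block 3 when block_size is 2.
    print(select_blocks(q, keys, q_pos=7, block_size=2, top_k=2))
    print(select_blocks(q, keys, q_pos=7, block_size=2, top_k=4))
```

Under these assumptions, whenever top_k is at least the number of blocks up to and including the query's own, every causal block is selected and the computation reduces to full causal attention.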

MoBA v2 proved stable in pre‑training and passed extensive debugging. It was eventually deployed to production once the team confirmed that its activation‑only version achieved all‑green results on downstream tests.

Current MoBA design (v2)

Challenges in SFT and Final Adjustments

During the SFT stage, a very sparse loss mask (often <1 % of tokens receiving gradients) caused efficiency drops on long‑document summarization tasks. Removing the loss mask dramatically improved performance, and the team later modified the final layers to use full attention, increasing gradient‑token density and restoring learning efficiency. Experiments showed that this hybrid approach retained sparse‑attention benefits while matching full‑attention metrics at 1 M token length.
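The two quantities in this section can be made concrete with a short sketch. It is a hypothetical illustration, not the team's code: layer_attention_plan and loss_token_density are made‑up helper names, and the layer counts and sequence lengths are arbitrary example values, since the article does not say how many final layers were switched to full attention.

```python
def layer_attention_plan(num_layers, num_full_last):
    """Sparse attention in the lower layers, full attention in the last few.

    Both arguments are illustrative; the article gives no concrete counts.
    """
    if not 0 <= num_full_last <= num_layers:
        raise ValueError("num_full_last must be between 0 and num_layers")
    return ["sparse"] * (num_layers - num_full_last) + ["full"] * num_full_last


def loss_token_density(loss_mask):
    """Fraction of tokens that contribute to the loss (mask value 1)."""
    return sum(loss_mask) / len(loss_mask)


if __name__ == "__main__":
    # Example: a 1000-token sample where only 8 answer tokens receive gradients.
    mask = [0] * 992 + [1] * 8
    print(loss_token_density(mask))    # well under 1 % of tokens
    print(layer_attention_plan(6, 2))  # last two layers use full attention
```

A density this low means most of the sequence produces no direct training signal, which is the situation the hybrid layout described above was meant to ease.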

Release and FAQ

The final, minimal MoBA implementation is open‑source (https://github.com/MoonshotAI/MoBA) and accompanied by a technical report (https://arxiv.org/abs/2502.13189). The FAQ clarifies that MoBA works for decoding, that it is most effective with multi‑head attention and less so with GQA/MQA, and that a Triton implementation existed but was discontinued due to maintenance cost.

Report: https://arxiv.org/abs/2502.13189
Code: https://github.com/MoonshotAI/MoBA