
Mega MoE vs SonicMoE: Which Will Lead the Next AI Speed Race?

SonicMoE, a new ultra‑fast Mixture‑of‑Experts implementation from Tri Dao and Ion Stoica’s team, achieves peak throughput on Nvidia Blackwell GPUs and outperforms DeepSeek’s DeepGEMM. Its algorithmic redesign decouples activation memory from expert granularity, and its fused, I/O‑aware kernels run up to twice as fast as existing MoE frameworks.

Machine Heart

May 4, 2026


Tri Dao (co‑author of FlashAttention) and Ion Stoica’s joint research team have released SonicMoE, an ultra‑fast Mixture‑of‑Experts (MoE) implementation that runs at peak throughput on Nvidia Blackwell GPUs and surpasses DeepGEMM, which DeepSeek open‑sourced earlier.

The official blog, code repository, and paper are linked below:

Blog: https://tridao.me/blog/2026/sonicmoe-blackwell/

Code: https://github.com/Dao-AILab/sonic-moe

Paper: https://arxiv.org/abs/2512.14080

To understand SonicMoE’s purpose, one must first grasp the MoE architecture: a large model contains many expert sub‑networks, and each input token is routed to only a small subset of experts, much like a hospital triaging patients to the most suitable specialty. This design reduces compute while enabling massive parameter counts, as seen in Mixtral 8x22B, DeepSeek V3.2, Kimi K2.5, and Qwen3.
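The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not SonicMoE's code: the function names `top_k_route` and `moe_forward`, the toy experts, and the choice to apply softmax over only the selected logits are all assumptions made for the example.

```python
import math

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    # (an assumed normalization; real routers differ in this detail).
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_forward(token, router_logits, experts, k):
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(token)
    for e, w in top_k_route(router_logits, k):
        y = experts[e](token)  # only the selected experts are ever evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each one scales the token by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]
routes = top_k_route(router_logits, k=2)
output = moe_forward([1.0, -1.0], router_logits, experts, k=2)
```

Only two of the four toy experts run for this token, which is where the compute saving comes from: parameter count grows with the number of experts, while per-token work grows only with `k`.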

Over the past two years, expert granularity has increased nine‑fold, while the fraction of activated experts has dropped to one‑tenth of its original value. Finer granularity, however, runs into two walls: (1) **VRAM**: intermediate activations scale with expert count and quickly exhaust GPU memory; (2) **Memory bandwidth**: tiny per‑expert workloads leave GPU compute under‑utilized while data movement dominates, with memory‑access intensity rising by up to 12× for typical fine‑grained MoE models such as Qwen3.
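A rough roofline-style calculation shows why smaller experts shift the balance toward data movement. The sketch below counts FLOPs and bytes for a single expert's up-projection matmul only; the layer sizes are hypothetical, the byte count assumes 2-byte elements and ignores caching and gather/scatter traffic, and it is not meant to reproduce the 12× figure quoted above.

```python
def arithmetic_intensity(tokens_per_expert, hidden, inter, bytes_per_elem=2):
    """FLOPs per byte moved for one (tokens x hidden) @ (hidden x inter) matmul."""
    flops = 2 * tokens_per_expert * hidden * inter
    elems_moved = (tokens_per_expert * hidden   # read activations
                   + hidden * inter             # read expert weights
                   + tokens_per_expert * inter) # write outputs
    return flops / (bytes_per_elem * elems_moved)

# Hypothetical layer: 8192 tokens, hidden size 4096, and a coarse baseline
# of 8 experts with 2 active per token and an intermediate width of 8192.
TOKENS, HIDDEN, BASE_EXPERTS, BASE_TOP_K, BASE_INTER = 8192, 4096, 8, 2, 8192

intensity_by_granularity = {}
for g in (1, 2, 4, 8, 16):
    # Scaling granularity by g: g times more experts, g times more active
    # experts, each g times narrower, so total active parameters stay fixed.
    experts, top_k, inter = BASE_EXPERTS * g, BASE_TOP_K * g, BASE_INTER // g
    tokens_per_expert = TOKENS * top_k // experts
    intensity_by_granularity[g] = arithmetic_intensity(tokens_per_expert, HIDDEN, inter)
```

In this toy setup each expert sees the same number of tokens at every granularity, but the matmul gets narrower, so the FLOPs-per-byte ratio falls steadily as `g` grows.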

Existing open‑source training tools like ScatterMoE and MoMoE struggle with these issues, especially as models become more fine‑grained. SonicMoE was built to eliminate them.

Core innovation 1: Decoupling activation memory from expert granularity. By redesigning the computation order, SonicMoE avoids caching any intermediate tensor whose size is proportional to the number of experts. It reorders the matrix‑multiplication steps so that the required gradients are derived on the fly, keeping per‑layer activation memory constant even as experts become finer‑grained, matching the memory footprint of a dense model.
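A back-of-envelope count makes the decoupling claim concrete. The sketch below is a simplified accounting model, not SonicMoE's actual bookkeeping: which tensors each scheme caches is an assumption chosen for illustration, and the function name and layer sizes are hypothetical.

```python
def cached_elements(tokens, hidden, inter_total, top_k, decoupled):
    """Count activation elements kept for the backward pass in one MoE layer."""
    inter = inter_total // top_k         # per-expert width shrinks as top_k grows
    per_token_inter = top_k * 2 * inter  # gate + up projections feeding SwiGLU
    if decoupled:
        # Assumed: keep only the layer input and the SwiGLU pre-activation.
        return tokens * (hidden + per_token_inter)
    # Assumed baseline: additionally keep the gathered inputs and the
    # per-expert outputs, each of size tokens * top_k * hidden.
    return tokens * (hidden + per_token_inter + 2 * top_k * hidden)

TOKENS, HIDDEN, INTER_TOTAL = 8192, 4096, 16384  # hypothetical sizes
top_ks = (2, 4, 8, 16)
baseline = [cached_elements(TOKENS, HIDDEN, INTER_TOTAL, k, False) for k in top_ks]
decoupled = [cached_elements(TOKENS, HIDDEN, INTER_TOTAL, k, True) for k in top_ks]
```

Under these assumptions the decoupled count is identical for every `top_k`, while the baseline grows linearly with the number of active experts, which is the qualitative behavior the article describes.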

Core innovation 2: I/O‑aware kernel fusion – Operations that were previously separate GPU kernels are merged. The “Gather‑fusion” technique integrates data‑movement directly into the matmul kernel, raising L2 cache hit rates from ~66 % to ~75 % and eliminating an extra memory read/write. Additionally, the SwiGLU activation is fused into the matmul epilogue, and the backward‑pass kernel overlaps data‑movement with computation using Nvidia’s asynchronous execution.
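The idea behind gather fusion can be shown without any GPU code. In the plain-Python sketch below, the unfused version first copies the routed token rows into a separate buffer and then multiplies, while the fused version reads rows through the index list inside the multiply itself. The function names are hypothetical and the real kernels operate on tiles in GPU memory; this only illustrates where the extra read and write disappears.

```python
def matmul(a, b):
    """Dense (rows x k) @ (k x cols) product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def gather_then_matmul(x, token_ids, w):
    # Unfused: materialize a contiguous copy of the routed rows first.
    # This buffer costs one extra write and one extra read of the data.
    gathered = [list(x[t]) for t in token_ids]
    return matmul(gathered, w)

def fused_gather_matmul(x, token_ids, w):
    # Fused: index into x while computing, so no intermediate buffer exists.
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(x[t], col)) for col in cols]
            for t in token_ids]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three tokens, hidden size 2
w = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # one expert's weight matrix
token_ids = [2, 0]                        # tokens routed to this expert
unfused = gather_then_matmul(x, token_ids, w)
fused = fused_gather_matmul(x, token_ids, w)
```

Both paths produce the same result; the difference is only in how many times the token rows travel through memory.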

The team also introduced QuACK, a unified software abstraction layer that expresses all MoE matrix‑multiplication kernels as a “main loop + customizable epilogue”. This design allows the same algorithmic logic to run on both Hopper (H100) and Blackwell (B200/B300) GPUs with only minimal, hardware‑specific modifications.
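The "main loop + customizable epilogue" pattern can be sketched as a matmul routine that takes the epilogue as a parameter. This is not QuACK's API: the names `gemm_with_epilogue` and `swiglu_epilogue` are hypothetical, and the layout that places gate columns in the first half of the output and up columns in the second half is an assumption for the example.

```python
import math

def gemm_with_epilogue(a, b, epilogue=lambda acc: acc):
    """Compute a @ b row by row, passing each accumulated row to the epilogue."""
    cols = list(zip(*b))
    out = []
    for row in a:
        # Main loop: accumulate one output row (stands in for one output tile).
        acc = [sum(x * y for x, y in zip(row, col)) for col in cols]
        # Epilogue: transform the accumulator before it is written out.
        out.append(epilogue(acc))
    return out

def swiglu_epilogue(acc):
    """SwiGLU on one row: silu(gate) * up, assuming [gate | up] column layout."""
    half = len(acc) // 2
    gate, up = acc[:half], acc[half:]
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]

a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 3.0], [0.0, 2.0]]
plain = gemm_with_epilogue(a, b)                    # identity epilogue
activated = gemm_with_epilogue(a, b, swiglu_epilogue)
```

Swapping the epilogue changes what the kernel emits without touching the accumulation loop, which is the property that lets one main loop serve several fused variants.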

Experiments on Nvidia B300 GPUs benchmarked six real‑world MoE configurations ranging from 7B to 685B parameters (including OLMoE, Qwen3‑235B, and DeepSeek‑V3.2). SonicMoE achieved:

54 % higher forward‑throughput and 35 % higher backward‑throughput than DeepSeek’s DeepGEMM (itself a high‑performance baseline).

21 % faster forward pass than the official Triton MoE example.

Up to nearly 2× speed‑up over widely used ScatterMoE and MoMoE frameworks.

Kernel‑level analysis attributes the speed‑up primarily to Gather‑fusion (the dominant factor) and, secondarily, to a faster grouped‑matmul implementation that leverages Blackwell’s CLC scheduler and 2CTA MMA technology, contributing an additional ~10 % gain.

When expert granularity is increased from the Mixtral era to the Kimi K2.5 level, traditional schemes see linear growth in activation memory per layer, whereas SonicMoE’s memory usage remains flat, expanding the feasible design space for future fine‑grained models under limited VRAM.

In conclusion, as hardware scaling slows, software innovations like SonicMoE become crucial “equalizers” for AI progress. The project is open‑sourced on GitHub and PyPI, supports H100 and the latest B200/B300 GPUs, and plans future extensions such as expert‑parallelism, MXFP8/FP4 precision, and support for Nvidia’s upcoming Rubin GPUs.
