DeepSeek Introduces Mega MoE and FP4 Indexer – Inside the New GPU Fusion Kernel

DeepSeek's latest DeepGEMM update adds Mega MoE, a fused GPU kernel that collapses the entire Mixture‑of‑Experts pipeline into a single launch and overlaps computation with NVLink communication. The release also unveils an FP4 indexer and FP8×FP4 precision experiments, signaling a push toward highly efficient large‑scale AI training.

Machine Heart

After a period of silence, DeepSeek released a DeepGEMM update (PR #304) that adds a new component called Mega MoE and introduces an FP4 indexer for MQA logits.

Mega MoE fuses the entire Mixture‑of‑Experts (MoE) computation flow—dispatch, two linear layers, SwiGLU activation, and combine—into a single "mega‑kernel" that runs on the GPU.
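To make the fused steps concrete, the sketch below walks through one MoE layer in plain NumPy: top‑k routing and dispatch, the two expert linear layers with SwiGLU between them, and the weighted combine. The shapes, expert count, and routing scheme are illustrative assumptions rather than DeepSeek's implementation; Mega MoE performs this same sequence inside one GPU kernel instead of as separate host‑side steps.

```python
# Minimal single-device sketch of the MoE steps that Mega MoE fuses.
# Shapes, expert count, and the routing scheme are illustrative assumptions.
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
num_tokens, d_model, d_ff = 8, 16, 32
num_experts, top_k = 4, 2

x = rng.standard_normal((num_tokens, d_model)).astype(np.float32)
router_w = rng.standard_normal((d_model, num_experts)).astype(np.float32)
w_gate = rng.standard_normal((num_experts, d_model, d_ff)).astype(np.float32)
w_up   = rng.standard_normal((num_experts, d_model, d_ff)).astype(np.float32)
w_down = rng.standard_normal((num_experts, d_ff, d_model)).astype(np.float32)

# 1) Routing: pick top-k experts per token and normalize their scores.
logits = x @ router_w
top_experts = np.argsort(-logits, axis=1)[:, :top_k]
top_scores = np.take_along_axis(logits, top_experts, axis=1)
top_weights = np.exp(top_scores) / np.exp(top_scores).sum(axis=1, keepdims=True)

out = np.zeros_like(x)
for e in range(num_experts):
    # 2) Dispatch: gather the tokens routed to expert e.
    token_idx, slot = np.nonzero(top_experts == e)
    if token_idx.size == 0:
        continue
    xe = x[token_idx]
    # 3) Expert FFN: linear -> SwiGLU -> linear.
    h = silu(xe @ w_gate[e]) * (xe @ w_up[e])
    ye = h @ w_down[e]
    # 4) Combine: scatter back, weighted by the router score.
    out[token_idx] += top_weights[token_idx, slot][:, None] * ye

print(out.shape)  # (8, 16)
```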

In the traditional MoE design, tokens are dispatched to separate expert kernels; each step launches its own kernel and incurs inter‑GPU data transfers, producing a compute‑wait‑communicate‑compute pattern that leaves the GPU idle between stages and reduces utilization.

Mega MoE eliminates this inefficiency by merging all steps into one kernel and, crucially, overlapping data communication with computation: while Tensor Cores perform arithmetic, NVLink transfers data simultaneously, removing the "wait" phases.
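The overlap principle can be illustrated at a higher level with CUDA streams in PyTorch: transfers are issued on one stream while matrix multiplies run on another, and each multiply waits only for its own chunk's transfer. This is a conceptual sketch with placeholder shapes that requires a CUDA device; Mega MoE achieves the effect inside a single fused kernel over NVLink rather than with host‑managed streams.

```python
# Sketch of overlapping data movement with computation using two CUDA streams.
# Chunk sizes and the matmul are placeholders; this is not Mega MoE's code.
import torch

assert torch.cuda.is_available()

copy_stream = torch.cuda.Stream()            # "communication" stream
compute_stream = torch.cuda.current_stream() # compute stream

num_chunks, n = 4, 2048
host_chunks = [torch.randn(n, n, pin_memory=True) for _ in range(num_chunks)]
weight = torch.randn(n, n, device="cuda")
dev_chunks = [torch.empty(n, n, device="cuda") for _ in range(num_chunks)]
copied = [torch.cuda.Event() for _ in range(num_chunks)]
results = []

# Issue all transfers on the copy stream; they proceed while compute runs.
with torch.cuda.stream(copy_stream):
    for i in range(num_chunks):
        dev_chunks[i].copy_(host_chunks[i], non_blocking=True)
        copied[i].record(copy_stream)

# Each matmul waits only for its own chunk's transfer, so transfers of later
# chunks overlap with computation on earlier ones instead of serializing.
for i in range(num_chunks):
    compute_stream.wait_event(copied[i])
    results.append(dev_chunks[i] @ weight)

torch.cuda.synchronize()
print(len(results), results[0].shape)
```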

This design directly improves GPU utilization, especially in multi‑card, large‑scale MoE scenarios, turning a fragmented sequence of kernel launches into a continuous, high‑throughput pipeline.

Beyond Mega MoE, DeepSeek is experimenting with FP8 × FP4 mixed‑precision arithmetic and a dedicated FP4 indexer for MQA logits, alongside GEMM restructuring and JIT‑accelerated compilation, aiming to push the limits of computational efficiency.
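FP4 here refers to a 4‑bit floating‑point format (E2M1), which can encode only eight magnitudes per sign and therefore leans on per‑block scaling. The sketch below shows the basic quantize‑and‑dequantize round trip for a block of scores; the block size and round‑to‑nearest rule are assumptions for illustration, not the actual FP4 indexer.

```python
# Illustrative per-block FP4 (E2M1) quantize/dequantize round trip.
# E2M1 magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus a sign bit.
# Block size and round-to-nearest are assumptions, not DeepSeek's scheme.
import numpy as np

E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fp4_quantize_dequantize(x, block_size=16):
    x = x.astype(np.float32)
    out = np.empty_like(x)
    scales = []
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        # One shared scale per block so the largest magnitude maps to 6.0.
        scale = np.abs(block).max() / 6.0
        scale = scale if scale > 0 else 1.0
        scaled = block / scale
        # Round each element to the nearest representable E2M1 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_LEVELS[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(scaled) * E2M1_LEVELS[idx] * scale
        scales.append(scale)
    return out, np.array(scales, dtype=np.float32)

scores = np.random.default_rng(0).standard_normal(32).astype(np.float32)
deq, scales = fp4_quantize_dequantize(scores)
print("max abs error:", np.abs(scores - deq).max())
```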

The team notes that Mega MoE is still under development and performance numbers will be released later, indicating ongoing tuning across different scales, topologies, and workloads.

DeepGEMM is a unified high‑performance Tensor Core kernel library that integrates key primitives for modern large language models: GEMM in FP8, FP4, and BF16; communication‑overlapped fused MoE (Mega MoE); MQA scoring for the lightning indexer; and HyperConnection. All of these are compiled at runtime by a lightweight JIT module, so installation requires no CUDA compilation.
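The runtime‑JIT idea, compiling CUDA source the first time a kernel is requested rather than during installation, can be sketched with CuPy's RawKernel, which calls NVRTC under the hood. This only illustrates the mechanism; it is not DeepGEMM's JIT module, and the kernel shown is a trivial placeholder.

```python
# Sketch of install-free, runtime JIT compilation of a CUDA kernel via NVRTC,
# here through CuPy's RawKernel. Not DeepGEMM's JIT module or kernels.
import cupy as cp

saxpy_src = r"""
extern "C" __global__
void saxpy(const float a, const float* x, const float* y, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
"""

# Compiled with NVRTC the first time it is launched; cached afterwards.
saxpy = cp.RawKernel(saxpy_src, "saxpy")

n = 1 << 20
x = cp.arange(n, dtype=cp.float32)
y = cp.ones(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))

assert cp.allclose(out, 2.0 * x + y)
```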

This update represents a foundational infrastructure refactor: DeepSeek is turning MoE from a theoretically appealing but engineering‑heavy architecture into a practical, scalable, and efficient solution.

According to community observations, the hardware used for training likely includes Nvidia's latest B‑series AI accelerators rather than the previously rumored domestic AI cards.

Tags: Mixture of Experts, DeepSeek, DeepGEMM, FP4 Indexer, GPU Fusion, Mega MoE