How Tsinghua & Tencent Mixed‑X Won the MLSys 2026 MoE Inference Challenge with a 4.1× Speedup

The Tsinghua‑Tencent Mixed‑X team won the MLSys 2026 MoE inference optimization championship by profiling NPU bottlenecks and redesigning data movement. Expert‑level sharding, contiguous DMA transfers, batched PSUM reads, and an agent‑based optimizer together delivered a 4.1× end‑to‑end speedup while keeping outputs bit‑identical to the baseline.

Tencent Technical Engineering

MLSys 2026 hosted a MoE inference optimization contest that required participants to write custom operators for a designated NPU, reduce end‑to‑end latency, and keep outputs bit‑aligned with the baseline. The Tsinghua‑Tencent Mixed‑X team won with a full‑stack solution that sped up end‑to‑end inference by 4.1×.

1. Performance analysis

Initial profiling on the NPU showed the MoE module consumed ~0.2 ms per layer, accounting for over 82 % of the per‑step generation latency. The dominant cost came from weight loading and intermediate data movement. DMA bandwidth was under‑utilized, and the attention module added another ~40 µs per layer due to frequent layout conversions and data shuffling.

2. Core optimizations

E‑Shard expert splitting: instead of tensor‑parallel sharding along the hidden dimension (96‑dim), the team split the work by expert across the two cores (LNC0 handles experts 0‑3, LNC1 handles experts 4‑7). Each core could then load its Gate and Up matrices with a single contiguous DMA transfer, eliminating fragmented transfers.
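The effect of the two sharding schemes on transfer fragmentation can be illustrated with a small layout model. This is a minimal sketch, not the team's kernel: it assumes one weight matrix stored contiguously expert by expert, and the row count per expert, row size, and function names are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

NUM_EXPERTS = 8         # from the article: experts 0-7
NUM_CORES = 2           # from the article: LNC0 and LNC1
ROWS_PER_EXPERT = 192   # assumption: two 96-row shards per expert
ROW_BYTES = 1024        # assumption: hypothetical row size in bytes

Range = Tuple[int, int]  # half-open byte range [start, end)


def merge_contiguous(ranges: List[Range]) -> List[Range]:
    """Coalesce byte ranges that touch; each merged range models one transfer."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def tensor_parallel_ranges(core: int) -> List[Range]:
    """Each core reads its slice of the hidden dimension from every expert."""
    shard_rows = ROWS_PER_EXPERT // NUM_CORES
    ranges = []
    for expert in range(NUM_EXPERTS):
        first_row = expert * ROWS_PER_EXPERT + core * shard_rows
        ranges.append((first_row * ROW_BYTES, (first_row + shard_rows) * ROW_BYTES))
    return merge_contiguous(ranges)


def expert_shard_ranges(core: int) -> List[Range]:
    """Each core reads all rows of its own block of experts."""
    experts_per_core = NUM_EXPERTS // NUM_CORES
    first_expert = core * experts_per_core
    ranges = []
    for expert in range(first_expert, first_expert + experts_per_core):
        first_row = expert * ROWS_PER_EXPERT
        ranges.append((first_row * ROW_BYTES, (first_row + ROWS_PER_EXPERT) * ROW_BYTES))
    return merge_contiguous(ranges)


for core in range(NUM_CORES):
    print(
        f"core {core}: tensor-parallel transfers = {len(tensor_parallel_ranges(core))}, "
        f"expert-shard transfers = {len(expert_shard_ranges(core))}"
    )
```

Under these assumptions, hidden-dimension sharding leaves a gap between every pair of slices a core needs, while expert sharding gives each core one unbroken byte range.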

PSUM batch read: multiple PSUM storage blocks were packed into a 3‑D tensor and read out with one batch copy, reducing the number of launch instructions and associated overhead.

GEMV path for decoding: a dedicated GEMV pipeline was built to use all PSUM banks concurrently, improving parallelism for the decode workload, where many matrix multiplications degenerate to matrix‑vector products (GEMV).

Scalar‑engine DMA for the first expert: a low‑latency scalar DMA was used to start the first expert’s weight transfer earlier, smoothing the pipeline start‑up.

Compiler prefetch suppression: MoE weights were placed in stack space to block the compiler’s automatic prefetch, reserving DMA bandwidth for the critical path.

3. Precision alignment

The contest required bit‑level output alignment. The baseline accumulated expert outputs with an FP32→BF16 truncation after each expert, so rounding error built up across the eight experts. The E‑Shard implementation accumulated FP32 results within each core, performed a single cross‑core reduction, and truncated only once, matching the baseline’s accumulation order. RMSNorm, routing, gating, activation, and data‑type conversions were also aligned to the baseline.
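Why the placement of the FP32→BF16 truncation matters for bit-level alignment can be shown with a small numeric experiment. The following is a sketch under stated assumptions, not a model of the contest kernels: BF16 conversion is emulated as pure truncation (keeping the top 16 bits of the float32 encoding), and the eight contribution values are made up for illustration.

```python
import struct
from typing import List


def to_f32(x: float) -> float:
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def truncate_to_bf16(x: float) -> float:
    """Emulate FP32 -> BF16 by truncation: keep only the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def truncate_after_each(contribs: List[float]) -> float:
    """Add one contribution at a time and truncate the running sum to BF16 every step."""
    acc = 0.0
    for c in contribs:
        acc = truncate_to_bf16(to_f32(acc + to_f32(c)))
    return acc


def truncate_once(contribs: List[float]) -> float:
    """Accumulate in float32 and truncate to BF16 a single time at the end."""
    acc = 0.0
    for c in contribs:
        acc = to_f32(acc + to_f32(c))
    return truncate_to_bf16(acc)


# Hypothetical contributions from eight experts (illustrative values only).
contribs = [1.0] + [0.003] * 7

per_step = truncate_after_each(contribs)
single = truncate_once(contribs)
print(per_step, single, per_step == single)
```

With these inputs the two strategies produce different BF16 values, because each small contribution is lost when the running sum is truncated after every step. A kernel that must be bit-identical to a reference therefore has to control both the accumulation order and the point of truncation.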

4. Agent‑based optimizer “Knight”

Knight automates hypothesis generation, code modification, remote validation, failure attribution, and experience consolidation. It stores iteration logs in SQLite, reuses verified knowledge from a Skills library (NPU compilation, profiling, access‑pattern analysis, etc.), and enforces constraints to prevent reward‑hacking and ensure both correctness and performance.
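The logging and constraint-checking part of such a loop might look like the sketch below. Knight's actual schema and interfaces are not described in the article, so the table layout, the `Trial` type, the function names, and the trial latencies are all hypothetical; only the use of SQLite and the requirement that a change be both correct and faster come from the text.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Trial:
    hypothesis: str     # what the agent tried
    bit_exact: bool     # did remote validation match the baseline bit for bit?
    latency_ms: float   # measured per-step latency


def open_log() -> sqlite3.Connection:
    """Create an in-memory iteration log (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE iterations (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               hypothesis TEXT NOT NULL,
               bit_exact INTEGER NOT NULL,
               latency_ms REAL NOT NULL,
               accepted INTEGER NOT NULL,
               note TEXT NOT NULL)"""
    )
    return conn


def record(conn: sqlite3.Connection, trial: Trial, best_ms: float) -> float:
    """Log a trial and return the new best latency.

    A trial is accepted only if it is bit-exact AND faster, so a fast but
    incorrect kernel can never count as progress.
    """
    if not trial.bit_exact:
        accepted, note = False, "rejected: output not bit-exact"
    elif trial.latency_ms >= best_ms:
        accepted, note = False, "rejected: no latency improvement"
    else:
        accepted, note = True, "accepted"
    conn.execute(
        "INSERT INTO iterations (hypothesis, bit_exact, latency_ms, accepted, note) "
        "VALUES (?, ?, ?, ?, ?)",
        (trial.hypothesis, int(trial.bit_exact), trial.latency_ms, int(accepted), note),
    )
    conn.commit()
    return trial.latency_ms if accepted else best_ms


conn = open_log()
best_ms = 12.63  # baseline per-step decode latency reported in the article
# Placeholder trials; these latencies are invented for the example.
trials = [
    Trial("skip the final dtype cast", bit_exact=False, latency_ms=4.0),
    Trial("shard work by expert", bit_exact=True, latency_ms=8.0),
    Trial("add an extra barrier", bit_exact=True, latency_ms=9.0),
]
for trial in trials:
    best_ms = record(conn, trial, best_ms)

for row in conn.execute("SELECT hypothesis, note FROM iterations ORDER BY id"):
    print(row)
```

Keeping rejected trials in the log, together with the reason for rejection, is what would let later iterations reuse failure attributions instead of repeating the same experiment.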

5. Results

After the optimizations, end‑to‑end inference time dropped from 14.91 s to 3.56 s (≈4.1×). Per‑step decode latency fell from 12.63 ms to 5.45 ms. Single‑layer MoE latency dropped from ~204 µs to ~53 µs, and DMA utilization during weight loading rose to ~80 %.
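The speedup implied by each pair of figures can be recomputed directly. The short sketch below uses only the numbers quoted above; the dictionary keys and the helper name are illustrative.

```python
# (before, after) pairs as quoted in the article; the MoE values are approximate.
reported = {
    "end-to-end (s)": (14.91, 3.56),
    "per-step decode (ms)": (12.63, 5.45),
    "single-layer MoE (us)": (204.0, 53.0),
}


def speedup(before: float, after: float) -> float:
    """Ratio of the old time to the new time."""
    return before / after


ratios = {name: round(speedup(b, a), 2) for name, (b, a) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```

With the figures as quoted, the ratios come out to roughly 4.19× end to end, 2.32× per decode step, and 3.85× for a single MoE layer, so the end-to-end ratio lands slightly above the 4.1× headline figure.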

6. Outlook

The team plans to extend their NPU‑focused inference system to larger‑scale heterogeneous platforms and continue improving the Agent‑driven optimizer by integrating hardware knowledge bases, compiler feedback, and profiling data for sustained multi‑backend performance gains.
