How China’s MUSA GPU Backend Earned Native Support in SGLang’s Mainline

The recent SGLang × MUSA meetup revealed that MUSA’s GPU backend has been merged into SGLang’s official codebase, delivering zero‑learning‑cost integration, performance gains of up to 66 % on DeepSeek‑V4, and a growing ecosystem of adapters, high‑performance kernels, and distributed inference support.

Machine Heart

The SGLang × MUSA meetup gathered core developers from some of the most active large‑model inference projects (SGLang, TileLang, Triton, Mooncake) and highlighted that Moore Threads, the domestic GPU vendor behind the MUSA architecture, is no longer just chasing ecosystem adoption but is now co‑creating the global open‑source AI stack.

Following the release of DeepSeek V4, Moore Threads (摩尔线程) quickly validated the model on SGLang, completing a full end‑to‑end adaptation chain from hardware kernels to distributed deployment. Crucially, the MUSA backend was officially added to SGLang’s upstream support list, with the code merged into the main branch.

Key technical outcomes include:

Direct invocation of Moore Threads’ full‑function GPUs from SGLang, without any third‑party adaptation layer.

The torchada compatibility layer bridges CUDA primitives to MUSA with a single import, eliminating manual code rewrites (a minimal import sketch follows this list).

The MATE (MUSA AI Tensor Engine) library supplies high‑performance Attention and GEMM kernels, integrating FlashAttention, FlashMLA, and DeepGEMM interfaces.

Native FP8 support on the MTT S5000 card, with partial GGUF and INT4 quantisation already available.

Distributed inference covering tensor, pipeline, data, context, and expert parallelism (TP/PP/DP/CP/EP), built on MCCL and a custom all‑reduce, with Mooncake‑style prefill‑decode (PD) separation; a launch sketch follows this list.

Performance measurements on DeepSeek‑V4 show a 56.7 % reduction in time‑to‑first‑token (TTFT) latency and a 65.7 % increase in throughput when using Moore Threads’ tensor‑acceleration engine and FlagOS tuning.
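
To make the single‑import claim concrete, here is a minimal sketch of how such a compatibility layer is typically used. The module name torchada is taken from the description above, while the device string and the import’s side effects are assumptions rather than a confirmed API.

    import torch
    import torchada  # assumed import; the layer is described as bridging CUDA code to MUSA on import

    # Existing CUDA-style tensor code is expected to run unchanged once the backend is registered.
    x = torch.randn(2048, 2048, device="musa")  # "musa" device string is an assumption
    y = torch.randn(2048, 2048, device="musa")
    z = torch.matmul(x, y)                      # dispatched to the MUSA GPU rather than CUDA
    print(z.device)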
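
The FP8 and parallelism items above map onto ordinary SGLang server arguments. The sketch below uses SGLang’s offline Engine API with argument names from upstream SGLang; whether a MUSA build accepts exactly these arguments is an assumption, and the checkpoint path is a placeholder.

    import sglang as sgl

    # Offline engine sketch; names follow upstream SGLang and may differ on MUSA builds.
    llm = sgl.Engine(
        model_path="/path/to/deepseek-checkpoint",  # placeholder path
        tp_size=8,                                  # tensor parallelism across 8 GPUs
        quantization="fp8",                         # native FP8 as described above
    )

    out = llm.generate(
        ["Explain prefill-decode separation in one sentence."],
        {"temperature": 0.0, "max_new_tokens": 64},
    )
    print(out[0]["text"])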

Since January, Moore Threads has submitted 47 pull requests to SGLang, 41 of which have been merged, completing the full stack from environment setup to distributed inference. The roadmap includes runtime LLM support, AOT kernel integration, multimodal generation, Docker/CI pipelines, and broader hardware support (GB200/GB300, AMD, Intel, TPU).

Community contributions were showcased:

SGLang core developer BBuf discussed Prefill‑Decode separation, hierarchical caching, and zero‑overhead speculative decoding.

BAAI compiler researcher Xiao Hang presented FlagOS, a Triton‑based operator ecosystem with 497 operators and cross‑chip portability, achieving up to a 4× speed‑up on fused MoE and FP8 GEMM kernels (a minimal Triton kernel sketch follows this list).

TileLang maintainer Tang Zhengju demonstrated how ~50 lines of Tile code can produce FlashAttention‑level kernels with >20× acceleration on attention‑sink operators.

Mooncake contributor Ma Teng explained the distributed inference architecture, zero‑copy RDMA transport, KV‑Cache pooling, and fault‑tolerant EP (expert parallelism) handling that cuts RL weight‑update sync time from 53 s to 7.2 s, roughly a 7.4× speed‑up.
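
To ground the “Triton‑based” description of FlagOS, the sketch below is the canonical Triton vector‑add kernel. It is not a FlagOS operator, only an illustration of the kernel‑programming model such an operator library targets; it assumes a working Triton backend for whichever GPU is in use.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Launch a 1D grid with enough programs to cover every element.
        out = torch.empty_like(x)
        n = out.numel()
        grid = (triton.cdiv(n, 1024),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out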

The article concludes that the real breakthrough is not a single demo but the establishment of a stable, co‑developed ecosystem recognized by the world’s leading open‑source inference projects, positioning domestic GPUs as first‑class participants in the AI production stack.
