Domestic GPU Trains AI to Write Its Own Kernels—Moore Threads Tops KernelBench

MusaCoder‑27B‑RL, the first open‑source large model fully trained on a domestic GPU stack, achieved an 88.6% pass rate on the Stanford‑Princeton KernelBench benchmark, ahead of leading foreign models, and its generated kernels run at least 1.1× faster than the baseline kernels.

Machine Heart

Moore Threads recently announced that its MusaCoder‑27B‑RL model secured first place on the AI‑generated GPU kernel benchmark KernelBench, surpassing models such as Claude Opus, GLM‑5.1 and Kimi K2.6. The model was released and open‑sourced only a week earlier.

MusaCoder is a specialized large language model that translates PyTorch code into native CUDA and MUSA kernels. By automating the creation of low‑level GPU operators, it reduces the need for developers to hand‑write such kernels. It is the first open‑source code model trained end‑to‑end on a domestic GPU platform, with the entire post‑training pipeline executed on the MTT S5000‑based "夸娥" (KUAE) AI compute cluster.

KernelBench, introduced by Stanford and Princeton in 2025, provides a real‑world engineering environment for measuring a model’s ability to generate efficient GPU kernels. The benchmark defines a task as: given a PyTorch model architecture, generate a custom C/C++ CUDA kernel that replaces the original operator and yields actual performance gains. It contains over 250 PyTorch tasks across four difficulty levels, from basic operators (convolution, matrix multiplication) to production‑grade model optimizations. The benchmark requires not only functional correctness but also a speedup above a user‑defined threshold.
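The combined requirement of correctness plus a speedup threshold can be sketched as a single score. The KernelBench authors describe a metric of this shape under the name fast_p; the `TaskResult` type and the sample numbers below are made up for illustration and are not benchmark data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    correct: bool    # generated kernel matched the reference output
    speedup: float   # baseline runtime divided by generated-kernel runtime


def fast_p(results: List[TaskResult], p: float) -> float:
    """Fraction of tasks that are both correct and more than p times faster."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.correct and r.speedup > p)
    return hits / len(results)


# Illustrative results only (hypothetical values).
results = [
    TaskResult(correct=True, speedup=1.8),
    TaskResult(correct=True, speedup=0.9),
    TaskResult(correct=False, speedup=2.5),  # fast but wrong: never counts
    TaskResult(correct=True, speedup=1.2),
]

print(fast_p(results, 0.0))  # threshold 0: correctness only
print(fast_p(results, 1.1))  # correct and more than 1.1x faster
```

With the threshold at 0 the score reduces to a plain correctness rate; raising it asks for a real performance gain on top of correctness.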

When KernelBench launched, DeepSeek R1 achieved only a 30% pass rate (i.e., code that could run). MusaCoder now reaches an 88.6% pass rate, and its generated kernels run at least 1.1× faster than the baseline. For reproducibility, the article points to the paper "MusaCoder: Native GPU Kernel Generation with Full‑Stack Training on Moore Threads GPU" (arXiv:2606.04847) and to the model weights on Hugging Face.

The core of MusaCoder’s success is the MooreEval execution‑verification protocol. MooreEval is a scalable, distributed evaluation environment that compiles, runs, and profiles generated kernels, providing structured feedback and reward signals for reinforcement learning. Each candidate kernel passes through staged checks: interface and compilation, correctness, anti‑cheat detection, and performance testing. Only kernels that clear a stage proceed further. A hierarchical reward function scores the code and emits detailed diagnostics that are transformed into natural‑language feedback for the model.
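The staged checks and hierarchical reward described above can be sketched as a short gate-by-gate evaluator. This is a minimal illustration, not MooreEval's implementation: the stage names follow the article, but the point values, the dictionary fields on a candidate, and the thresholds are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A stage is (name, check function, points awarded on success).
# Each check returns (passed, diagnostic detail). Point values are hypothetical.
Stage = Tuple[str, Callable[[dict], Tuple[bool, str]], int]


@dataclass
class Verdict:
    reward: int                  # points earned before the first failure
    failed_stage: Optional[str]  # None when every stage passed
    feedback: str                # natural-language message for the model


def evaluate(candidate: dict, stages: List[Stage]) -> Verdict:
    """Run stages in order and stop at the first failure."""
    reward = 0
    for name, check, points in stages:
        passed, detail = check(candidate)
        if not passed:
            return Verdict(reward, name, f"Stage '{name}' failed: {detail}")
        reward += points
    return Verdict(reward, None, "All stages passed.")


# Toy checks over a pre-measured candidate (fields are assumptions).
STAGES: List[Stage] = [
    ("compile", lambda c: (c["compiles"], "compiler reported errors"), 1),
    ("correctness", lambda c: (c["max_abs_err"] < 1e-4,
                               f"max abs error {c['max_abs_err']}"), 3),
    ("anti_cheat", lambda c: (not c["calls_reference"],
                              "kernel delegates to the reference operator"), 1),
    ("performance", lambda c: (c["speedup"] >= 1.1,
                               f"speedup {c['speedup']}x is below 1.1x"), 5),
]

good = {"compiles": True, "max_abs_err": 1e-6, "calls_reference": False, "speedup": 1.3}
slow = dict(good, speedup=1.0)
cheat = dict(good, calls_reference=True)

for cand in (good, slow, cheat):
    print(evaluate(cand, STAGES))
```

Because a later stage is only reached after earlier ones pass, a candidate that compiles and is correct but slow still earns partial credit, which gives the learner a denser signal than a single pass/fail bit.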

MooreEval’s architecture decouples compilation (CPU‑bound) from execution (GPU‑bound) via asynchronous workers, enabling high‑throughput validation on the "夸娥" (KUAE) cluster. The system acts as an automatic examiner: it checks that generated kernels compile, that they produce correct results without gaming the checks, and that they deliver real speedups.
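The decoupling of compilation from execution can be illustrated with two worker pools joined by queues. This is a sketch under stated assumptions: the worker counts, the function names, and the use of `asyncio` are illustrative choices, and the compile and run steps are stand-ins rather than real compiler or GPU calls.

```python
import asyncio
from typing import List


async def compile_worker(sources: asyncio.Queue, compiled: asyncio.Queue) -> None:
    """CPU-side worker: takes kernel source, hands a 'binary' to the next queue."""
    while True:
        name = await sources.get()
        await asyncio.sleep(0)             # stand-in for invoking a compiler
        await compiled.put(f"{name}.bin")  # enqueue before marking the source done
        sources.task_done()


async def execute_worker(compiled: asyncio.Queue, results: List[str]) -> None:
    """GPU-side worker: takes a compiled artifact and records a run result."""
    while True:
        binary = await compiled.get()
        await asyncio.sleep(0)             # stand-in for launching and timing the kernel
        results.append(f"{binary}:ok")
        compiled.task_done()


async def run_pipeline(kernels: List[str], n_compilers: int = 4, n_gpus: int = 2) -> List[str]:
    sources: asyncio.Queue = asyncio.Queue()
    compiled: asyncio.Queue = asyncio.Queue()
    results: List[str] = []
    for name in kernels:
        sources.put_nowait(name)

    workers = [asyncio.create_task(compile_worker(sources, compiled))
               for _ in range(n_compilers)]
    workers += [asyncio.create_task(execute_worker(compiled, results))
                for _ in range(n_gpus)]

    await sources.join()    # every source has been compiled and enqueued
    await compiled.join()   # every compiled artifact has been executed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(results)


print(asyncio.run(run_pipeline(["matmul", "softmax", "conv2d"])))
```

Sizing the two pools independently is the point of the split: many cheap compile workers can keep a small number of scarce GPU workers busy.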

MusaCoder’s full‑stack post‑training pipeline further enhances its capabilities. Data construction follows a three‑stage pipeline: (1) aggregating real‑world GitHub code and NNSmith‑generated graphs to build a large PyTorch‑CUDA/MUSA task set with injected GPU programming fundamentals; (2) adding explicit shape information and structured reasoning to improve the model’s understanding of tensor layouts and indexing; (3) multi‑round interaction where the model receives compilation errors, runtime crashes, and performance bottlenecks, enabling targeted repairs and mitigating sparse‑reward issues.
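The multi-round interaction in stage (3) amounts to a generate, verify, and repair loop. The sketch below assumes a hypothetical `generate(task, feedback)` callable standing in for the model and a `verify(code)` callable standing in for the evaluation environment; both toy implementations are invented for illustration.

```python
from typing import Callable, List, Optional, Tuple


def repair_loop(
    generate: Callable[[str, Optional[str]], dict],
    verify: Callable[[dict], Tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[dict, List[Tuple[int, bool, str]]]:
    """Ask for a kernel, feed diagnostics back, and stop once verification passes."""
    feedback: Optional[str] = None
    history: List[Tuple[int, bool, str]] = []
    code: dict = {}
    for round_idx in range(max_rounds):
        code = generate(task, feedback)
        ok, feedback = verify(code)
        history.append((round_idx, ok, feedback))
        if ok:
            break
    return code, history


def toy_generate(task: str, feedback: Optional[str]) -> dict:
    """Hypothetical model stand-in: fixes whatever the last diagnostic named."""
    if feedback is None:
        return {"compiles": False, "fast": False}
    if "compile" in feedback:
        return {"compiles": True, "fast": False}
    return {"compiles": True, "fast": True}


def toy_verify(code: dict) -> Tuple[bool, str]:
    """Hypothetical verifier stand-in returning (passed, diagnostic)."""
    if not code["compiles"]:
        return False, "compile error: undefined symbol"
    if not code["fast"]:
        return False, "performance: slower than baseline"
    return True, "ok"


final_code, history = repair_loop(toy_generate, toy_verify, "softmax")
for entry in history:
    print(entry)
```

Each failed round still yields a concrete diagnostic to condition on, which is how this kind of loop eases the sparse-reward problem the article mentions.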

To stabilize reinforcement learning, the team introduces three mechanisms: PrimeEcho, a multi‑round reward anchored on the quality of the first‑round generation; Buffered Dynamic Retry (BDR), which repackages completely failed samples into a dynamic cache for low‑probability learning; and MirrorPop, a precise filter that discards high‑risk samples. Experiments show that each mechanism contributes a measurable performance gain.
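The article gives no implementation detail for BDR, so the following is only one possible reading of the idea, not the authors' method. It assumes that "completely failed" means every rollout for a prompt earned zero reward, and that "low‑probability" means a cached prompt is occasionally added back to a training batch. The class name, capacity, and probability are all hypothetical.

```python
import random
from collections import deque
from typing import List, Sequence


class RetryBuffer:
    """Hypothetical cache of prompts whose rollouts all failed."""

    def __init__(self, capacity: int = 1024, retry_prob: float = 0.1, seed: int = 0):
        self.buffer: deque = deque(maxlen=capacity)  # oldest entries drop out first
        self.retry_prob = retry_prob
        self.rng = random.Random(seed)

    def record(self, prompt: str, rewards: Sequence[float]) -> None:
        """Cache the prompt only when every rollout scored zero."""
        if rewards and all(r == 0 for r in rewards):
            self.buffer.append(prompt)

    def next_batch(self, fresh_prompts: Sequence[str]) -> List[str]:
        """Return the fresh prompts, plus one cached prompt with probability retry_prob."""
        batch = list(fresh_prompts)
        if self.buffer and self.rng.random() < self.retry_prob:
            batch.append(self.buffer.popleft())
        return batch


buf = RetryBuffer(retry_prob=0.1)
buf.record("fused_layernorm", [0, 0, 0, 0])  # all rollouts failed: cached
buf.record("vector_add", [0, 1, 0, 1])       # partial success: not cached
print(list(buf.buffer))
```

Keeping total failures out of the immediate update while giving them a small chance of a later retry is one way to avoid both wasting compute on them and discarding them for good.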

Beyond the model itself, the achievement demonstrates that a domestically built GPU cluster can support the full lifecycle of large‑model training, from supervised fine‑tuning to reinforcement learning, challenging the notion that Chinese AI compute is limited to inference. The closed‑loop stack—from hardware to software, training platform to evaluation suite—offers a reusable engineering paradigm for future complex AI research and development.
