LLaDA‑MoE: The First Native MoE Diffusion Language Model Shattering Autoregressive Limits
Ant Group and Renmin University unveiled LLaDA‑MoE, the industry's first natively trained MoE diffusion language model, pretrained on roughly 20T tokens of data. It achieves performance comparable to Qwen2.5 while delivering several‑fold faster inference, and will be fully open‑sourced to accelerate global AI research.
Ant Group and Renmin University jointly developed LLaDA‑MoE, a diffusion language model (dLLM) with a native MoE architecture, training it from scratch on roughly 20T tokens of data. The experiment confirms the scalability and stability of industrial‑scale training: the model outperforms the earlier dense diffusion models LLaDA‑1.0/1.5 and Dream‑7B, matches the capabilities of comparable autoregressive models, and offers multiple times faster inference. It will be fully open‑sourced soon to advance the global AI community.
On September 11 at the 2025 Inclusion·Bund Conference (外滩大会), the release was presented by Li Chongxuan, associate professor at Renmin University's Gaoling AI Institute, and Lan Zhenzhong, director of Ant Group's General AI Research Center.
Using a non‑autoregressive masked diffusion mechanism, LLaDA‑MoE is the first model to achieve language intelligence on par with Qwen2.5—including in‑context learning, instruction following, code generation, and mathematical reasoning—thereby challenging the prevailing belief that large language models must be autoregressive.
Lan Zhenzhong noted that LLaDA‑MoE validates the scalability and stability of industrial‑grade large‑scale training, marking a step forward for expanding dLLM to even larger scales.
Li Chongxuan explained that the dominant autoregressive paradigm limits models to unidirectional token generation, making it hard to capture bidirectional dependencies, a fundamental issue that LLaDA‑MoE addresses.
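To make the contrast concrete, here is a toy sketch of the masked‑diffusion decoding idea: generation starts from a fully masked sequence, and at each step the model fills in several positions in parallel, conditioning on context from both sides rather than only on tokens to the left. The `toy_denoiser` below is a hypothetical stand‑in for the real network (which would return transformer logits); it is not LLaDA‑MoE's actual sampler, just an illustration of the schedule.

```python
MASK = "<mask>"

def toy_denoiser(tokens):
    """Hypothetical stand-in for the dLLM: for every masked position,
    return a filler token and a confidence score, conditioning on BOTH
    neighbors (a real model uses full bidirectional attention)."""
    preds = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            left = tokens[i - 1] if i > 0 else "<bos>"
            right = tokens[i + 1] if i < len(tokens) - 1 else "<eos>"
            # Confidence is higher when a neighbor is already revealed.
            conf = (left != MASK) + (right != MASK)
            preds[i] = (f"tok{i}", conf)
    return preds

def diffusion_decode(length, steps):
    """Start fully masked; each step reveals the most confident masked
    positions in parallel, instead of one token left-to-right."""
    tokens = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in tokens:
        preds = toy_denoiser(tokens)
        ranked = sorted(preds.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _) in ranked[:per_step]:
            tokens[i] = tok
    return tokens

print(diffusion_decode(8, steps=4))
```

Because several positions are committed per step, the sequence finishes in far fewer model calls than left‑to‑right decoding, which is the source of the inference speedups the article describes.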
Motivated by these challenges, the Ant‑Renmin team spent three months rewriting the training code on top of LLaDA‑1.0, leveraging Ant's proprietary distributed framework ATorch for expert‑parallel (EP) acceleration, and building on the Ling2.0 base model. They solved core problems such as load balancing and noise‑sampling drift, ultimately training a 7B‑A1B MoE model (7B total parameters, roughly 1.4B active per token) on the 20T‑token dataset.
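The load‑balancing problem mentioned above arises because a gating network routes each token to only a few experts, and without a counter‑pressure the router collapses onto a handful of them. A common remedy (Switch‑Transformer‑style, not necessarily what LLaDA‑MoE uses) adds an auxiliary loss coupling each expert's mean gate probability with the fraction of tokens routed to it. A minimal NumPy sketch, with illustrative sizes that are not the real model's:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 16, 8, 2   # toy sizes, not LLaDA-MoE's

W_gate = rng.normal(size=(D, N_EXPERTS))           # router weights
W_experts = rng.normal(size=(N_EXPERTS, D, D)) * 0.02  # one FFN stub per expert

def moe_layer(x):
    """Route each token to its top-k experts; also return a
    Switch-style auxiliary load-balancing loss."""
    logits = x @ W_gate                            # (tokens, experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # softmax gate
    top = np.argsort(-probs, axis=-1)[:, :TOP_K]   # top-k experts per token
    out = np.zeros_like(x)
    counts = np.zeros(N_EXPERTS)
    for t in range(x.shape[0]):
        for e in top[t]:
            out[t] += probs[t, e] * (x[t] @ W_experts[e])
            counts[e] += 1
    # Balance loss: mean gate prob per expert x fraction of tokens routed to it.
    frac = counts / counts.sum()
    balance_loss = N_EXPERTS * float((probs.mean(0) * frac).sum())
    return out, balance_loss

x = rng.normal(size=(4, D))
y, loss = moe_layer(x)
print(y.shape, round(loss, 3))
```

Minimizing the balance term pushes routing toward a uniform spread across experts, which is what keeps expert‑parallel training (each expert sharded to its own devices, as with ATorch's EP acceleration) from bottlenecking on a few overloaded shards.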
Under Ant's unified evaluation framework, LLaDA‑MoE improves average performance by 8.4% across 17 benchmarks (HumanEval, MBPP, GSM8K, MATH, IFEval, BFCL, etc.), leads LLaDA‑1.5 by 13.2%, and matches Qwen2.5‑3B‑Instruct, confirming that the "MoE amplifier" effect also holds for diffusion language models and pointing to a viable path toward 10B–100B sparse models.
The model weights and a custom inference framework will be released to the public shortly. In addition, Ant will open‑source an inference engine specially optimized for dLLM parallelism, which delivers significant speedups over NVIDIA’s fast‑dLLM. The code, model, and technical report will be posted on GitHub and Hugging Face.
Looking ahead, Ant plans to continue investing in dLLM‑driven AGI research, collaborating with academia and the global AI community, and asserts that “autoregression is not the end; diffusion models can also become a main pathway to AGI.”
