Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters
Ling-mini-2.0 is an open-source 16B MoE language model that activates only 1.4B parameters per token, reaches dense-level performance with roughly 7× efficiency leverage, generates over 300 tokens/s, and ships with the first open-source FP8 mixed-precision training suite, along with multiple pre-training checkpoints for the community.
Ling-mini-2.0 Overview
We officially open-source Ling 2.0, a Mixture-of-Experts (MoE) large language model series that combines state-of-the-art performance with high efficiency.
Model Specs and Performance
The first released model, Ling-mini-2.0, has 16B total parameters but activates only 1.4B (789M non-embedding) per input token. It was pretrained on more than 20T high-quality tokens and post-trained with multi-stage supervised fine-tuning and reinforcement learning, reaching the level of dense LLMs under 10B and even of larger MoE models.
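As a quick back-of-the-envelope check, the snippet below recomputes the per-token activation fraction implied by these figures; the numbers are the ones quoted above, not an official parameter breakdown.

```python
# Sanity-check the sparsity figures quoted in the release notes.
total_params = 16e9          # total parameters
active_params = 1.4e9        # parameters activated per token
active_non_embed = 0.789e9   # non-embedding activated parameters

print(f"active fraction of total: {active_params / total_params:.1%}")          # ~8.8%
print(f"non-embedding active fraction: {active_non_embed / total_params:.1%}")  # ~4.9%
```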
Strong General and Specialized Reasoning
On challenging coding (LiveCodeBench, CodeForces), mathematics (AIME 2025, HMMT 2025), and multi-domain knowledge tasks (MMLU-Pro, Humanity's Last Exam), Ling-mini-2.0 outperforms dense models under 10B as well as MoE models of comparable scale.
7× Dense-Level Performance Leverage
Guided by the Ling Scaling Laws, Ling-mini-2.0 adopts a 1/32-activation-ratio MoE architecture with carefully calibrated choices for expert granularity, shared-expert ratio, attention ratio, aux-loss-free sigmoid routing for load balance, an MTP (multi-token prediction) layer, QK-Norm, and half RoPE. Together these deliver more than 7× the effective performance of an equivalently activated dense model, roughly matching a 7–8B dense LLM.
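To make the routing description concrete, here is a minimal PyTorch sketch of a sparse MoE layer with sigmoid gating and a routing-only bias for aux-loss-free load balancing. All dimensions, expert counts, and the expert MLP shape are illustrative placeholders, not the released Ling-mini-2.0 configuration.

```python
# Minimal sketch of a sparse MoE layer in the spirit described above:
# sigmoid routing scores, a few experts activated per token, and a
# per-expert bias used only for expert selection (aux-loss-free balancing).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=256, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Bias adjusted outside the gradient path to keep expert load balanced;
        # it affects which experts are picked, not how outputs are weighted.
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))   # sigmoid gate, not softmax
        _, idx = torch.topk(scores + self.route_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, idx)    # combine with unbiased scores
        gates = gates / gates.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```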
300+ token/s High-Speed Generation
Thanks to its highly sparse, small-activation MoE design, Ling-mini-2.0 generates over 300 tokens/s when deployed on H20 GPUs for prompts under 2,000 tokens, more than twice the speed of an 8B dense model. With YaRN extrapolation it supports 128K context, and the relative speedup grows to as much as 7× as output length increases.
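For readers who want to experiment with long-context extrapolation themselves, a YaRN override when loading the model with Hugging Face transformers might look like the sketch below. The repo id, scaling factor, native context length, and the exact rope_scaling keys are assumptions to verify against the model card.

```python
# Illustrative YaRN rope-scaling override via Hugging Face transformers.
# Repo id and all numeric values below are placeholders, not confirmed release values.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "inclusionAI/Ling-mini-2.0"  # assumed repo id
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # assumed extrapolation factor
    "original_max_position_embeddings": 32768,  # assumed native context length
}
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", trust_remote_code=True
)
```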
First Open-Source FP8 Efficient Training Scheme
Ling-mini-2.0 is trained with FP8 mixed precision throughout the pipeline. Compared with BF16, FP8 yields nearly identical loss curves and downstream benchmark results after training on more than 1T tokens. The released FP8 training suite adds tile/blockwise FP8 scaling, an FP8 optimizer, on-demand weight transposition, FP8-padded routing maps, and other memory-saving techniques. On 8/16/32-GPU (80 GB) setups, training throughput improves by 30–60% with MTP enabled and by 90–120% without MTP, surpassing LLaMA 3.1 8B and Qwen3 8B.
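As an illustration of the tile/blockwise scaling idea, the sketch below quantizes a weight matrix to FP8 with one scale per 128×128 tile so each tile fits the representable range of e4m3. The tile size and dtype are assumptions for illustration; the released suite covers much more (FP8 optimizer states, on-demand transposes, padded routing maps, and so on).

```python
# Block-wise FP8 scaling sketch: one scale per 128x128 tile.
# Assumes matrix dimensions are divisible by the block size.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape
    w_fp8 = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    scales = torch.empty(rows // block, cols // block, dtype=torch.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = tile.abs().amax().clamp(min=1e-8) / FP8_MAX
            w_fp8[i:i + block, j:j + block] = (tile / scale).to(torch.float8_e4m3fn)
            scales[i // block, j // block] = scale
    # Dequantize tile (k, l) as w_fp8_tile.float() * scales[k, l].
    return w_fp8, scales

w = torch.randn(256, 256)
w_q, s = quantize_blockwise_fp8(w)
```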
More Open Model Release
We view Ling-mini-2.0 as an ideal starting point for MoE research: it integrates 1/32 sparsity, an MTP layer, and FP8 training in a compact model. In addition to the main model, we release five pre-training checkpoints: Ling-mini-2.0-base and four base checkpoints trained on 5T, 10T, 15T, and 20T tokens.
Visit our open-source repository and HuggingFace Space to download or try the model. Future releases will bring larger, faster, and better language and multimodal models.
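A minimal quick-start with Hugging Face transformers might look like the following; the repo id and chat-template usage are assumptions, so check the model card for the exact identifier and recommended generation settings.

```python
# Quick-start sketch for trying the released model (repo id is assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-mini-2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Briefly explain mixture-of-experts models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```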