Ling-mini-2.0: How a 16B MoE Model Achieves Dense-Level Performance with Only 1.4B Active Parameters
Ling-mini-2.0 is an open-source 16B MoE language model that activates only 1.4B parameters per token, reaches dense-level performance with roughly 7× efficiency leverage, generates over 300 tokens/s, and ships with the first open-source FP8 mixed-precision training suite, along with multiple pre-training checkpoints for the community.
Ling-mini-2.0 Overview
We officially open-source Ling 2.0, a Mixture-of-Experts (MoE) large language model series that combines state-of-the-art performance with high efficiency.
Model Specs and Performance
The first released model, Ling-mini-2.0, has 16B total parameters but activates only 1.4B (789M non-embedding) per input token. It was pretrained on more than 20T high-quality tokens and post-trained with multi-stage supervised fine-tuning and reinforcement learning, reaching the level of dense LLMs under 10B and even of larger MoE models.
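As a quick back-of-the-envelope check, the snippet below recomputes the per-token activation fraction implied by these figures; the numbers are the ones quoted above, not an official parameter breakdown.

```python
# Sanity-check the sparsity figures quoted in the release notes.
total_params = 16e9          # total parameters
active_params = 1.4e9        # parameters activated per token
active_non_embed = 0.789e9   # non-embedding activated parameters

print(f"active fraction of total: {active_params / total_params:.1%}")          # ~8.8%
print(f"non-embedding active fraction: {active_non_embed / total_params:.1%}")  # ~4.9%
```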
Strong General and Specialized Reasoning
On challenging coding (LiveCodeBench, CodeForces), mathematics (AIME 2025, HMMT 2025), and multi-domain knowledge tasks (MMLU-Pro, Humanity's Last Exam), Ling-mini-2.0 outperforms dense models under 10B as well as MoE models of comparable scale.
7× Dense-Level Performance Leverage
Guided by the Ling Scaling Laws, Ling-mini-2.0 adopts a 1/32-activation-ratio MoE architecture with carefully calibrated choices for expert granularity, shared-expert ratio, attention ratio, aux-loss-free sigmoid routing for load balance, an MTP (multi-token prediction) layer, QK-Norm, and half RoPE. Together these deliver more than 7× the effective performance of an equivalently activated dense model, roughly matching a 7–8B dense LLM.
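To make the routing description concrete, here is a minimal PyTorch sketch of a sparse MoE layer with sigmoid gating and a routing-only bias for aux-loss-free load balancing. All dimensions, expert counts, and the expert MLP shape are illustrative placeholders, not the released Ling-mini-2.0 configuration.

```python
# Minimal sketch of a sparse MoE layer in the spirit described above:
# sigmoid routing scores, a few experts activated per token, and a
# per-expert bias used only for expert selection (aux-loss-free balancing).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=256, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Bias adjusted outside the gradient path to keep expert load balanced;
        # it affects which experts are picked, not how outputs are weighted.
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))   # sigmoid gate, not softmax
        _, idx = torch.topk(scores + self.route_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, idx)    # combine with unbiased scores
        gates = gates / gates.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```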
300+ token/s High-Speed Generation
Thanks to its highly sparse, small-activation MoE design, Ling-mini-2.0 generates over 300 tokens/s when deployed on H20 GPUs for prompts under 2,000 tokens, more than twice the speed of an 8B dense model. With YaRN extrapolation it supports 128K context, and the relative speedup grows to as much as 7× as output length increases.
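For readers who want to experiment with long-context extrapolation themselves, a YaRN override when loading the model with Hugging Face transformers might look like the sketch below. The repo id, scaling factor, native context length, and the exact rope_scaling keys are assumptions to verify against the model card.

```python
# Illustrative YaRN rope-scaling override via Hugging Face transformers.
# Repo id and all numeric values below are placeholders, not confirmed release values.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "inclusionAI/Ling-mini-2.0"  # assumed repo id
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # assumed extrapolation factor
    "original_max_position_embeddings": 32768,  # assumed native context length
}
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", trust_remote_code=True
)
```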
First Open-Source FP8 Efficient Training Scheme
Ling-mini-2.0 is trained with FP8 mixed precision throughout the pipeline. Compared with BF16, FP8 yields nearly identical loss curves and downstream benchmark results after training on more than 1T tokens. The released FP8 training suite adds tile/blockwise FP8 scaling, an FP8 optimizer, on-demand weight transposition, FP8-padded routing maps, and other memory-saving techniques. On 8/16/32-GPU (80 GB) setups, training throughput improves by 30–60% with MTP enabled and by 90–120% without MTP, surpassing LLaMA 3.1 8B and Qwen3 8B.
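As an illustration of the tile/blockwise scaling idea, the sketch below quantizes a weight matrix to FP8 with one scale per 128×128 tile so each tile fits the representable range of e4m3. The tile size and dtype are assumptions for illustration; the released suite covers much more (FP8 optimizer states, on-demand transposes, padded routing maps, and so on).

```python
# Block-wise FP8 scaling sketch: one scale per 128x128 tile.
# Assumes matrix dimensions are divisible by the block size.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape
    w_fp8 = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    scales = torch.empty(rows // block, cols // block, dtype=torch.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = tile.abs().amax().clamp(min=1e-8) / FP8_MAX
            w_fp8[i:i + block, j:j + block] = (tile / scale).to(torch.float8_e4m3fn)
            scales[i // block, j // block] = scale
    # Dequantize tile (k, l) as w_fp8_tile.float() * scales[k, l].
    return w_fp8, scales

w = torch.randn(256, 256)
w_q, s = quantize_blockwise_fp8(w)
```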
More Open Model Release
We view Ling-mini-2.0 as an ideal starting point for MoE research: it integrates 1/32 sparsity, an MTP layer, and FP8 training in a compact model. In addition to the main model, we release five pre-training checkpoints: Ling-mini-2.0-base and four base checkpoints trained on 5T, 10T, 15T, and 20T tokens.
Visit our open-source repository and HuggingFace Space to download or try the model. Future releases will bring larger, faster, and better language and multimodal models.
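A minimal quick-start with Hugging Face transformers might look like the following; the repo id and chat-template usage are assumptions, so check the model card for the exact identifier and recommended generation settings.

```python
# Quick-start sketch for trying the released model (repo id is assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-mini-2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Briefly explain mixture-of-experts models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```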