Ring-mini-2.0: How a 16B MoE Model Delivers 128K Context and 500+ Tokens/s
Ring-mini-2.0 is a high‑performance reasoning MoE model that activates only 1.4 B of its 16 B total parameters, matching the quality of dense models under 10 B while supporting a 128 K context length and generation speeds of 300+ tokens/s, rising to 500+ tokens/s with Expert Dual Streaming.
Today we officially release Ring-mini-2.0 – a high‑performance reasoning MoE model built on the Ling-mini-2.0 architecture. With 16 B total parameters but only 1.4 B active at inference, it matches the comprehensive reasoning ability of dense models under 10 B, excelling at logical reasoning, code, and mathematics, while supporting 128 K long context and generation speeds of 300+ tokens/s.
Reinforced Reasoning: Stable Large‑Scale RL Training
Ring-mini-2.0 continues training from the Ling-mini-2.0 base model using Long‑CoT SFT, a more stable large‑scale RLVR (reinforcement learning with verifiable rewards) stage, and joint RLHF optimization, markedly improving the stability and generalization of complex reasoning. On challenging benchmarks such as LiveCodeBench, AIME 2025, GPQA, and ARC‑AGI‑v1, it outperforms dense models under 10 B and rivals larger MoE models like gpt‑oss‑20B‑medium, with a particular edge in logical reasoning.
Highly Sparse, High‑Speed Generation
Leveraging the efficient MoE design of the Ling 2.0 series, Ring-mini-2.0 activates only 1.4 B parameters via a 1/32 expert activation ratio plus MTP (multi‑token prediction) layer optimizations, delivering performance comparable to a 7–8 B dense model. Deployed on NVIDIA H20 GPUs, it achieves throughput of 300+ tokens/s, which Expert Dual Streaming boosts to 500+ tokens/s, dramatically lowering inference cost in high‑concurrency scenarios. With YaRN extrapolation it supports a 128 K context window, delivering up to 7× acceleration on long‑output tasks.
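To make the sparsity claim concrete, here is a minimal sketch of top‑k expert routing, the mechanism behind the 1/32 activation ratio. This is an illustration only, not the Ling 2.0 architecture: it assumes, hypothetically, 32 toy linear experts with top‑1 routing (so exactly 1/32 of expert parameters run per token), whereas the real model uses full FFN experts and its own router, expert counts, and normalization.

```python
import math
import random

random.seed(0)

DIM = 8           # toy hidden size
NUM_EXPERTS = 32  # hypothetical: 32 experts ...
TOP_K = 1         # ... with top-1 routing -> 1/32 activation ratio

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

# Router weights plus one weight matrix per expert
# (toy linear experts stand in for full FFN experts).
router_w = rand_matrix(NUM_EXPERTS, DIM)
expert_w = [rand_matrix(DIM, DIM) for _ in range(NUM_EXPERTS)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x):
    """Route token x to its top-k experts; only those experts execute."""
    probs = softmax(matvec(router_w, x))          # one routing score per expert
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)             # renormalize over selected experts
    out = [0.0] * DIM
    for i in top:                                 # only TOP_K of NUM_EXPERTS run
        y = matvec(expert_w[i], x)
        gate = probs[i] / norm
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

x = [random.gauss(0, 1) for _ in range(DIM)]
y, active = moe_forward(x)
print(f"active experts: {active} "
      f"({len(active)}/{NUM_EXPERTS} = {len(active)/NUM_EXPERTS:.0%} of experts)")
```

The key property is that compute per token scales with `TOP_K`, not `NUM_EXPERTS`: total parameters grow with the expert count while per-token FLOPs stay near those of a small dense model, which is how a 16 B model can run with 1.4 B active parameters.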
Full Open Source: Model Weights, Training Strategy, and Data
We are releasing the complete model weights, training data, and RLVR+RLHF training strategy for Ring-mini-2.0. Its "small‑but‑excellent" characteristics make it an ideal starting point for both academic research and industrial applications.
Visit our open‑source repositories to download and use the model. Under the Ling 2.0 architecture we will continue to release larger, faster, and better language and multimodal models.
HuggingFace: https://huggingface.co/inclusionAI/Ring-mini-2.0
ModelScope: https://modelscope.cn/models/inclusionAI/Ring-mini-2.0
