How Xiaomi’s MiMo‑V2.5‑Pro‑UltraSpeed Achieves 1 T‑Parameter, 1000 Tokens/s Generation

Xiaomi’s MiMo‑V2.5‑Pro‑UltraSpeed is a 1‑trillion‑parameter model that generates over 1000 tokens per second on a standard 8‑GPU server. It gets there by combining FP4 quantization, a Mixture‑of‑Experts (MoE) architecture, DFlash decoding and TileRT’s custom execution engine, challenging the assumption that this kind of speed requires dedicated ASICs.

SuanNi

Xiaomi’s MiMo team, in collaboration with TileRT, announced MiMo‑V2.5‑Pro‑UltraSpeed, a 1‑trillion‑parameter model that reaches 1000‑1200 tokens/s on a conventional 8‑GPU server, challenging the common belief that such throughput requires specialized hardware.

Industry Context

Typical ultra‑fast inference solutions, such as Cerebras’s wafer‑scale chips or Groq’s SRAM‑centric designs, rely on custom silicon. Xiaomi instead pursues a software‑centric path, achieving comparable speed on commodity GPUs through deep co‑design of the model and the runtime.

Model Quantization and Architecture

The model uses MXFP4, a 4‑bit microscaling floating‑point format that roughly halves weight memory footprint and bandwidth relative to an 8‑bit format. Only the Mixture‑of‑Experts (MoE) expert weights are quantized; attention, normalization and other critical layers retain full precision. Quantization‑aware training (FP4 QAT) simulates the low‑bit rounding error during training, keeping overall capability close to that of the original version.
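To make the 4‑bit format concrete, here is a minimal sketch of block quantization in the style of MXFP4: 32 elements share one power‑of‑two scale, and each element is rounded to the nearest FP4 (E2M1) magnitude. It is an illustration only, not Xiaomi's or TileRT's implementation; the function names are hypothetical, and the nearest‑value rounding and saturation rules are assumptions.

```python
import math

# Magnitudes representable by the FP4 E2M1 element type used in MXFP4.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32          # elements that share one scale in the MX formats
SCALE_BITS = 8      # one 8-bit power-of-two scale per block


def quantize_block(block):
    """Return (shared_exponent, fp4_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Choose a power-of-two scale so the largest magnitude lands in the
    # top binade of E2M1 (values 4 and 6, whose exponent is 2).
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    values = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # saturate at the largest FP4 value
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        values.append(math.copysign(nearest, x))
    return exp, values


def dequantize_block(exp, values):
    """Reconstruct approximate floats from the shared exponent and FP4 values."""
    return [v * 2.0 ** exp for v in values]


def effective_bits_per_weight():
    """Storage cost per weight, including the amortized shared scale."""
    return (4 * BLOCK + SCALE_BITS) / BLOCK


if __name__ == "__main__":
    exp, q = quantize_block([1.0, -0.5, 0.25, 0.0])
    print(exp, q, dequantize_block(exp, q))
    print(effective_bits_per_weight())
```

Because the scale is shared, the true cost is slightly above 4 bits per weight, and values far smaller than the block maximum lose the most precision. That sensitivity is one reason a QAT step is useful.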

Decoding Innovation – DFlash

MiMo replaces the traditional serial (autoregressive) draft model with DFlash, a block‑masked parallel predictor. Instead of generating draft tokens one by one, DFlash fills an entire masked block in a single forward pass, removing the serial bottleneck on the drafting side of speculative decoding. The draft model uses Sliding Window Attention (SWA), matching the MiMo‑V2 series, so the cost of each prediction is bounded by the window size rather than growing with context length.
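The draft‑and‑verify loop that a block drafter plugs into can be sketched as follows. This is a toy under stated assumptions, not DFlash itself: `target_next` and `draft_block` are hypothetical stand‑ins (a deterministic toy "model" and a deliberately imperfect drafter), and verification uses greedy matching. In a real system the drafter fills the block in one forward pass and the target checks all positions in one pass; the Python loops here only imitate that.

```python
VOCAB = 50


def target_next(prefix):
    """Toy stand-in for the large model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix, block_size):
    """Toy stand-in for a block drafter that proposes block_size tokens.

    It mostly agrees with the target but guesses wrong at every 7th
    sequence position, so some drafts get rejected.
    """
    ctx = list(prefix)
    out = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if (len(ctx) + 1) % 7 == 0:
            tok = (tok + 1) % VOCAB  # inject a wrong guess
        out.append(tok)
        ctx.append(tok)
    return out


def verify(prefix, draft):
    """Keep the longest draft prefix the target agrees with, then append
    the target's own token (a correction, or a bonus if all matched)."""
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt, n_tokens, block_size=8):
    """Generate n_tokens and report how many verification rounds were used."""
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n_tokens:
        seq.extend(verify(seq, draft_block(seq, block_size)))
        rounds += 1
    return seq[:len(prompt) + n_tokens], rounds


def greedy(prompt, n_tokens):
    """Reference: plain one-token-at-a-time decoding with the target."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq


if __name__ == "__main__":
    out, rounds = generate([1, 2, 3], 40)
    print(out == greedy([1, 2, 3], 40), rounds)
```

The property worth noticing is that the output matches plain greedy decoding token for token, while the number of target verification rounds is much smaller than the number of tokens produced.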

The draft block size is capped at 8, which lowers verification overhead and raises concurrency. In coding scenarios the average acceptance length reaches 6.30 (up to 7.14), meaning 6 to 7 of the 8 draft tokens are accepted per verification round. General dialogue still lags behind and is under active optimisation.
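A quick back‑of‑the‑envelope calculation shows what these figures imply for the target model. The sketch below assumes that the reported acceptance length equals the number of tokens emitted per verification round; the article does not say whether a bonus token is counted, so treat the results as approximate. The function names are hypothetical.

```python
def accepted_fraction(acceptance_length, block_size):
    """Share of the draft block that survives verification."""
    return acceptance_length / block_size


def rounds_per_second(tokens_per_second, tokens_per_round):
    """Verification rounds the target model must complete each second."""
    return tokens_per_second / tokens_per_round


def round_budget_ms(tokens_per_second, tokens_per_round):
    """Wall-clock time available for one draft-plus-verify round."""
    return 1000.0 * tokens_per_round / tokens_per_second


if __name__ == "__main__":
    print(accepted_fraction(6.30, 8))       # about 0.79 of the block
    print(rounds_per_second(1000, 6.30))    # about 159 rounds per second
    print(round_budget_ms(1000, 6.30))      # about 6.3 ms per round
```

Under that assumption, reaching 1000 tokens/s in coding workloads leaves roughly 6.3 ms for each round, compared with 1 ms per step if every token needed its own pass through the large model.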

TileRT Execution Engine

TileRT introduces a Persistent Engine Kernel that eliminates per‑operator launch latency, keeping the entire compute pipeline resident on the GPU. Warp Specialization further splits communication, data movement and tensor computation across warps, turning the GPU into a continuously flowing heterogeneous execution system and eradicating microsecond‑scale execution gaps.
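A rough model shows why launch latency matters at this speed. All inputs below are illustrative assumptions, not measurements from TileRT: 600 kernel launches per forward pass, 5 microseconds of overhead per launch, and a budget of 6.3 ms per pass (1000 tokens/s at about 6.3 tokens per verification round). The function name is hypothetical.

```python
def launch_overhead_fraction(launches_per_pass, launch_us, pass_budget_ms):
    """Fraction of the per-pass time budget spent on kernel launches."""
    overhead_ms = launches_per_pass * launch_us / 1000.0
    return overhead_ms / pass_budget_ms


if __name__ == "__main__":
    # Assumed: one launch per operator, 600 operators per pass.
    per_operator = launch_overhead_fraction(600, 5.0, 6.3)
    # Assumed: a single persistent kernel launched once per pass.
    persistent = launch_overhead_fraction(1, 5.0, 6.3)
    print(f"per-operator launches: {per_operator:.1%} of the budget")
    print(f"persistent kernel:     {persistent:.2%} of the budget")
```

With these assumed numbers, per‑operator launches would consume nearly half of the time budget before any useful work is done, which is the kind of gap a resident, persistent kernel is meant to remove.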

Impact on Real‑Time Applications

At 1000 tokens/s the time budget per generated token is about one millisecond, so each operator’s lifetime shrinks to microseconds. This enables millisecond‑level response for time‑sensitive tasks such as high‑frequency trading signal generation, real‑time fraud interception, intelligent bidding and interactive dialogue. Coding agents benefit dramatically: a full module can be generated and verified in seconds, reducing developer wait time from minutes to seconds. The authors cite examples like building a Snake game in 10 seconds and replicating a macOS UI in one minute.

Availability

The MiMo‑V2.5‑Pro‑UltraSpeed model, together with FP4 weights and DFlash parameters, is open‑sourced on Hugging Face, and the UltraSpeed variant is slated for future release.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
