How Xiaomi’s MiMo‑V2.5‑Pro UltraSpeed Achieves 1 T‑Parameter, 1000 Tokens/s Generation
Xiaomi’s MiMo‑V2.5‑Pro UltraSpeed delivers a 1‑trillion‑parameter model that generates over 1000 tokens per second on a standard 8‑GPU server by combining FP4 quantization, MoE architecture, DFlash decoding and TileRT’s custom execution engine, challenging the need for dedicated ASICs.
Xiaomi’s MiMo team, in collaboration with TileRT, announced MiMo‑V2.5‑Pro‑UltraSpeed, a 1‑trillion‑parameter model that reaches 1000‑1200 tokens/s on a conventional 8‑GPU server, challenging the common belief that such throughput requires specialized hardware.
Industry Context
Typical ultra‑fast inference solutions, such as Cerebras’s wafer‑scale chips or Groq’s SRAM‑centric designs, rely on custom silicon. Xiaomi instead pursues a software‑centric path, achieving comparable speed through deep co‑design of model and runtime on commodity GPUs.
Model Quantization and Architecture
The model uses the MXFP4 format, a 4‑bit floating‑point quantization that roughly halves memory footprint and bandwidth relative to 8‑bit weights. Only the Mixture‑of‑Experts (MoE) experts are quantized; attention, normalization and other critical layers retain full precision. Quantization‑aware training (FP4 QAT) simulates the low‑bit loss during training, keeping overall capability close to the original version.
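To make the format concrete, here is a minimal Python sketch of MXFP4-style fake quantization. It follows the public Microscaling (MX) convention of 32-element blocks that share one power-of-two scale, with each element stored as FP4 (E2M1). It is an illustration, not Xiaomi's or TileRT's implementation: the names mxfp4_quantize and bits_per_weight are invented for this example, rounding ties are handled in a simplified way, and the source does not describe how MiMo's QAT applies its quantizer.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (a separate sign bit covers negatives).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32      # elements sharing one scale in the MX formats
E2M1_EMAX = 2   # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def mxfp4_quantize(x):
    """Fake-quantize a 1-D array whose length is a multiple of BLOCK.

    Returns (dequantized values, per-block power-of-two exponents).
    Hypothetical helper for illustration only.
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Pick a power-of-two scale so the block maximum lands near the top
    # of the E2M1 range; all-zero blocks simply use exponent 0.
    safe = np.where(amax > 0, amax, 1.0)
    exp = np.where(amax > 0, np.floor(np.log2(safe)) - E2M1_EMAX, 0.0)
    scale = 2.0 ** exp
    scaled = blocks / scale
    # Snap each magnitude to the nearest grid point; values above 6 saturate.
    # Assumption: ties go to the smaller grid point, unlike round-to-even.
    mag = np.clip(np.abs(scaled), 0.0, E2M1_GRID[-1])
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * E2M1_GRID[idx] * scale
    return deq.reshape(-1), exp.reshape(-1).astype(int)


def bits_per_weight():
    """4 bits per element plus one 8-bit scale amortized over the block."""
    return 4 + 8 / BLOCK
```

Under these assumptions the storage cost works out to 4.25 bits per quantized weight, which is why the saving against 8-bit weights is close to, but not exactly, one half. A fake-quantization step of this kind is what quantization-aware training commonly inserts into the forward pass so the model learns to tolerate the rounding error.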
Decoding Innovation – DFlash
MiMo replaces the traditional serial draft model with DFlash, a block‑masked parallel predictor. Instead of generating draft tokens one by one, DFlash fills an entire masked block in a single forward pass, removing the serial bottleneck of speculative decoding. The draft model employs Sliding Window Attention (SWA) that aligns with the MiMo‑V2 series, reducing per‑prediction cost to a constant independent of context length.
Block size is limited to 8, which lowers verification overhead and raises concurrency. In coding scenarios the average acceptance length reaches 6.30 (peaking at 7.14), meaning roughly 6 to 7 of the 8 draft tokens are accepted per verification round. General dialogue still lags behind and is under active optimization.
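The verification round that produces an "acceptance length" can be sketched as follows. This is the standard greedy rule for speculative decoding, shown on a toy model; it is not DFlash itself, whose drafting network is not reproduced here. The names verify_block, toy_target and mean_accept_length are hypothetical, and the source does not say whether its acceptance length counts the target's correction token, as this sketch does.

```python
from typing import Callable, List, Sequence

BLOCK_SIZE = 8  # draft block length quoted in the article


def verify_block(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens committed in one verification round.

    Greedy rule: keep draft tokens while they match the target's own
    choice, then append the target's token at the first disagreement.
    A real engine scores every position in one batched forward pass;
    this loop queries the toy target per position for clarity.
    """
    committed: List[int] = []
    for proposed in draft:
        expected = target_next(list(context) + committed)
        if proposed != expected:
            committed.append(expected)  # target's correction ends the round
            return committed
        committed.append(proposed)
    return committed


def toy_target(tokens: Sequence[int]) -> int:
    """Hypothetical target model: next token is (sum of tokens) mod 10."""
    return sum(tokens) % 10


def mean_accept_length(rounds: Sequence[Sequence[int]]) -> float:
    """Average number of tokens committed per verification round."""
    return sum(len(r) for r in rounds) / len(rounds)
```

Because the whole block is drafted in one pass and checked in one pass, each round costs roughly two model invocations regardless of how many of the 8 tokens survive, so a higher mean acceptance length translates directly into more tokens per unit of compute.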
TileRT Execution Engine
TileRT introduces a Persistent Engine Kernel that eliminates per‑operator launch latency, keeping the entire compute pipeline resident on the GPU. Warp Specialization further splits communication, data movement and tensor computation across warps, turning the GPU into a continuously flowing heterogeneous execution system and eradicating microsecond‑scale execution gaps.
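A back-of-the-envelope calculation shows why microsecond gaps matter at this speed. The sketch below assumes a single generation stream and treats the quoted acceptance length as the number of tokens committed per round; the kernel count and per-launch gap in the usage note are hypothetical placeholders, not measurements from TileRT.

```python
def round_budget_ms(tokens_per_s: float, tokens_per_round: float) -> float:
    """Wall-clock time available for one draft-plus-verify round."""
    return 1000.0 * tokens_per_round / tokens_per_s


def launch_gap_share(kernels_per_round: int, gap_us: float, budget_ms: float) -> float:
    """Fraction of the round budget spent idle between kernel launches.

    kernels_per_round and gap_us are illustrative inputs supplied by the
    caller; the article does not publish either figure.
    """
    return (kernels_per_round * gap_us / 1000.0) / budget_ms
```

With the article's figures of 1000 tokens/s and 6.30 tokens per round, round_budget_ms gives about 6.3 ms per round. If a conventional runtime issued, say, 1000 separate kernels per round with a 3 microsecond gap each (both numbers invented for illustration), nearly half of that budget would be spent waiting between launches, which is the overhead a persistent kernel is meant to remove.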
Impact on Real‑Time Applications
At 1000 tokens/s, each operator’s lifetime shrinks to microseconds, enabling millisecond‑level response for time‑sensitive tasks such as high‑frequency trading signal generation, real‑time fraud interception, intelligent bidding and interactive dialogue. Coding agents benefit dramatically: a full module can be generated and verified in seconds, reducing developer wait time from minutes to seconds. The authors cite examples like building a Snake game in 10 seconds and replicating a macOS UI in one minute.
Availability
The FP4 weights and DFlash parameters of MiMo‑V2.5‑Pro‑UltraSpeed are open‑sourced on Hugging Face, while the UltraSpeed variant itself is described as slated for a future release.
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
