Xiaomi MiMo-V2-Flash Open‑Source: Ultra‑Efficient Inference and Agent‑Ready Model

Xiaomi's MiMo-V2-Flash is a fully open-source 309B-parameter MoE model with hybrid attention and Multi-Token Prediction (MTP) acceleration. It ranks in the top two among open-source models worldwide on several agent benchmarks, generates roughly 2× faster than a comparable closed-source model, and costs about 2.5% as much for inference.

Xiaomi Tech

Model Overview

Xiaomi MiMo-V2-Flash is a Mixture-of-Experts model with 309 billion total parameters, of which 15 billion are active, built for extreme inference efficiency. It ranks among the top two open-source models worldwide on several agent evaluation benchmarks.

Code Capability and Cost

The model’s code generation ability surpasses all open‑source alternatives and rivals the closed‑source Claude 4.5 Sonnet, yet its inference price is only 2.5% of Claude’s and its generation speed is roughly double.

Inference Cost and Speed

API pricing is 0.7 CNY per million input tokens and 2.1 CNY per million output tokens. Comparative charts show MiMo‑V2‑Flash achieving lower cost and higher throughput than other leading models.
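The listed prices translate into per-request cost with simple arithmetic. The sketch below is illustrative only: the two price constants come from the figures above, while the function name and the example token counts are made up for the illustration.

```python
# Prices quoted in the article, in CNY per one million tokens.
INPUT_CNY_PER_M = 0.7
OUTPUT_CNY_PER_M = 2.1


def request_cost_cny(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in CNY of one request at the quoted prices."""
    total = input_tokens * INPUT_CNY_PER_M + output_tokens * OUTPUT_CNY_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent call: a long 200K-token prompt, 20K generated tokens.
    cost = request_cost_cny(200_000, 20_000)
    print(f"{cost:.3f} CNY")
```

Because output tokens cost three times as much as input tokens, long generations dominate the bill more quickly than long prompts do.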

Hybrid Attention Architecture

The core structure interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers at a 5:1 ratio, with a 128-token window for SWA. Training starts from a native 32K context and extends it to 256K. Experiments indicate that SWA is simple and efficient, and that its fixed-size KV cache integrates easily with existing training and inference frameworks.
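To see why the fixed-size SWA cache matters at long context, the following back-of-the-envelope sketch compares cached token slots in one 5:1 group of layers against six full-attention layers. It assumes every layer stores the same number of bytes per cached token and ignores any implementation details not stated above; the function names are hypothetical.

```python
def hybrid_cached_slots(context_len: int, window: int = 128,
                        swa_layers: int = 5, ga_layers: int = 1) -> int:
    """Token slots cached by one group of SWA and GA layers."""
    swa = swa_layers * min(context_len, window)  # SWA keeps at most `window` tokens
    ga = ga_layers * context_len                 # GA keeps the whole context
    return swa + ga


def kv_fraction_vs_full(context_len: int, window: int = 128,
                        swa_layers: int = 5, ga_layers: int = 1) -> float:
    """Hybrid cache size as a fraction of an all-global-attention stack."""
    full = (swa_layers + ga_layers) * context_len
    return hybrid_cached_slots(context_len, window, swa_layers, ga_layers) / full


if __name__ == "__main__":
    for n in (32 * 1024, 256 * 1024):
        print(n, round(kv_fraction_vs_full(n), 3))
```

Under these assumptions the hybrid cache approaches one sixth of the full-attention cache as the context grows, because only the single GA layer in each group scales with context length.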

MTP Inference Acceleration

Multi-Token Prediction (MTP) is used during training to strengthen the base model. At inference, the tokens drafted by the MTP layers are verified in parallel, which eases the memory-bandwidth bottleneck of conventional one-token-at-a-time decoding. With three MTP layers, the model reaches an average acceptance length of 2.8–3.6 tokens and a measured speedup of 2.0–2.6×.
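Parallel verification of drafted tokens can be illustrated with the generic greedy accept/reject rule used in speculative decoding. This is a minimal sketch under that assumption, not Xiaomi's implementation: `verify_draft` and its inputs are hypothetical, and the token IDs are toy values.

```python
from typing import List


def verify_draft(draft: List[int], target: List[int]) -> List[int]:
    """Return the tokens emitted in one decoding step.

    draft:  k tokens proposed by the draft (MTP) layers.
    target: k + 1 greedy tokens from the main model, computed in a single
            parallel pass; target[i] is its choice given prefix + draft[:i].
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain exactly one more token than draft")
    emitted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            emitted.append(target[i])  # first mismatch: take the main model's token
            return emitted
        emitted.append(tok)            # draft token confirmed
    emitted.append(target[-1])         # all drafts accepted: one bonus token
    return emitted


if __name__ == "__main__":
    print(verify_draft([5, 9, 2], [5, 9, 7, 1]))  # third draft token rejected
    print(verify_draft([5, 9, 2], [5, 9, 2, 4]))  # every draft token accepted
```

With k draft tokens, each step emits between 1 and k + 1 tokens for a single pass of the main model, which is where the speedup over one-token-per-pass decoding comes from.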

Overall Performance Gains

Because the architecture is deeply integrated with the training and inference infrastructure, batch size and MTP layer count can be tuned to maximize GPU utilization, delivering higher throughput and lower latency across hardware platforms.

Reinforcement Learning Suitability

MiMo-V2-Flash is well suited to efficient RL training. It supports small-batch on-policy RL while mitigating the GPU idle time caused by long-tail samples. MTP's token-level parallelism makes small-batch on-policy RL both stable and efficient, and it improves the efficiency of attention and feed-forward computation in the later stages of decoding.

New Post-Training Paradigm: MOPD

The authors propose Multi-Teacher On-Policy Distillation (MOPD) to scale RL computation in the post-training phase. MOPD is on-policy: the student samples from its own policy and is optimized with dense, token-level rewards from multiple teachers. It trains stably with less than 1/50 of the compute required by traditional SFT+RL pipelines while matching the teachers' peak performance. The decoupled design makes it easy to add new teachers and outcome-reward models, enabling a self-reinforcing loop in which distilled students become stronger teachers.
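One way to picture dense, token-level rewards from multiple teachers is sketched below. The reward form is an assumption borrowed from common on-policy distillation setups (teacher log-probability minus student log-probability on tokens the student sampled); the article does not give MOPD's exact formula, and the domain routing, function name, and numbers are hypothetical.

```python
from typing import Dict, List


def token_rewards(domain: str,
                  student_logprobs: List[float],
                  teacher_logprobs: Dict[str, List[float]]) -> List[float]:
    """Per-token rewards for one sequence sampled by the student.

    student_logprobs: student's log-probability of each token it sampled.
    teacher_logprobs: for each domain, that teacher's log-probability of
                      the same tokens (assumed reward: teacher minus student).
    """
    teacher = teacher_logprobs[domain]  # pick the teacher for this sample's domain
    if len(teacher) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for s, t in zip(student_logprobs, teacher)]


if __name__ == "__main__":
    student = [-1.0, -2.0, -0.5]
    teachers = {
        "code": [-0.5, -3.0, -0.5],  # hypothetical code teacher
        "math": [-1.5, -1.0, -0.25],  # hypothetical math teacher
    }
    print(token_rewards("code", student, teachers))
    print(token_rewards("math", student, teachers))
```

In this sketch every sampled token receives its own signal, in contrast to an outcome reward that scores the sequence once at the end, and adding a teacher only means adding another entry to the dictionary.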

Open-Source Release

Model weights and inference code are released on HuggingFace under the MIT license, and the technical report is available on GitHub. The inference code is also available in SGLang. Community single-node benchmarks report a prefill throughput of about 50,000 tokens/s across various context lengths. With three MTP layers at a 16K context length, aggregate decode throughput reaches 5,000–15,000 tokens/s while per-request throughput stays between 115 and 151 tokens/s.
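The aggregate and per-request decode figures together imply a rough number of requests served concurrently. The calculation below is only a sketch: it assumes the lowest aggregate throughput pairs with the highest per-request rate and vice versa, which the article does not state explicitly, and the function name is hypothetical.

```python
def implied_concurrency(aggregate_tps: float, per_request_tps: float) -> float:
    """Estimate concurrent requests as aggregate tokens/s over per-request tokens/s."""
    if per_request_tps <= 0:
        raise ValueError("per_request_tps must be positive")
    return aggregate_tps / per_request_tps


if __name__ == "__main__":
    # Assumed pairings of the reported endpoints (not stated in the article).
    low_load = implied_concurrency(5_000, 151)
    high_load = implied_concurrency(15_000, 115)
    print(f"low load:  ~{low_load:.0f} concurrent requests")
    print(f"high load: ~{high_load:.0f} concurrent requests")
```

Under that pairing assumption, the node moves from a few dozen to over a hundred concurrent requests while each request loses only about a quarter of its decode speed.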

API and Demo

The API is temporarily free, and a web demo is live. Users can access the platform at platform.xiaomimimo.com and try the model via MiMo Studio Web (aistudio.xiaomimimo.com).
