Xiaomi’s MiMo‑V2.5: Halving Cost, Doubling Efficiency with a New Multimodal LLM

Xiaomi unveiled the MiMo-V2.5 and MiMo-V2.5-Pro large language models, highlighting up to 50% lower API cost, native multimodal perception, token-efficiency gains, benchmark parity with Claude Opus 4.6 and GPT-5.4, and real-world demos in which the model built a full compiler in 4.3 hours and a video-editing web app in 11.5 hours.


Xiaomi announced the MiMo‑V2.5 series of large language models, comprising the standard MiMo‑V2.5 and the higher‑capacity MiMo‑V2.5‑Pro, and confirmed that both models will be released as open‑source globally.

Long‑Task Capability

MiMo‑V2.5‑Pro is described as Xiaomi’s most powerful model to date, matching top‑tier models such as Claude Opus 4.6 and GPT‑5.4 in general‑agent ability, complex software‑engineering tasks, and long‑running tasks.

In a benchmark from Peking University’s "Compiler Principles" course, students normally need several weeks to implement a full SysY compiler in Rust. MiMo‑V2.5‑Pro completed the same project in 4.3 hours, invoking tools 672 times and achieving a perfect 233‑point score on the hidden test set.

The model first constructed a complete pipeline skeleton, then tackled the core modules layer by layer. It earned full marks on Koopa IR code generation (110 points), RISC-V backend generation (103 points), and performance optimisation (20 points). The initial compilation success rate reached 59%, and when a regression appeared at iteration 512, the model automatically diagnosed the fault and restored the code.
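
To make that structure concrete, here is a minimal sketch of such a staged pipeline in Rust, assuming a conventional lex → parse → IR → codegen layout; every type and function name below is an illustrative placeholder, not code produced by MiMo-V2.5-Pro.

```rust
// Illustrative skeleton of a SysY compiler pipeline in Rust.
// All names are hypothetical placeholders, not MiMo's actual output.

struct Token;   // a lexical token (placeholder)
struct Ast;     // the abstract syntax tree (placeholder)
struct KoopaIr; // the Koopa intermediate representation (placeholder)

fn lex(_source: &str) -> Vec<Token> {
    Vec::new() // tokenize SysY source text (stub)
}

fn parse(_tokens: &[Token]) -> Ast {
    Ast // build the syntax tree from the token stream (stub)
}

fn lower_to_koopa(_ast: &Ast) -> KoopaIr {
    KoopaIr // generate Koopa IR from the AST (the 110-point stage, stubbed)
}

fn emit_riscv(_ir: &KoopaIr) -> String {
    String::new() // generate RISC-V assembly from Koopa IR (the 103-point stage, stubbed)
}

fn compile(source: &str) -> String {
    // End-to-end skeleton: each stage can be filled in and tested
    // independently, which is the "layer by layer" strategy described above.
    let tokens = lex(source);
    let ast = parse(&tokens);
    let ir = lower_to_koopa(&ast);
    emit_riscv(&ir)
}

fn main() {
    let asm = compile("int main() { return 0; }");
    println!("{} bytes of RISC-V assembly", asm.len());
}
```

Keeping each stage behind a narrow interface like this is what makes the layer-by-layer approach workable: one module can be completed and verified against the grader while the others remain stubs.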

Given the simple prompt “build a video-editing web application”, MiMo-V2.5-Pro worked autonomously for 11.5 hours, performed 1,868 tool calls, and produced a fully functional web app of 8,192 lines of code, including multi-track timelines, clip trimming, cross-fades, audio mixing, and export functionality.
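
For a rough idea of the state such an editor has to manage, here is a minimal Rust sketch of one plausible data model for a multi-track timeline (Rust is used only for consistency with the compiler example above; the generated app is a web application, and all names and fields here are assumptions, not taken from its code).

```rust
/// A clip on a track: the source media trimmed to [in_point, out_point),
/// placed at `start` on the timeline (all times in milliseconds).
struct Clip {
    source: String, // path or URL of the media asset
    start: u64,     // position on the timeline
    in_point: u64,  // trim start within the source
    out_point: u64, // trim end within the source
    fade_ms: u64,   // cross-fade duration into the following clip
}

enum TrackKind {
    Video,
    Audio,
}

/// A single track holding an ordered list of clips.
struct Track {
    kind: TrackKind,
    gain: f32, // per-track level applied when mixing audio
    clips: Vec<Clip>,
}

/// The timeline: tracks are composited (video) and mixed (audio) on export.
struct Timeline {
    tracks: Vec<Track>,
}

impl Timeline {
    /// Total duration = the latest clip end point across all tracks.
    fn duration_ms(&self) -> u64 {
        self.tracks
            .iter()
            .flat_map(|t| t.clips.iter())
            .map(|c| c.start + (c.out_point - c.in_point))
            .max()
            .unwrap_or(0)
    }
}

fn main() {
    // A one-track, one-clip timeline: 3 seconds of "intro.mp4",
    // trimmed from 0.5 s to 3.5 s of the source.
    let timeline = Timeline {
        tracks: vec![Track {
            kind: TrackKind::Video,
            gain: 1.0,
            clips: vec![Clip {
                source: "intro.mp4".to_string(),
                start: 0,
                in_point: 500,
                out_point: 3_500,
                fade_ms: 250,
            }],
        }],
    };
    println!("timeline length: {} ms", timeline.duration_ms());
}
```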

Native Multimodal Ability and Efficiency Leap

The base MiMo-V2.5 model is a native multimodal LLM that can see, hear, and read, converting perception directly into actions. On the ClawEval benchmark, its agent capability surpasses that of the previous MiMo-V2-Pro while cutting API-call cost by roughly 50%.

Across multimodal benchmarks such as VideoMME, CharXiv, and MMMU‑Pro, the series demonstrates performance that approaches or exceeds leading closed‑source models.

Both versions feature deep token-efficiency optimisations. At equal ClawEval scores, MiMo-V2.5-Pro uses 42% fewer tokens than Kimi K2.6, and the base MiMo-V2.5 saves about 50% compared with Muse Spark. Both models support a context length of roughly one million tokens; the Pro version targets long-cycle, high-difficulty agent tasks, while the base version covers the majority of general scenarios.

All MiMo‑V2.5 models will be open‑sourced worldwide.

Tags: Large Language Model, AI Agent, benchmark, multimodal, Xiaomi, Token Efficiency, MiMo-V2.5
Written by SuanNi, a community for AI developers that aggregates large-model development services, models, and compute power.