Xiaomi’s MiMo‑V2‑Omni: A Full‑Modal Agent Base that Sees, Listens, and Acts
Xiaomi has unveiled MiMo‑V2‑Omni, a full‑modal agent base that unifies text, image, video, and audio perception with tool calling and GUI actions. It is reported to surpass Gemini 3 Pro on audio understanding and Claude Opus 4.6 on image understanding, and it is offered through an API with a 256K context window for diverse real‑world tasks.
Unified Multimodal Foundation Model
MiMo‑V2‑Omni is built to handle complex real‑world multimodal interaction by integrating text, vision, audio, and video in a single architecture and tightly coupling perception with action. This design removes the traditional “understand‑first, act‑later” limitation and enables native tool invocation, function execution, and GUI manipulation.
Perception Benchmarks
Audio Understanding – Supports environmental sound classification, multi‑speaker separation, audio‑visual joint reasoning, and continuous audio streams longer than 10 hours. Its overall audio performance is reported to surpass Gemini 3 Pro, placing it among the strongest audio‑understanding base models.
Image Understanding – Demonstrates strong multidisciplinary visual reasoning and complex chart analysis, outperforming Claude Opus 4.6 and approaching the level of top closed‑source models such as Gemini 3 Pro.
Video Understanding – Accepts native audio‑video joint input, delivering robust contextual perception and future‑prediction capabilities through an innovative video pre‑training pipeline.
Agent‑Centric Evaluation
Released anonymously on OpenRouter under the name “Healer Alpha,” the model quickly rose to the platform’s top tier by usage. On the OpenClaw PinchBench leaderboard, MiMo‑V2‑Omni achieved the highest average score, a result presented as evidence that its perception and action capabilities work well together.
Open API
MiMo‑V2‑Omni is available through an open API with a 256K context window. Pricing is $0.40 per million input tokens and $2.00 per million output tokens. Access URL: https://platform.xiaomimimo.com.
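To make the listed rates concrete, the following is a minimal sketch of a per-request cost estimate. It uses only the two prices quoted above; the function name `estimate_cost_usd` and the example token counts are hypothetical, and the sketch assumes billing is strictly linear in token count, which the article does not confirm.

```python
from decimal import Decimal

# Prices quoted in the article, in USD per one million tokens.
INPUT_PRICE_PER_MILLION = Decimal("0.4")
OUTPUT_PRICE_PER_MILLION = Decimal("2")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request at the listed rates.

    Assumes purely linear per-token billing (an assumption, not a
    documented billing rule).
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_PRICE_PER_MILLION / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * OUTPUT_PRICE_PER_MILLION / TOKENS_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 200,000 input tokens and 4,000 output tokens.
    print(estimate_cost_usd(200_000, 4_000))
```

Under these assumptions, a request with 200,000 input tokens and 4,000 output tokens would cost about $0.088, with the input side accounting for $0.08 of that. `Decimal` is used rather than floats so that small per-token amounts do not pick up rounding noise.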
Demonstrations
Cross‑modal Film Analysis – When given the “guess‑the‑sound” segment from the movie “Good Things,” the model produced a detailed film‑analysis style interpretation, showcasing multimodal metaphor and emotional reasoning.
Long‑Audio Interview – A 7‑hour interview was fed in a single request; the model extracted core arguments and logical flow across the entire duration, illustrating deep long‑audio comprehension.
Browser Automation (Agentic Capability) – Integrated with the OpenClaw framework, the model can autonomously browse the web: it searches product reviews on Xiaohongshu, compares specifications, switches to JD.com for price comparison, negotiates with live chat agents, and completes the purchase, handling complex page structures and real‑time interactions.
Short‑Video Generation – Given a prompt to create a short TikTok video introducing MiMo‑V2‑Omni, the model designs multiple scenes, synthesizes all audio effects without external assets, renders the video, fills the TikTok upload form, publishes the post, and performs post‑publish actions such as liking and commenting.
Productivity Integration – Connected to WPS Office, the model can generate high‑quality Word documents, structured Excel sheets, formatted PDFs, and complete PowerPoint presentations from brief textual instructions.
Future Directions
Planned extensions include long‑term planning, real‑time streaming perception, multi‑agent collaboration, and deeper integration with the physical world.
