MiniCPM‑o 4.5 Achieves Full‑Duplex Multimodal AI That DeepSeek V4 Missed
MiniCPM‑o 4.5 is billed as the world's first end‑to‑end full‑duplex multimodal model at 9 billion parameters. Built on the Omni‑Flow framework, it runs on a single consumer‑grade GPU with 12 GB of memory, posts benchmark results that match or surpass Gemini 2.5 Flash, and ships with open‑source demos, APIs, and a Windows/macOS installer.
MiniCPM‑o 4.5 – End‑to‑end 9 B full‑duplex multimodal model
MiniCPM‑o 4.5 implements the Omni‑Flow streaming multimodal framework, which aligns visual, audio, and text streams on a shared millisecond‑level timeline. This enables continuous perception, reasoning, and response without external voice‑activity detection (VAD).
Architecture (9 B parameters)
Vision encoder (0.4 B): SigLIP‑ViT.
Audio encoder (0.3 B): Whisper‑Medium.
LLM base (8 B): Qwen3‑8B.
Voice token decoder (0.3 B): lightweight Llama converting text tokens to speech units.
Vocoder: synthesizes final waveform.
The LLM emits only text tokens; the dedicated voice decoder handles speech synthesis, so the base model's language and reasoning capacity is preserved.
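The division of labor above can be pictured as a toy pipeline, with stub functions standing in for the real networks (all names here are illustrative, not the actual MiniCPM‑o API):

```python
def llm_generate(prompt):
    """Stub for the 8B LLM: it emits text tokens only, never audio."""
    return prompt.split()  # toy tokenization

def voice_decode(text_tokens):
    """Stub for the 0.3B voice token decoder: text tokens -> speech units."""
    return [f"unit<{tok}>" for tok in text_tokens]

def vocoder(speech_units):
    """Stub vocoder: speech units -> waveform samples (here just zeros,
    assuming a fixed number of samples per unit for illustration)."""
    return [0.0] * (len(speech_units) * 320)

# Text generation and speech synthesis are decoupled stages:
text = llm_generate("hello full duplex world")
units = voice_decode(text)
wave = vocoder(units)
```

Because speech synthesis is a separate stage downstream of the LLM, the language model's weights are never asked to model raw audio, which is the design rationale the report gives for preserving reasoning quality.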
TAIL – Time‑Aligned Interleaving voice generation
TAIL synchronizes each speech chunk with its corresponding text chunk, avoiding large pre‑read buffers. A lightweight "pre‑look" (look‑ahead) mechanism preserves continuity across word boundaries, yielding low‑delay, natural‑sounding speech.
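A minimal sketch of the idea (illustrative only, not the actual TAIL implementation): pair each text chunk with its speech chunk as soon as it is produced, while passing a one‑chunk look‑ahead so the synthesizer sees the upcoming word boundary instead of buffering the whole utterance.

```python
def tail_interleave(text_chunks, synthesize):
    """Yield (text_chunk, speech_chunk) pairs time-aligned with generation.

    `synthesize(chunk, lookahead)` turns one text chunk into speech; the
    one-chunk lookahead gives cross-word context without a large buffer.
    """
    for i, chunk in enumerate(text_chunks):
        lookahead = text_chunks[i + 1] if i + 1 < len(text_chunks) else ""
        yield chunk, synthesize(chunk, lookahead)

# Toy synthesizer: tags each chunk with the first char of its lookahead.
def toy_synth(chunk, lookahead):
    return f"speech[{chunk}|{lookahead[:1]}]"

pairs = list(tail_interleave(["hel", "lo ", "wor", "ld"], toy_synth))
```

The key property is that speech for chunk *i* is available before chunk *i+2* of the text exists, which is what keeps end‑to‑end latency per chunk rather than per utterance.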
Inference efficiency
The INT4‑quantized model runs in 11 GB of GPU memory and decodes at 212 tokens/s, more than 40 % faster than Qwen3‑Omni, with lower response latency.
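Back‑of‑envelope arithmetic on the reported figures (a sanity check, not an official benchmark):

```python
mini_tps = 212                    # reported decode rate, tokens/s
per_token_ms = 1000 / mini_tps    # wall-clock cost per generated token

# ">40% faster than Qwen3-Omni" implies the baseline decodes at most
# 212 / 1.4 ≈ 151 tokens/s under the same setup.
implied_baseline_tps = mini_tps / 1.4
```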
Visual benchmark results
OpenCompass: 77.6 (MiniCPM‑o 4.5) vs 78.5 (Gemini 2.5 Flash) vs 75.7 (Qwen3‑Omni‑30B‑A3B).
MMBench EN v1.1: 87.6 vs 86.6 vs 84.9.
MathVista: 80.1 vs 75.3 vs 75.9.
HallusionBench: 63.2 vs 59.1 vs 59.7.
Full‑duplex multimodal benchmarks
Daily‑Omni: 80.2 (MiniCPM‑o 4.5) vs 79.3 (Gemini 2.5 Flash) vs 70.7 (Qwen3‑Omni).
Video‑Holmes: 64.29 vs 51.3 vs 50.4.
LiveSports‑3K‑CC win rate: 54.4 % (MiniCPM‑o 4.5); the compared models report no result on this benchmark.
Speech quality
Character error rate (CER): 0.86 vs 1.45 (CosyVoice2) vs 1.41 (Qwen3‑Omni).
Word error rate (WER): 2.38 vs 2.57 vs 3.39.
Emotion score (Expresso): 29.8 vs 17.9 (CosyVoice2).
Key components of Omni‑Flow
Omni‑Flow creates a shared timeline that slices visual, audio, and language streams into millisecond‑level slots. In each slot the model performs a perception‑reasoning‑response cycle, enabling natural interruptions and eliminating reliance on external VAD.
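One way to picture the slot‑by‑slot cycle is a toy event loop (a sketch assuming nothing about the real implementation): every slot ingests whatever audio and video arrived, and because the microphone stream is model input on every slot, an interruption suppresses the model's own output without any external VAD.

```python
import collections

def omni_flow_step(slot, incoming_audio, incoming_video, state):
    """One perception-reasoning-response cycle for a single time slot."""
    # Perception: fuse whatever arrived in this slot into rolling context.
    state.append((incoming_audio, incoming_video))
    # Reasoning + response: if the user started speaking, stop talking
    # immediately -- the audio stream itself signals the interruption.
    user_speaking = incoming_audio is not None
    return None if user_speaking else f"say(slot={slot})"

state = collections.deque(maxlen=100)  # bounded rolling context
# Simulate 4 slots; the user speaks into the mic during slot 2.
outputs = [omni_flow_step(t, "mic" if t == 2 else None, "frame", state)
           for t in range(4)]
```

In the simulation, the model responds in every slot except the one where user audio arrives, mirroring the natural‑interruption behavior the section describes.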
Open resources
Technical report PDF: https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/MiniCPM_o_45_technical_report.pdf
Demo repository (includes local installer): https://github.com/OpenBMB/MiniCPM-o-Demo
Model download (Hugging Face): https://huggingface.co/openbmb/MiniCPM-o-4_5
Model download (ModelScope): https://www.modelscope.cn/models/OpenBMB/MiniCPM-o-4_5