MiniCPM‑o 4.5 Achieves Full‑Duplex Multimodal AI That DeepSeek V4 Missed

MiniCPM-o 4.5 introduces the world's first end-to-end, full-duplex multimodal model at 9 billion parameters, built on the Omni-Flow framework. It runs on a single consumer-grade GPU with 12 GB of memory, matches or surpasses Gemini 2.5 Flash on several benchmarks, and ships with open-source demos, APIs, and a Windows/macOS installer.

PaperAgent

MiniCPM‑o 4.5 – End‑to‑end 9 B full‑duplex multimodal model

MiniCPM-o 4.5 implements the Omni-Flow streaming multimodal framework, which aligns visual, audio, and textual streams on a millisecond-level timeline. This enables continuous perception, reasoning, and response without external voice-activity detection (VAD).

Architecture (9 B parameters)

Vision encoder (0.4 B): SigLIP‑ViT.

Audio encoder (0.3 B): Whisper‑Medium.

LLM base (8 B): Qwen3‑8B.

Voice token decoder (0.3 B): a lightweight Llama-style decoder that converts text tokens to speech units.

Vocoder: synthesizes the final waveform.

The LLM generates only textual tokens; a dedicated voice decoder handles speech synthesis, preserving language and reasoning capacity.
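This division of labor (text-only LLM, separate voice decoder, then vocoder) can be pictured as a staged pipeline. The sketch below is purely illustrative: every function here is a stand-in, not the real MiniCPM-o API.

```python
# Toy stand-ins for the real components; all names and shapes are illustrative.
def vision_encoder(frame: str) -> list[str]:
    return [f"vtok:{frame}"]            # SigLIP-ViT would emit visual embeddings

def audio_encoder(chunk: str) -> list[str]:
    return [f"atok:{chunk}"]            # Whisper-Medium would emit audio embeddings

def llm_generate(tokens: list[str]) -> list[str]:
    # The 8B LLM (Qwen3-8B base) emits *text* tokens only,
    # preserving its language and reasoning capacity.
    return ["hello", "world"]

def voice_token_decoder(text_tokens: list[str]) -> list[int]:
    # A lightweight Llama-style decoder maps text tokens to speech units.
    return [len(t) % 1024 for t in text_tokens]

def vocoder(speech_units: list[int]) -> bytes:
    return bytes(len(speech_units))     # waveform-synthesis stub

def respond(frame: str, audio: str) -> tuple[list[str], bytes]:
    tokens = vision_encoder(frame) + audio_encoder(audio)
    text = llm_generate(tokens)
    return text, vocoder(voice_token_decoder(text))

text, wav = respond("frame0", "chunk0")
```

The point of the structure is visible in `respond`: speech synthesis branches off the text tokens, so the LLM itself never has to model audio output.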

TAIL – Time‑Aligned Interleaving voice generation

TAIL synchronizes each speech chunk with its corresponding text chunk, avoiding large pre‑read buffers. A lightweight “pre‑look” mechanism ensures cross‑word continuity, achieving low‑delay, natural‑sounding speech.
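The chunk-level interleaving can be sketched as a generator that pairs each text chunk with its speech chunk while peeking only one chunk ahead. This is a minimal illustration of the idea, assuming a one-chunk "pre-look" window; the real decoder and its window size are not specified at this granularity.

```python
def tail_stream(text_chunks: list[str]):
    """Toy TAIL loop: emit each speech chunk alongside its text chunk,
    peeking one chunk ahead ('pre-look') instead of buffering the full text."""
    for i, chunk in enumerate(text_chunks):
        lookahead = text_chunks[i + 1] if i + 1 < len(text_chunks) else ""
        # Hypothetical synthesis step: the pre-look lets the decoder smooth
        # the boundary into the next word without waiting for the sentence.
        speech = f"speech({chunk}|next={lookahead[:3]})"
        yield chunk, speech

pairs = list(tail_stream(["hel", "lo ", "wor", "ld"]))
```

Because each speech chunk depends only on the current chunk plus a small lookahead, first-audio latency stays bounded by the chunk size rather than the utterance length.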

Inference efficiency

The INT4-quantized model runs in about 11 GB of GPU memory and reaches 212 tokens/s, more than 40 % faster than Qwen3-Omni, with lower response latency.
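Taking the reported numbers at face value, the ">40 % faster" claim implies a Qwen3-Omni baseline below roughly 151 tokens/s, and 212 tokens/s corresponds to under 5 ms per generated token. A quick back-of-the-envelope check:

```python
minicpm_tps = 212.0                           # reported throughput, tokens/s
speedup = 1.40                                # ">40 % faster" lower bound
implied_baseline_tps = minicpm_tps / speedup  # upper bound on the baseline
per_token_latency_ms = 1000.0 / minicpm_tps   # per-token decoding latency
```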

Visual benchmark results

OpenCompass: 77.6 (MiniCPM‑o 4.5) vs 78.5 (Gemini 2.5 Flash) vs 75.7 (Qwen3‑Omni‑30B‑A3B).

MMBench EN v1.1: 87.6 vs 86.6 vs 84.9.

MathVista: 80.1 vs 75.3 vs 75.9.

HallusionBench: 63.2 vs 59.1 vs 59.7.

Full‑duplex multimodal benchmarks

Daily‑Omni: 80.2 (MiniCPM‑o 4.5) vs 79.3 (Gemini 2.5 Flash) vs 70.7 (Qwen3‑Omni).

Video‑Holmes: 64.29 vs 51.3 vs 50.4.

LiveSports‑3K‑CC win‑rate: 54.4 % (MiniCPM‑o 4.5); competing models report no result.

Speech quality

Character error rate (CER): 0.86 vs 1.45 (CosyVoice2) vs 1.41 (Qwen3‑Omni).

Word error rate (WER): 2.38 vs 2.57 vs 3.39.

Emotion score (Expresso): 29.8 vs 17.9 (CosyVoice2).
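For context on the CER/WER figures above: both are standard edit-distance metrics, the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by reference length (characters for CER, words for WER). A minimal reference implementation for intuition, not the benchmark's actual scoring script:

```python
def edit_distance(ref, hyp) -> int:
    # Classic Levenshtein dynamic program over two token sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)
```

So a CER of 0.86 means fewer than one character error per hundred reference characters.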

Key components of Omni‑Flow

Omni‑Flow creates a shared timeline that slices visual, audio, and language streams into millisecond‑level slots. In each slot the model performs a perception‑reasoning‑response cycle, enabling natural interruptions and eliminating reliance on external VAD.
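One way to picture the shared timeline is a loop over fixed time slots, each running a perceive, reason, respond step over whatever arrived in that window. The sketch below is illustrative only; the slot width and the interruption rule are assumptions, since the report says only "millisecond-level".

```python
SLOT_MS = 80  # hypothetical slot width; the text says only "millisecond-level"

def omni_flow(slots: list[tuple[str, str]]) -> list[str]:
    """Each slot carries (video, audio) evidence for one time window.
    Because the model decides per slot, a user interruption in slot t is
    perceived in slot t itself, with no external VAD gating the turn."""
    actions = []
    for video, audio in slots:
        percept = (video, audio)              # perceive the current slot
        interrupted = audio == "user_speaks"  # reason (toy rule)
        actions.append("yield_turn" if interrupted else "continue")  # respond
    return actions

acts = omni_flow([("f0", "silence"), ("f1", "user_speaks"), ("f2", "silence")])
total_ms = SLOT_MS * len(acts)  # the three slots above span 240 ms
```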

Open resources

Technical report PDF: https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/MiniCPM_o_45_technical_report.pdf

Demo repository (includes local installer): https://github.com/OpenBMB/MiniCPM-o-Demo

Model download (Hugging Face): https://huggingface.co/openbmb/MiniCPM-o-4_5

Model download (ModelScope): https://www.modelscope.cn/models/OpenBMB/MiniCPM-o-4_5

Tags: AI, open-source, benchmark, multimodal, full-duplex, MiniCPM-o
Written by PaperAgent: daily updates analyzing cutting-edge AI research papers.