Xiaomi Unveils MiMo-V2-TTS: Giving Agents a Voice with Soul

Xiaomi introduces MiMo-V2-TTS, an in-house large speech-synthesis model. It combines a custom audio tokenizer, a multi-codebook architecture, large-scale pre-training on more than 100 million hours of data, and multi-dimensional reinforcement learning to deliver fine-grained style control, dialect support, role-play and high-quality singing, with the aim of giving AI agents expressive, human-like voices.

Xiaomi Tech

In the emerging agent era, intelligent agents need expressive voices as well as perception and action. To that end, Xiaomi has released MiMo-V2-TTS, a large speech-synthesis model designed for multimodal interaction.

Model overview: MiMo-V2-TTS pairs the in-house MiMo Audio Tokenizer with a new multi-codebook speech-modeling architecture that captures fine-grained acoustic features.
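
The article does not describe the internals of the MiMo Audio Tokenizer. To illustrate what "multi-codebook" can mean in general, the sketch below uses residual vector quantization (RVQ), a common multi-codebook scheme in neural audio codecs. RVQ is chosen here as an assumption for illustration, not as MiMo's documented design, and the codebooks, the function names `rvq_encode` and `rvq_decode`, and the two-dimensional frame are all hypothetical toy values.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], frame: Vector) -> List[int]:
    """Quantize one frame into one index per codebook, from coarse to fine."""
    residual = list(frame)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Each later codebook only has to model what the earlier ones missed.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(codebooks: Sequence[Sequence[Vector]], indices: Sequence[int]) -> Vector:
    """Reconstruct a frame by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Toy codebooks (hypothetical values): the first is coarse, the second refines it.
COARSE = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
FINE = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]]

codes = rvq_encode([COARSE, FINE], [1.2, 1.0])
approx = rvq_decode([COARSE, FINE], codes)
```

In a scheme like this, each frame of audio becomes a short tuple of discrete tokens, and adding codebooks adds acoustic detail to the reconstruction.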

Training pipeline: Training proceeds in three stages. First, the model undergoes ultra-large-scale mixed speech-text pre-training on more than 100 million hours of diverse audio, learning cross-modal alignment and generation. Second, supervised fine-tuning on a small set of high-quality data teaches multi-granular, multi-style instruction control. Third, multi-dimensional reinforcement learning optimizes for natural prosody, stable audio quality, accurate phoneme rendering, high-fidelity voice cloning, and scenario-appropriate tone.
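
The article names the dimensions targeted by reinforcement learning but not how they are scored or combined. As a minimal sketch of one simple possibility, the code below merges per-dimension scores into a single scalar reward with a weighted average. The dimension keys, the `combined_reward` function, and the weights are assumptions made for illustration and are not taken from Xiaomi's training setup.

```python
from typing import Dict

# Hypothetical keys, one per optimization target listed in the article.
REWARD_DIMENSIONS = (
    "prosody",             # natural prosody
    "audio_quality",       # stable audio quality
    "pronunciation",       # accurate phoneme rendering
    "speaker_similarity",  # high-fidelity voice cloning
    "tone_fit",            # scenario-appropriate tone
)


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed to lie in [0, 1])."""
    missing = [d for d in REWARD_DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in REWARD_DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[d] * scores[d] for d in REWARD_DIMENSIONS)
    return weighted / total_weight


# Example: equal weights, with placeholder scores from hypothetical evaluators.
example_scores = {d: 0.5 for d in REWARD_DIMENSIONS}
example_weights = {d: 1.0 for d in REWARD_DIMENSIONS}
reward = combined_reward(example_scores, example_weights)
```

A single scalar is convenient for policy optimization, but it lets a high score on one dimension mask a low score on another, which is one reason a real system might treat the dimensions separately.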

Generalizable style control: Users can issue natural-language instructions to set the overall tone of a voice and to adjust local emotional nuances within a single sentence, which enables smooth tone shifts and emotional progression.
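
The article does not publish a request format for these instructions. Purely to make the global-versus-local distinction concrete, the sketch below models an utterance as one global style instruction plus spans that may each carry a local instruction. The `Span` and `StyledUtterance` classes and the bracket and parenthesis notation are hypothetical and do not represent the MiMo-V2-TTS interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A piece of text with an optional local style instruction."""
    text: str
    style: str = ""


@dataclass
class StyledUtterance:
    """One global style instruction plus a sequence of spans."""
    global_style: str
    spans: List[Span] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the utterance into one prompt string (hypothetical notation)."""
        parts = [
            f"({span.style}) {span.text}" if span.style else span.text
            for span in self.spans
        ]
        return f"[{self.global_style}] " + " ".join(parts)


utterance = StyledUtterance(
    global_style="warm, calm narrator",
    spans=[
        Span("I thought we had lost."),
        Span("But we made it!", style="growing excitement"),
    ],
)
prompt = utterance.render()
```

Here the first span inherits the global tone and the second overrides it locally, which mirrors the within-sentence emotional progression the article describes.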

Robust text understanding: Trained on massive text-speech alignment data, the model interprets punctuation, interjections, and emphasis markers on its own and converts them into appropriate prosodic patterns without any extra annotation.

Beyond plain speech: MiMo-V2-TTS supports multiple dialects (e.g., Northeastern, Sichuan, Henan, Cantonese, Taiwanese), role-play style rendering, and high-quality singing synthesis. The singing capability is demonstrated with example lyrics, rendered with pitch and rhythm preserved.

Next steps: The roadmap includes extending coverage to more languages beyond Chinese and English and deeply integrating the TTS model with the MiMo-V2-Omni multimodal model, so that agents can see, understand, and narrate the world with a lifelike voice.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

reinforcement learning, large model, speech synthesis, style control, audio tokenizer, multilingual TTS