VoxCPM2: Tokenizer‑Free Multilingual TTS that Creates New Voices from Text

VoxCPM2, an open‑source 2‑billion‑parameter TTS model from OpenBMB, eliminates tokenizers and uses a diffusion‑autoregressive architecture to generate high‑fidelity, controllable speech in 30 languages, supporting voice design from natural‑language prompts and high‑quality voice cloning with just a short reference clip.


Introduction

OpenBMB has open‑sourced VoxCPM2, a 2‑billion‑parameter, tokenizer‑free multilingual text‑to‑speech (TTS) model that can craft entirely new voices from plain text descriptions and perform high‑fidelity voice cloning, all under an Apache 2.0 license for free commercial use.

Traditional TTS Pain Points

Conventional TTS pipelines rely on complex multi‑stage processing and discrete tokenization, which often produce robotic, emotion‑less speech. Multilingual support typically requires manual language tags, and voice cloning depends on large collections of high‑quality reference audio, limiting control over style and prosody.

How VoxCPM2 Addresses the Issues

VoxCPM2 adopts a tokenizer‑free, end‑to‑end diffusion‑autoregressive architecture that directly generates continuous speech representations, avoiding the information loss introduced by discretization. The core model is built on the MiniCPM‑4 backbone (≈2 billion parameters) and is trained on over 2 million hours of multilingual speech data, providing a solid foundation for natural synthesis.
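To see why discretization matters, here is a stdlib‑only toy (not VoxCPM2's actual codec): quantizing a waveform to a finite set of levels, as discrete tokenizers must, always introduces reconstruction error, which a continuous representation avoids.

```python
import math

def quantize(samples, levels):
    """Snap samples in [-1, 1] onto a grid of `levels` discrete values,
    mimicking the lossy tokenization step of conventional TTS codecs."""
    step = 2.0 / (levels - 1)
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]

def rmse(a, b):
    """Root-mean-square reconstruction error between two signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# A short synthetic "waveform": one cycle of a sine wave.
wave = [math.sin(2 * math.pi * i / 64) for i in range(64)]

coarse = quantize(wave, 8)    # small vocabulary: large error
fine = quantize(wave, 1024)   # large vocabulary: smaller error, never zero

print(rmse(wave, coarse) > rmse(wave, fine) > 0.0)  # True
```

Growing the vocabulary shrinks the error but never removes it; operating on continuous representations sidesteps the trade‑off entirely.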

Core Highlights

Voice Design: Use natural‑language prompts such as “a gentle middle‑aged male, slow speaking rate” to generate a brand‑new timbre.

Controllable Cloning: Provide a short reference clip to clone a voice while steering emotion, speed, and other style attributes.

30‑Language Direct Synthesis: Input text in any of the supported languages without needing explicit language labels.

48 kHz High‑Fidelity Output: Produce studio‑quality audio directly.

Technical Architecture

The backbone is a diffusion‑autoregressive model that captures the complex distribution of speech signals, enabling highly natural and varied output. Its tokenizer‑free nature preserves signal continuity, eliminating the “stitching” effect.

For high‑quality audio, VoxCPM2 employs AudioVAE V2 as the codec. The asymmetric design accepts 16 kHz reference audio and outputs 48 kHz audio, with an integrated super‑resolution module that removes the need for separate up‑sampling steps.
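As rough intuition for the 16 kHz → 48 kHz asymmetry, the sketch below triples a signal's sample rate with naive linear interpolation. The real AudioVAE V2 learns this super‑resolution mapping; this only illustrates the 3× rate change.

```python
def upsample_3x(samples):
    """Naive 3x upsampler (16 kHz -> 48 kHz sample rate) that inserts two
    linearly interpolated points between each pair of input samples."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out.extend([a, a + (b - a) / 3, a + 2 * (b - a) / 3])
    out.append(samples[-1])
    return out

low = [0.0, 0.3, 0.6, 0.3]   # 4 samples at "16 kHz"
high = upsample_3x(low)      # 3 * (4 - 1) + 1 = 10 samples at "48 kHz"
print(len(high))             # 10
```

A learned module replaces this fixed interpolation with content‑aware reconstruction, which is why no separate up‑sampling step is needed downstream.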

The model also features context‑aware capabilities: it automatically infers appropriate prosody and emotion from the input text, making “surprise!” sound genuinely surprised and “quiet.” sound calm. Combined with Nano‑VLLM acceleration, synthesis runs near real‑time on an RTX 4090.
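A hand‑written stand‑in for this context awareness might key prosody off punctuation. VoxCPM2 learns such cues end‑to‑end rather than using rules, so the mapping below is purely illustrative.

```python
def infer_prosody(text):
    """Toy prosody guess from surface punctuation: exclamations read as
    excited and fast, questions as rising, everything else as calm."""
    t = text.strip()
    if t.endswith("!"):
        return {"emotion": "excited", "rate": "fast"}
    if t.endswith("?"):
        return {"emotion": "questioning", "rate": "medium"}
    return {"emotion": "calm", "rate": "medium"}

print(infer_prosody("surprise!"))  # excited, fast
print(infer_prosody("quiet."))     # calm
```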

Quick Start

Developers can try the online demo on Hugging Face or ModelScope, or install the package locally:

pip install voxcpm

from voxcpm import VoxCPM2
model = VoxCPM2.from_pretrained('openbmb/VoxCPM2')

With a few lines of code, you can generate a voice from a description, e.g., “A cheerful young woman with a slight British accent,” or perform voice cloning by loading a short reference audio and providing new text.
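A sketch of what those few lines might look like. The `generate` method name and the keyword arguments below are assumptions for illustration, not the confirmed VoxCPM2 API; check the official documentation for the real signatures.

```python
# Hypothetical usage sketch; degrade gracefully if the package is absent.
try:
    from voxcpm import VoxCPM2
    model = VoxCPM2.from_pretrained("openbmb/VoxCPM2")
except Exception:
    model = None  # voxcpm not installed; the request shapes below still apply

# Voice design: a brand-new timbre from a plain-text description.
design_request = {
    "text": "Welcome back, and thanks for listening.",
    "voice_description": "A cheerful young woman with a slight British accent",
}

# Controllable cloning: a short reference clip plus style overrides.
clone_request = {
    "text": "This sentence is spoken in the cloned voice.",
    "reference_audio": "speaker.wav",  # short reference clip
    "emotion": "calm",
    "speed": 0.9,
}

if model is not None:
    audio = model.generate(**design_request)  # assumed method name
```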

Application Scenarios

Content creators & media : Rapidly produce multilingual dubbing for videos or podcasts, and design unique voices for virtual characters, cutting audio production costs.

Game & animation development : Generate expressive speech for large numbers of NPCs or characters, supporting dynamic dialogue.

Audiobooks & education : Convert textual material into natural, multi‑language audio, even using a “teacher” voice for different subjects.

AI assistants & interactive agents : Provide highly natural, customizable speech for chatbots, virtual humans, and customer‑service avatars.

Research & developers : The Apache 2.0 license permits free commercial use, making VoxCPM2 an excellent baseline for speech synthesis and multimodal research.

Open Source and Future Outlook

OpenBMB releases the full model weights, training code, and inference code, accompanied by detailed documentation and active community support on Feishu and Discord, dramatically lowering the barrier to research and deployment.

Looking ahead, the tokenizer‑free architecture shows great potential. As model scale and training data continue to grow, we can expect even more expressive, controllable, and diverse voice generation in the next generation of TTS systems. VoxCPM2 opens the door to an era where speech synthesis and human creativity merge without limits.

Tags: diffusion model, open-source, TTS, multilingual, voice cloning, AudioVAE, VoxCPM2
Written by AI Explorer