Zero‑Shot Voice Cloning with Emotion and Duration Control: IndexTTS‑2 Runs Locally
IndexTTS‑2 is an open‑source zero‑shot TTS system from Bilibili that offers precise duration control, separation of emotion from timbre, and bilingual synthesis. It ships with a modern uv‑based installation, GPU‑accelerated inference, and benchmark‑leading WER and emotion‑similarity scores against contemporary models.
Overview
IndexTTS-2 is a second‑generation zero‑shot text‑to‑speech model that clones a reference voice and allows independent emotion control while precisely matching a target duration.
Problem
Autoregressive TTS models (e.g., XTTS, CosyVoice) generate natural speech but cannot guarantee exact output length, which is required for video dubbing where timing must align to the visual track.
Core Capabilities
Voice + Emotion: combine a speaker's timbre with a desired emotional tone.
Exact Duration: specify the target length; token‑count encoding yields an error rate below 0.02%.
Natural‑Language Emotion Prompt: describe an emotion such as "fear" or "surprise" and the model generates speech with the corresponding affect.
Multi‑Modal Emotion Input: reference audio, emotion vectors, or a natural‑language description.
Bilingual Support: trained on 55,000 hours of data (30,000 hours Chinese, 25,000 hours English).
Open‑Source Commercial License: Apache 2.0; code and weights are publicly available.
Architecture
The system consists of three modules:
Text‑to‑Semantic (T2S): an autoregressive Transformer that converts text to semantic tokens. A length‑encoding mechanism embeds the desired token count, so the model generates a predetermined number of tokens; during training the length token is randomly zeroed out with 30% probability, which also yields a free‑generation mode with no duration constraint (a minimal sketch of this conditioning follows the module list).
Semantic‑to‑Mel (S2M): a non‑autoregressive Flow‑Matching module that transforms semantic tokens into mel‑spectrograms. It fuses the final hidden state of T2S (GPT‑style hidden‑state enhancement) to reduce phoneme blurring in emotional synthesis.
Text‑to‑Emotion (T2E): uses DeepSeek‑R1 as a teacher to produce emotion distributions, which are distilled into Qwen3‑1.7B via LoRA. Seven basic emotions are supported: anger, joy, fear, disgust, sadness, surprise, and calm.
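To make the length‑encoding idea concrete, here is a minimal sketch, not the authors' code, of how a target token count can be embedded like a token and prepended to an autoregressive decoder's input so that generation is conditioned on the desired length. All class and parameter names here are illustrative:

```python
import torch
import torch.nn as nn

class DurationConditionedDecoder(nn.Module):
    """Toy T2S-style decoder: the target token count is embedded like a token."""
    def __init__(self, vocab_size=8192, d_model=512, max_tokens=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.len_emb = nn.Embedding(max_tokens + 1, d_model)  # index 0 = free generation
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, text_emb, prev_tokens, target_len):
        # text_emb:    (B, T_text, d_model) encoded text prompt
        # prev_tokens: (B, T_gen) semantic tokens generated so far
        # target_len:  (B,) desired number of semantic tokens; 0 disables the constraint
        len_vec = self.len_emb(target_len).unsqueeze(1)    # (B, 1, d_model)
        seq = torch.cat([len_vec, text_emb, self.token_emb(prev_tokens)], dim=1)
        return self.head(self.backbone(seq)[:, -1])        # next-token logits
```

Zeroing `target_len` with 30% probability during training is what makes the unconstrained free‑generation mode work at inference time.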
Training Strategy
Stage 1: train on the full dataset to establish basic capabilities.
Stage 2: fine‑tune on 135 hours of high‑quality emotional data, using a Gradient Reversal Layer (GRL) to decouple timbre from emotion (a minimal GRL sketch follows this list).
Stage 3: second‑round fine‑tuning on the full dataset to improve robustness.
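The GRL used in Stage 2 is the standard construction from domain‑adversarial training; a minimal sketch (my own, not the authors' implementation):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)   # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient on the way back.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: an emotion classifier attached through the GRL trains normally,
# but the reversed gradient pushes the upstream speaker embedding to discard
# emotion cues -- exactly the timbre/emotion decoupling described above.
# emo_logits = emotion_classifier(grad_reverse(speaker_embedding))
```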
Installation
IndexTTS‑2 requires the uv package manager; plain conda/pip installs are not supported because the project pins sensitive dependency versions.
# 1. Install uv
pip install -U uv
# 2. Clone repository
git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull
# 3. Install dependencies
uv sync --all-extras
# 4. Download model
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
For users in China with slow HuggingFace access, ModelScope can be used instead:
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
Hardware requirements
GPU: NVIDIA GPU with CUDA 12.8+ (recommended)
Memory: 8 GB+ (16 GB recommended)
Storage: 10 GB+
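Before downloading 10 GB of weights, it is worth confirming the GPU is visible to PyTorch. A small check script (my own helper, not part of the repo):

```python
import torch

# Confirm a CUDA-capable GPU is available and report its VRAM.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU found; inference will fall back to CPU and be slow.")
```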
Usage
Web demo (one‑click start):
uv run webui.py
Then open http://127.0.0.1:7860 to try the demo. FP16 inference and DeepSpeed acceleration are available.
Python API example
from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=True,           # save VRAM
    use_cuda_kernel=False,
    use_deepspeed=False,
)

text = "大家好,我是 AI 配音"  # "Hello everyone, I'm an AI voice-over"
tts.infer(
    spk_audio_prompt='examples/voice_01.wav',  # reference voice to clone
    text=text,
    output_path="gen.wav",
)

Emotion Control Methods
# Method 1: reference audio for emotion
tts.infer(
    spk_audio_prompt='voice.wav',       # timbre comes from this speaker
    emo_audio_prompt='emo_sad.wav',     # emotion comes from this clip
    text=text,
    emo_alpha=0.9,                      # emotion intensity
    output_path="gen.wav",
)
# Method 2: emotion vector control
# (order: happy, angry, sad, afraid, disgusted, melancholic, surprised, calm)
tts.infer(
    spk_audio_prompt='voice.wav',
    emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0],  # surprised at 0.45
    text=text,
    output_path="gen.wav",
)
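Since the vector order is fixed, a tiny helper (hypothetical, not part of the API) keeps Method 2 readable:

```python
# Hypothetical convenience helper: build the 8-dim emo_vector from a name.
EMOTIONS = ["happy", "angry", "sad", "afraid",
            "disgusted", "melancholic", "surprised", "calm"]

def make_emo_vector(name, intensity=0.45):
    vec = [0.0] * len(EMOTIONS)
    vec[EMOTIONS.index(name)] = intensity
    return vec

# Equivalent to the call above:
# tts.infer(spk_audio_prompt='voice.wav',
#           emo_vector=make_emo_vector("surprised"),
#           text=text, output_path="gen.wav")
```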
# Method 3: natural-language emotion, inferred from the text itself
tts.infer(
    spk_audio_prompt='voice.wav',
    text="快躲起来!是他要来了!他要来抓我们了!",  # "Quick, hide! He's coming! He's coming to get us!"
    use_emo_text=True,   # infer emotion from text
    emo_alpha=0.6,
    output_path="gen.wav",
)
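For dubbing work, the same infer() call extends naturally to batches. A small sketch using only the API shown above (the file names and subtitle lines are illustrative):

```python
# Synthesize one clip per subtitle line with a fixed reference voice.
lines = [
    "Hello everyone, welcome back.",
    "Today we are testing IndexTTS-2.",
]
for i, line in enumerate(lines):
    tts.infer(
        spk_audio_prompt='examples/voice_01.wav',
        text=line,
        output_path=f"clip_{i:03d}.wav",
    )
```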
Performance Comparison

Experiments reported in the paper show IndexTTS‑2 outperforming peer models on LibriSpeech and a Chinese test set (SeedTTS‑zh). Results (WER ↓, emotion similarity ↑):

| Model | LibriSpeech WER | SeedTTS‑zh WER | Emotion similarity |
| --- | --- | --- | --- |
| MaskGCT | 3.58% | 3.21% | 0.812 |
| F5‑TTS | 2.41% | 3.35% | 0.795 |
| CosyVoice2 | 2.07% | 2.43% | 0.831 |
| SparkTTS | 2.43% | 2.87% | 0.847 |
| IndexTTS‑2 | 1.88% | 2.12% | 0.872 |
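For reference, WER (word error rate) measures the edit distance between an ASR transcript of the generated audio and the input text; lower is better. A quick illustration using the jiwer library (my example, not the paper's evaluation code):

```python
from jiwer import wer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"   # two substitutions
print(f"WER: {wer(reference, hypothesis):.1%}")             # -> 22.2%
```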
Duration control is also remarkably accurate: the token‑count error rate stays below 0.03% in the paper's scaling experiments, meaning generated clips almost always land on the exact requested token count.
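As a back‑of‑the‑envelope check (the token rate here is my assumption, not a number from the paper), a 0.03% token‑count error translates into very small timing drift:

```python
token_rate = 25        # semantic tokens per second -- assumed for illustration
clip_seconds = 60
err_tokens = 0.0003 * clip_seconds * token_rate        # expected token-count error
print(f"~{err_tokens:.2f} tokens ≈ {err_tokens / token_rate * 1000:.0f} ms drift")
# -> ~0.45 tokens ≈ 18 ms drift over a full minute of audio
```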
Links
Project repository: https://github.com/index-tts/index-tts
Paper: https://arxiv.org/abs/2506.21619
Online demo: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
Model download: https://huggingface.co/IndexTeam/IndexTTS-2