Zero‑Shot Voice Cloning with Emotion and Duration Control: IndexTTS‑2 Runs Locally
IndexTTS‑2 is an open‑source zero‑shot TTS system from Bilibili that offers precise duration control, separation of emotion from timbre, and bilingual synthesis. It ships with a modern uv‑based installation, GPU‑accelerated inference, and benchmark‑leading WER and emotion‑similarity scores against contemporary models.
Overview
IndexTTS-2 is a second‑generation zero‑shot text‑to‑speech model that clones a reference voice and allows independent emotion control while precisely matching a target duration.
Problem
Autoregressive TTS models (e.g., XTTS, CosyVoice) generate natural speech but cannot guarantee exact output length, which is required for video dubbing where timing must align to the visual track.
Core Capabilities
Voice + Emotion: combine a speaker's timbre with a desired emotional tone.
Exact Duration: specify the target length; token‑count encoding yields an error rate below 0.02%.
Natural‑Language Emotion Prompt: describe an emotion such as "fear" or "surprise" and the model generates speech with the corresponding affect.
Multi‑Modal Emotion Input: reference audio, emotion vectors, or a natural‑language description.
Bilingual Support: trained on 55,000 hours of data (30,000 hours Chinese, 25,000 hours English).
Open‑Source Commercial License: Apache 2.0; code and weights are publicly available.
Architecture
The system consists of three modules:
Text‑to‑Semantic (T2S): an autoregressive Transformer that converts text to semantic tokens. A length‑encoding mechanism embeds the desired token count, so the model generates a predetermined number of tokens; during training the length token is randomly zeroed out with 30% probability, which also yields a free‑generation mode with no duration constraint (a minimal sketch of this conditioning follows the module list).
Semantic‑to‑Mel (S2M): a non‑autoregressive Flow‑Matching module that transforms semantic tokens into mel‑spectrograms. It fuses the final hidden state of T2S (GPT‑style hidden‑state enhancement) to reduce phoneme blurring in emotional synthesis.
Text‑to‑Emotion (T2E): uses DeepSeek‑R1 as a teacher to produce emotion distributions, which are distilled into Qwen3‑1.7B via LoRA. Seven basic emotions are supported: anger, joy, fear, disgust, sadness, surprise, and calm.
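To make the length‑encoding idea concrete, here is a minimal sketch, not the authors' code, of how a target token count can be embedded like a token and prepended to an autoregressive decoder's input so that generation is conditioned on the desired length. All class and parameter names here are illustrative:

```python
import torch
import torch.nn as nn

class DurationConditionedDecoder(nn.Module):
    """Toy T2S-style decoder: the target token count is embedded like a token."""
    def __init__(self, vocab_size=8192, d_model=512, max_tokens=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.len_emb = nn.Embedding(max_tokens + 1, d_model)  # index 0 = free generation
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, text_emb, prev_tokens, target_len):
        # text_emb:    (B, T_text, d_model) encoded text prompt
        # prev_tokens: (B, T_gen) semantic tokens generated so far
        # target_len:  (B,) desired number of semantic tokens; 0 disables the constraint
        len_vec = self.len_emb(target_len).unsqueeze(1)    # (B, 1, d_model)
        seq = torch.cat([len_vec, text_emb, self.token_emb(prev_tokens)], dim=1)
        return self.head(self.backbone(seq)[:, -1])        # next-token logits
```

Zeroing `target_len` with 30% probability during training is what makes the unconstrained free‑generation mode work at inference time.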
Training Strategy
Stage 1: train on the full dataset to establish basic capabilities.
Stage 2: fine‑tune on 135 hours of high‑quality emotional data, using a Gradient Reversal Layer (GRL) to decouple timbre from emotion (a minimal GRL sketch follows this list).
Stage 3: second‑round fine‑tuning on the full dataset to improve robustness.
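The GRL used in Stage 2 is the standard construction from domain‑adversarial training; a minimal sketch (my own, not the authors' implementation):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)   # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate (and scale) the gradient on the way back.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch: an emotion classifier attached through the GRL trains normally,
# but the reversed gradient pushes the upstream speaker embedding to discard
# emotion cues -- exactly the timbre/emotion decoupling described above.
# emo_logits = emotion_classifier(grad_reverse(speaker_embedding))
```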
Installation
IndexTTS‑2 requires the uv package manager; plain conda/pip installs are not supported because the project pins sensitive dependency versions.
# 1. Install uv
pip install -U uv
# 2. Clone repository
git clone https://github.com/index-tts/index-tts.git && cd index-tts
git lfs pull
# 3. Install dependencies
uv sync --all-extras
# 4. Download model
uv tool install "huggingface-hub[cli,hf_xet]"
hf download IndexTeam/IndexTTS-2 --local-dir=checkpoints
For users in China with slow HuggingFace access, ModelScope can be used instead:
uv tool install "modelscope"
modelscope download --model IndexTeam/IndexTTS-2 --local_dir checkpoints
Hardware requirements
GPU: NVIDIA GPU with CUDA 12.8+ (recommended)
Memory: 8 GB+ (16 GB recommended)
Storage: 10 GB+
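Before downloading 10 GB of weights, it is worth confirming the GPU is visible to PyTorch. A small check script (my own helper, not part of the repo):

```python
import torch

# Confirm a CUDA-capable GPU is available and report its VRAM.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA GPU found; inference will fall back to CPU and be slow.")
```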
Usage
Web demo (one‑click start):
uv run webui.py
Then open http://127.0.0.1:7860 to try the demo. FP16 inference and DeepSpeed acceleration are available.
Python API example
from indextts.infer_v2 import IndexTTS2

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",
    model_dir="checkpoints",
    use_fp16=True,           # save VRAM
    use_cuda_kernel=False,
    use_deepspeed=False,
)

text = "大家好,我是 AI 配音"  # "Hello everyone, I'm an AI voice-over"
tts.infer(
    spk_audio_prompt='examples/voice_01.wav',  # reference voice to clone
    text=text,
    output_path="gen.wav",
)

Emotion Control Methods
# Method 1: reference audio for emotion
tts.infer(
    spk_audio_prompt='voice.wav',       # timbre comes from this speaker
    emo_audio_prompt='emo_sad.wav',     # emotion comes from this clip
    text=text,
    emo_alpha=0.9,                      # emotion intensity
    output_path="gen.wav",
)
# Method 2: emotion vector control
# (order: happy, angry, sad, afraid, disgusted, melancholic, surprised, calm)
tts.infer(
    spk_audio_prompt='voice.wav',
    emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0],  # surprised at 0.45
    text=text,
    output_path="gen.wav",
)
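Since the vector order is fixed, a tiny helper (hypothetical, not part of the API) keeps Method 2 readable:

```python
# Hypothetical convenience helper: build the 8-dim emo_vector from a name.
EMOTIONS = ["happy", "angry", "sad", "afraid",
            "disgusted", "melancholic", "surprised", "calm"]

def make_emo_vector(name, intensity=0.45):
    vec = [0.0] * len(EMOTIONS)
    vec[EMOTIONS.index(name)] = intensity
    return vec

# Equivalent to the call above:
# tts.infer(spk_audio_prompt='voice.wav',
#           emo_vector=make_emo_vector("surprised"),
#           text=text, output_path="gen.wav")
```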
# Method 3: natural-language emotion, inferred from the text itself
tts.infer(
    spk_audio_prompt='voice.wav',
    text="快躲起来!是他要来了!他要来抓我们了!",  # "Quick, hide! He's coming! He's coming to get us!"
    use_emo_text=True,   # infer emotion from text
    emo_alpha=0.6,
    output_path="gen.wav",
)
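For dubbing work, the same infer() call extends naturally to batches. A small sketch using only the API shown above (the file names and subtitle lines are illustrative):

```python
# Synthesize one clip per subtitle line with a fixed reference voice.
lines = [
    "Hello everyone, welcome back.",
    "Today we are testing IndexTTS-2.",
]
for i, line in enumerate(lines):
    tts.infer(
        spk_audio_prompt='examples/voice_01.wav',
        text=line,
        output_path=f"clip_{i:03d}.wav",
    )
```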
Performance Comparison

Experiments reported in the paper show IndexTTS‑2 outperforming peer models on LibriSpeech and a Chinese test set (SeedTTS‑zh). Results (WER ↓, emotion similarity ↑):

| Model | LibriSpeech WER | SeedTTS‑zh WER | Emotion similarity |
| --- | --- | --- | --- |
| MaskGCT | 3.58% | 3.21% | 0.812 |
| F5‑TTS | 2.41% | 3.35% | 0.795 |
| CosyVoice2 | 2.07% | 2.43% | 0.831 |
| SparkTTS | 2.43% | 2.87% | 0.847 |
| IndexTTS‑2 | 1.88% | 2.12% | 0.872 |
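For reference, WER (word error rate) measures the edit distance between an ASR transcript of the generated audio and the input text; lower is better. A quick illustration using the jiwer library (my example, not the paper's evaluation code):

```python
from jiwer import wer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"   # two substitutions
print(f"WER: {wer(reference, hypothesis):.1%}")             # -> 22.2%
```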
Duration control is also remarkably accurate: the token‑count error rate stays below 0.03% in the paper's scaling experiments, meaning generated clips almost always land on the exact requested token count.
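As a back‑of‑the‑envelope check (the token rate here is my assumption, not a number from the paper), a 0.03% token‑count error translates into very small timing drift:

```python
token_rate = 25        # semantic tokens per second -- assumed for illustration
clip_seconds = 60
err_tokens = 0.0003 * clip_seconds * token_rate        # expected token-count error
print(f"~{err_tokens:.2f} tokens ≈ {err_tokens / token_rate * 1000:.0f} ms drift")
# -> ~0.45 tokens ≈ 18 ms drift over a full minute of audio
```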
Links
Project repository: https://github.com/index-tts/index-tts
Paper: https://arxiv.org/abs/2506.21619
Online demo: https://huggingface.co/spaces/IndexTeam/IndexTTS-2-Demo
Model download: https://huggingface.co/IndexTeam/IndexTTS-2