IndexTTS2: Emotionally Expressive, Duration-Controlled Zero-Shot TTS

IndexTTS2 is a novel auto-regressive zero-shot text-to-speech model that achieves precise duration control and fine-grained emotional expression. It combines a universal time-encoding mechanism compatible with auto-regressive generation, decoupled modeling of voice style and emotion, and GPT-style latent features, and it outperforms state-of-the-art baselines across multiple benchmarks.

Bilibili Tech

Overview

In the context of rapidly evolving speech synthesis, IndexTTS2 is presented as a next‑generation model that improves emotional nuance and duration precision over its predecessor IndexTTS. By integrating a universal time‑encoding mechanism compatible with auto‑regressive (AR) architectures, the model enables exact control of speech length while preserving natural prosody, speaker similarity, and multimodal extensibility.

Figure 1: IndexTTS2 overall framework

Method

AR‑based Text‑to‑Semantic (T2S) module

The T2S module treats text‑to‑semantic conversion as an auto‑regressive token‑prediction task, following the training paradigm of large language models. It introduces two key innovations: a duration‑control embedding that allows users to specify the exact number of generated tokens, and an emotion‑control embedding that decouples emotional information from speaker identity.

Figure 2: Auto-regressive Text-to-Semantic

Duration control in AR

During training, a special duration embedding p is inserted into the token sequence to regulate the number of output semantic tokens. The embedding table is shared between the semantic position embedding W_{sem} and the token-count embedding W_{num}, so the encoded token count maps directly onto the desired duration. Random speed adjustments with coefficients r_1 and r_2 further improve duration-control accuracy, and the duration embedding is zeroed with 30% probability so the model also supports a free-generation mode.
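
As a rough illustration of how such a duration condition could be wired into an AR model, the PyTorch sketch below embeds the requested token count and randomly zeroes it during training. The class name, dimensions, and the way the resulting vector is prepended to the input sequence are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DurationConditioner(nn.Module):
    """Hypothetical sketch of a duration-control embedding for an AR T2S model.

    The target number of semantic tokens is embedded and prepended to the AR
    input; with 30% probability the condition is zeroed so the model also
    learns an unconstrained "free generation" mode.
    """
    def __init__(self, max_tokens: int = 4096, dim: int = 1024, drop_prob: float = 0.3):
        super().__init__()
        self.num_emb = nn.Embedding(max_tokens, dim)  # token-count embedding (W_num analogue)
        self.drop_prob = drop_prob

    def forward(self, target_len: torch.Tensor) -> torch.Tensor:
        p = self.num_emb(target_len)                  # (batch, dim)
        if self.training:
            # Zero the duration condition for a random subset of the batch.
            keep = (torch.rand(p.size(0), 1, device=p.device) > self.drop_prob).float()
            p = p * keep
        return p.unsqueeze(1)                         # (batch, 1, dim): one extra "token"

# Usage sketch (text_emb and prompt_emb are placeholders):
#   dur = DurationConditioner()(torch.tensor([220]))
#   ar_input = torch.cat([dur, prompt_emb, text_emb], dim=1)
```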

Emotion embedding

An emotion-perceiver conditioner, built on a Conformer backbone, extracts an emotion embedding e from style prompts, and a gradient-reversal layer strips speaker-dependent attributes from it. In a second training stage, the model is fine-tuned on a large-scale neutral-speech corpus while the emotion conditioner is kept frozen. Seven canonical emotions are defined, and a language-model-driven soft-command mechanism maps natural-language descriptions to emotion vectors, enabling both audio-reference and text-based emotional control.
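
The gradient-reversal trick used for this decoupling is standard and easy to sketch. Below is a minimal PyTorch version; the emotion perceiver and speaker classifier around it are hypothetical names, not the paper's modules.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient-reversal layer: identity in the forward pass, negated and
    scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch (module names are placeholders):
#   e = emotion_perceiver(style_prompt)              # Conformer-based conditioner
#   spk_logits = speaker_classifier(grad_reverse(e))
# Training the speaker classifier through the reversed gradient pushes the
# emotion embedding e to discard speaker-dependent information.
```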

S2M module

The Semantic‑to‑Mel (S2M) module adopts a flow‑matching non‑autoregressive framework. Conditional Flow Matching (CFM) learns an ODE that transforms simple Gaussian noise into a target mel‑spectrogram, conditioned on the speaker reference and the enriched semantic tokens from T2S. Randomly mixing the GPT‑style latent features H_{gpt} with the semantic tokens improves robustness and pronunciation quality.

Figure 3: Flow-matching based Semantic-to-Mel
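
For readers unfamiliar with conditional flow matching, the sketch below shows a typical training step under the common optimal-transport (straight-line) path; the network interface, shapes, and conditioning are placeholders rather than IndexTTS2's exact S2M recipe.

```python
import torch
import torch.nn.functional as F

def cfm_training_loss(mel_target, cond, vector_field_net):
    """Minimal conditional flow-matching objective (a sketch, not the paper's code).

    mel_target:       clean mel-spectrogram x1, shape (B, T, n_mels)
    cond:             conditioning features (speaker reference + semantic/GPT latents)
    vector_field_net: network predicting the velocity v_theta(x_t, t, cond)
    """
    x1 = mel_target
    x0 = torch.randn_like(x1)                           # Gaussian noise source
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)  # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                       # point on the straight-line path
    v_target = x1 - x0                                  # constant optimal-transport velocity
    v_pred = vector_field_net(x_t, t.view(-1), cond)
    return F.mse_loss(v_pred, v_target)

# At inference, integrating dx/dt = v_theta(x, t, cond) from noise at t=0 to t=1
# with an ODE solver yields the mel-spectrogram, which a neural vocoder
# (e.g. BigVGAN) converts to a waveform.
```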

Experiments

Experimental setup

55K hours of speech (30K hours Chinese, 25K hours English) were used for training, including 135 hours of emotional speech from 361 speakers. Evaluation employed four public test sets (LibriSpeech-test-clean, SeedTTS-test-zh, SeedTTS-test-en, AISHELL-1) and a custom emotional test set covering seven emotions recorded by 12 speakers.

Evaluation metrics

Objective metrics: word error rate (WER), speaker similarity (SS) computed as the cosine similarity of speaker embeddings, and emotion similarity (ES) based on emotion2vec. Subjective metrics: multi-dimensional MOS scores on a 1-5 scale covering speaker similarity (SMOS), prosody (PMOS), quality (QMOS), and emotion (EMOS).
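
As a concrete example, the SS metric reduces to a cosine similarity between two speaker-embedding vectors; a minimal sketch is below, with the embedding extractor (e.g. a pretrained speaker-verification model) assumed to exist outside the snippet.

```python
import torch
import torch.nn.functional as F

def speaker_similarity(emb_ref: torch.Tensor, emb_syn: torch.Tensor) -> float:
    """Speaker similarity (SS): cosine similarity between the speaker embedding
    of the reference prompt and that of the synthesized utterance."""
    return F.cosine_similarity(emb_ref.unsqueeze(0), emb_syn.unsqueeze(0)).item()

# Usage sketch: emb_ref = spk_encoder(prompt_wav); emb_syn = spk_encoder(tts_wav)
# speaker_similarity(emb_ref, emb_syn) -> value in [-1, 1], higher is better.
```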

Results

IndexTTS2 achieved state‑of‑the‑art objective scores on all test sets except AISHELL‑1, where the gap was marginal. Subjectively, it matched or exceeded baselines on SMOS, PMOS, QMOS, and EMOS, demonstrating superior emotional fidelity while maintaining low WER (1.883%).

Table 1: Results on public test sets

On the dedicated emotion test set, IndexTTS2 attained an emotion similarity of 0.887 and an EMOS of 4.22, substantially outperforming competing zero‑shot TTS systems.

Table 2: Emotion test set results

Duration-control experiments showed token-count error rates below 0.03% across scaling factors from 0.75× to 1.25× the original length, confirming precise timing control.

Table 3: Token count error rates for duration control
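
Assuming the metric is the relative deviation between the requested and generated numbers of semantic tokens (the paper's exact definition may differ), it can be computed as in this small sketch:

```python
def token_count_error(requested: int, generated: int) -> float:
    """Relative token-count error: |generated - requested| / requested."""
    return abs(generated - requested) / requested

# Example: generating 999 tokens when 1000 were requested gives an error of 0.1%;
# an error rate below 0.03% therefore means near-exact token-level duration control.
print(f"{token_count_error(1000, 999):.2%}")  # -> 0.10%
```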

Ablation studies demonstrated that removing GPT latent features or replacing the S2M module with a discrete‑token S2A module degrades both objective and subjective scores, highlighting the importance of the proposed components.

Table 4: Ablation results

Conclusion

IndexTTS2 presents a zero‑shot text‑to‑speech system that combines a novel time‑encoding scheme for exact duration control, decoupled emotion‑speaker modeling, and GPT‑style latent representations to deliver high‑quality, emotionally expressive speech. The model advances practical applications such as AI dubbing, audiobooks, dynamic animation, and video translation.

References

S. Lee et al., “BigVGAN: A universal neural vocoder with large‑scale training,” ICLR 2023.

W. Deng et al., “IndexTTS: An industrial‑level controllable and efficient zero‑shot TTS system,” arXiv preprint, 2025.

M. V. Koroteev, “BERT: a review of applications in NLP and understanding,” arXiv, 2021.

H. He et al., “Emilia: An extensive, multilingual, and diverse speech dataset for large‑scale speech generation,” IEEE SLT 2024.

K. Zhou et al., “Seen and unseen emotional style transfer for voice conversion,” ICASSP 2021.

P. Anastassiou et al., “Seed‑TTS: A family of high‑quality versatile speech generation models,” arXiv, 2024.

T. Guo et al., “DiDiSpeech: A large‑scale Mandarin speech corpus,” ICASSP 2021.

V. Panayotov et al., “LibriSpeech: an ASR corpus based on public domain audio books,” ICASSP 2015.

H. Bu et al., “AISHELL‑1: An open‑source Mandarin speech corpus and a speech recognition baseline,” O‑COCOSDA 2017.

Z. Gao et al., “FunASR: A fundamental end‑to‑end speech recognition toolkit,” Interspeech 2023.

A. Radford et al., “Robust speech recognition via large‑scale weak supervision,” ICML 2023.

Z. Ma et al., “Emotion2Vec: Self‑supervised pre‑training for speech emotion representation,” ACL 2024.

Y. Wang et al., “MaskGCT: Zero‑shot TTS with masked generative codec transformer,” arXiv, 2024.

Y. Chen et al., “F5‑TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv, 2024.

Z. Du et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv, 2024.

X. Wang et al., “Spark‑TTS: An efficient LLM‑based TTS model with single‑stream decoupled speech tokens,” arXiv, 2025.
