FireRedTTS-2: How the New Open-Source Model Achieves Human‑Like Multi‑Speaker Dialogue Synthesis
FireRedTTS-2, the latest open‑source dialogue TTS model from Xiaohongshu’s audio team, upgrades its speech tokenizer and text‑to‑speech architecture to enable low‑latency, per‑sentence generation, robust multi‑speaker switching, and natural prosody across multiple languages, outperforming rivals in both objective and subjective tests.
Xiaohongshu’s audio team recently released FireRedTTS-2, a next‑generation dialogue synthesis model that addresses common pain points such as poor flexibility, frequent pronunciation errors, unstable speaker switching, and unnatural prosody. By upgrading the discrete speech encoder and the text‑to‑speech model, FireRedTTS-2 achieves industry‑leading performance in both objective and subjective evaluations.
Technical report: https://arxiv.org/pdf/2509.02020
Demo: https://fireredteam.github.io/demos/firered_tts_2
Code: https://github.com/FireRedTeam/FireRedTTS2
The model can generate podcast‑quality audio that sounds indistinguishable from real recordings. Unlike closed‑source alternatives, it supports voice cloning with just a single sample per speaker, automatically producing entire multi‑speaker dialogues. It also handles multiple languages (Chinese, English, Japanese, Korean, French) and random voice generation, making it useful for both creative exploration and large‑scale data production.
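As a concrete picture of that workflow, here is a minimal sketch in which `FireRedTTS2Stub`, `generate_dialogue`, and the prompt paths are illustrative placeholders rather than the repository's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch: FireRedTTS2Stub stands in for the real model class in
# the FireRedTTS2 repo; the actual entry points and arguments may differ.
@dataclass
class FireRedTTS2Stub:
    pretrained_dir: str

    def generate_dialogue(self, turns, prompt_wavs):
        """Stand-in for synthesis: the real model would return audio."""
        for speaker, text in turns:
            print(f"synthesize {speaker}: {text!r} (voice cloned from {prompt_wavs[speaker]})")

# One short reference clip per speaker is enough for voice cloning.
prompt_wavs = {"S1": "prompts/host.wav", "S2": "prompts/guest.wav"}

turns = [
    ("S1", "Welcome back to the show."),
    ("S2", "Thanks, it's great to be here."),
    ("S1", "Let's talk about speech tokenizers."),
]

FireRedTTS2Stub("checkpoints/fireredtts2").generate_dialogue(turns, prompt_wavs)
```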
Key Upgrades
Discrete speech encoder (speech tokenizer): 12.5 Hz low‑frame‑rate output, richer semantic information, and streaming decoding support.
Text‑to‑speech model: per‑sentence generation with stable, high‑quality output.
Discrete Speech Encoder
The encoder compresses continuous audio into a discrete token sequence at 12.5 Hz, dramatically shortening sequence length and narrowing the length gap between audio and text tokens. Semantic supervision during training makes the tokens semantically richer, and support for streaming decoding allows audio to be produced in real time.
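A back-of-envelope comparison makes the payoff concrete (the 25 Hz and 50 Hz baselines below are assumptions for illustration, not figures from the paper):

```python
# Token counts for one minute of audio at different tokenizer frame rates.
# 12.5 Hz is FireRedTTS-2's rate; 25 Hz and 50 Hz are assumed baseline rates
# for comparison, not figures taken from the paper.
clip_seconds = 60
for name, frame_rate_hz in [
    ("FireRedTTS-2 tokenizer (12.5 Hz)", 12.5),
    ("assumed 25 Hz tokenizer", 25.0),
    ("assumed 50 Hz tokenizer", 50.0),
]:
    tokens = int(clip_seconds * frame_rate_hz)
    print(f"{name}: {tokens} tokens per minute")
# -> 750, 1500, and 3000 tokens per minute, respectively
```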
Training involves ~500 k hours of diverse speech data followed by ~60 k hours of high‑quality audio for fine‑tuning.
Text‑to‑Speech Model
FireRedTTS-2 adopts a mixed text‑audio format that supports per‑sentence generation, using speaker tags such as [S1] to distinguish speakers. The architecture is a “double Transformer”: a 1.5B‑parameter backbone Transformer that models coarse audio structure and a 0.2B‑parameter decoder Transformer that fills in fine acoustic detail.
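As a toy illustration of the tagged-script convention, the sketch below parses such a script into per-speaker turns; the tag format aside, everything here is simplified and not taken from the model's actual pipeline:

```python
import re

def parse_dialogue(script: str) -> list[tuple[str, str]]:
    """Split a tagged dialogue script into (speaker, sentence) turns.

    Illustrative only: it mirrors the [S1]/[S2] tag convention described
    above, not the model's actual tokenization or training pipeline.
    """
    turns = []
    # Capture a speaker tag, then text up to the next tag (or end of script).
    for match in re.finditer(r"\[(S\d+)\]\s*(.*?)(?=\[S\d+\]|$)", script, re.S):
        speaker, text = match.group(1), match.group(2).strip()
        if text:
            turns.append((speaker, text))
    return turns

script = "[S1] Welcome back to the show. [S2] Glad to be here. [S1] Let's dive in."
for speaker, sentence in parse_dialogue(script):
    print(speaker, "->", sentence)
# S1 -> Welcome back to the show.
# S2 -> Glad to be here.
# S1 -> Let's dive in.
```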
This double‑Transformer design fully exploits both text and audio context, producing more natural and coherent dialogue speech; combined with the encoder’s streaming decoding, it also keeps first‑packet latency low.
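A minimal sketch of what per-sentence streaming buys, assuming a hypothetical `synthesize()` stub in place of the real model:

```python
import time

# Minimal sketch of per-sentence streaming, assuming a synthesize() function
# that produces audio for one tagged sentence (stubbed here with a sleep).
def synthesize(sentence: str) -> bytes:
    time.sleep(0.1)               # stand-in for model inference time
    return b"\x00" * 3200         # stand-in for a decoded audio chunk

def stream_dialogue(sentences):
    # Each sentence is synthesized and yielded as soon as it is ready, so
    # playback can start after the first sentence instead of waiting for the
    # whole dialogue -- this is what keeps first-packet latency low.
    for sentence in sentences:
        yield synthesize(sentence)

start = time.perf_counter()
for i, chunk in enumerate(stream_dialogue(["[S1] Hi there.", "[S2] Hello!"])):
    print(f"chunk {i}: {len(chunk)} bytes at t={time.perf_counter() - start:.2f}s")
```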
Training proceeds in two stages: pre‑training on 1.1 M hours of single‑sentence speech, then fine‑tuning on 300 k hours of multi‑speaker dialogue (2–4 speakers). The result is stable, high‑quality synthesis with reliable speaker switching, and new voice styles can be adapted with only a small amount of data.
Evaluation
FireRedTTS-2 was benchmarked against MoonCast, ZipVoice‑Dialogue, and MOSS‑TTSD on a bilingual dialogue test set. Objective metrics (CER/WER, speaker similarity, MCD) and subjective preference scores (CMOS) both show FireRedTTS-2 in the lead, with markedly fewer pronunciation errors, less speaker confusion, and more natural prosody.
Fine‑tuning with only ~50 hours of target‑speaker audio yields a CER of 1.66 %, and 56 % of generated samples were judged as natural as, or more natural than, real recordings.
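For reference, CER is character-level edit distance divided by reference length; here is a generic sketch (the evaluation's exact text normalization is an open detail):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length.

    A generic metric sketch; the paper's exact text normalization
    (punctuation, casing, etc.) is not reproduced here.
    """
    hyp = list(hypothesis)
    # Single-row dynamic-programming Levenshtein distance.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(reference, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,         # deletion
                dist[j - 1] + 1,     # insertion
                prev + (r != h),     # substitution (free if chars match)
            )
    return dist[-1] / max(len(reference), 1)

print(f"{cer('the cat sat', 'the cat sit'):.3f}")  # -> 0.091
```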
Conclusion
Discrete speech encoder: low‑frame‑rate, semantically rich, streaming‑ready.
Text‑to‑speech model: mixed text‑audio input, per‑sentence generation, double‑Transformer architecture, low latency.
Outperforms existing systems on all evaluated metrics, offering an industrial‑grade solution for AI‑driven podcast and dialogue synthesis.