FireRedTTS-2: How the New Open-Source Model Achieves Human‑Like Multi‑Speaker Dialogue Synthesis
FireRedTTS-2, the latest open‑source dialogue TTS model from Xiaohongshu’s audio team, upgrades its speech tokenizer and text‑to‑speech architecture to enable low‑latency, per‑sentence generation, robust multi‑speaker switching, and natural prosody across multiple languages, outperforming rivals in both objective and subjective tests.
Xiaohongshu’s audio team recently released FireRedTTS-2, a next‑generation dialogue synthesis model that addresses common pain points such as poor flexibility, frequent pronunciation errors, unstable speaker switching, and unnatural prosody. By upgrading the discrete speech encoder and the text‑to‑speech model, FireRedTTS-2 achieves industry‑leading performance in both objective and subjective evaluations.
Technical report: https://arxiv.org/pdf/2509.02020
Demo: https://fireredteam.github.io/demos/firered_tts_2
Code: https://github.com/FireRedTeam/FireRedTTS2
The model can generate podcast‑quality audio that sounds indistinguishable from real recordings. Unlike closed‑source alternatives, it supports voice cloning with just a single sample per speaker, automatically producing entire multi‑speaker dialogues. It also handles multiple languages (Chinese, English, Japanese, Korean, French) and random voice generation, making it useful for both creative exploration and large‑scale data production.
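As a concrete picture of that workflow, here is a minimal sketch in which `FireRedTTS2Stub`, `generate_dialogue`, and the prompt paths are illustrative placeholders rather than the repository's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch: FireRedTTS2Stub stands in for the real model class in
# the FireRedTTS2 repo; the actual entry points and arguments may differ.
@dataclass
class FireRedTTS2Stub:
    pretrained_dir: str

    def generate_dialogue(self, turns, prompt_wavs):
        """Stand-in for synthesis: the real model would return audio."""
        for speaker, text in turns:
            print(f"synthesize {speaker}: {text!r} (voice cloned from {prompt_wavs[speaker]})")

# One short reference clip per speaker is enough for voice cloning.
prompt_wavs = {"S1": "prompts/host.wav", "S2": "prompts/guest.wav"}

turns = [
    ("S1", "Welcome back to the show."),
    ("S2", "Thanks, it's great to be here."),
    ("S1", "Let's talk about speech tokenizers."),
]

FireRedTTS2Stub("checkpoints/fireredtts2").generate_dialogue(turns, prompt_wavs)
```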
Key Upgrades
Discrete speech encoder (speech tokenizer): 12.5 Hz low‑frame‑rate output, richer semantic information, and streaming decoding support.
Text‑to‑speech model: per‑sentence generation with stable, high‑quality output.
Discrete Speech Encoder
The encoder compresses continuous audio into a discrete token sequence at 12.5 Hz, dramatically shortening sequence length and narrowing the length gap between audio and text tokens. Semantic supervision during training makes the tokens semantically richer, and support for streaming decoding allows audio to be produced in real time.
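A back-of-envelope comparison makes the payoff concrete (the 25 Hz and 50 Hz baselines below are assumptions for illustration, not figures from the paper):

```python
# Token counts for one minute of audio at different tokenizer frame rates.
# 12.5 Hz is FireRedTTS-2's rate; 25 Hz and 50 Hz are assumed baseline rates
# for comparison, not figures taken from the paper.
clip_seconds = 60
for name, frame_rate_hz in [
    ("FireRedTTS-2 tokenizer (12.5 Hz)", 12.5),
    ("assumed 25 Hz tokenizer", 25.0),
    ("assumed 50 Hz tokenizer", 50.0),
]:
    tokens = int(clip_seconds * frame_rate_hz)
    print(f"{name}: {tokens} tokens per minute")
# -> 750, 1500, and 3000 tokens per minute, respectively
```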
Training involves ~500 k hours of diverse speech data followed by ~60 k hours of high‑quality audio for fine‑tuning.
Text‑to‑Speech Model
FireRedTTS-2 adopts a mixed text‑audio format that supports per‑sentence generation, using speaker tags such as [S1] to distinguish speakers. The architecture is a “double Transformer”: a 1.5B‑parameter backbone Transformer that models coarse audio structure and a 0.2B‑parameter decoder Transformer that fills in fine acoustic detail.
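As a toy illustration of the tagged-script convention, the sketch below parses such a script into per-speaker turns; the tag format aside, everything here is simplified and not taken from the model's actual pipeline:

```python
import re

def parse_dialogue(script: str) -> list[tuple[str, str]]:
    """Split a tagged dialogue script into (speaker, sentence) turns.

    Illustrative only: it mirrors the [S1]/[S2] tag convention described
    above, not the model's actual tokenization or training pipeline.
    """
    turns = []
    # Capture a speaker tag, then text up to the next tag (or end of script).
    for match in re.finditer(r"\[(S\d+)\]\s*(.*?)(?=\[S\d+\]|$)", script, re.S):
        speaker, text = match.group(1), match.group(2).strip()
        if text:
            turns.append((speaker, text))
    return turns

script = "[S1] Welcome back to the show. [S2] Glad to be here. [S1] Let's dive in."
for speaker, sentence in parse_dialogue(script):
    print(speaker, "->", sentence)
# S1 -> Welcome back to the show.
# S2 -> Glad to be here.
# S1 -> Let's dive in.
```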
This double‑Transformer design fully exploits both text and audio context, producing more natural and coherent dialogue speech; combined with the encoder’s streaming decoding, it also keeps first‑packet latency low.
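A minimal sketch of what per-sentence streaming buys, assuming a hypothetical `synthesize()` stub in place of the real model:

```python
import time

# Minimal sketch of per-sentence streaming, assuming a synthesize() function
# that produces audio for one tagged sentence (stubbed here with a sleep).
def synthesize(sentence: str) -> bytes:
    time.sleep(0.1)               # stand-in for model inference time
    return b"\x00" * 3200         # stand-in for a decoded audio chunk

def stream_dialogue(sentences):
    # Each sentence is synthesized and yielded as soon as it is ready, so
    # playback can start after the first sentence instead of waiting for the
    # whole dialogue -- this is what keeps first-packet latency low.
    for sentence in sentences:
        yield synthesize(sentence)

start = time.perf_counter()
for i, chunk in enumerate(stream_dialogue(["[S1] Hi there.", "[S2] Hello!"])):
    print(f"chunk {i}: {len(chunk)} bytes at t={time.perf_counter() - start:.2f}s")
```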
Training proceeds in two stages: pre‑training on 1.1 M hours of single‑sentence speech, then fine‑tuning on 300 k hours of multi‑speaker dialogue (2–4 speakers). The result is stable, high‑quality synthesis with reliable speaker switching, and new voice styles can be adapted with only a small amount of data.
Evaluation
FireRedTTS-2 was benchmarked against MoonCast, ZipVoice‑Dialogue, and MOSS‑TTSD on a bilingual dialogue test set. Objective metrics (CER/WER, speaker similarity, MCD) and subjective preference scores (CMOS) both show FireRedTTS-2 in the lead, with markedly fewer pronunciation errors, less speaker confusion, and more natural prosody.
Fine‑tuning with only ~50 hours of target‑speaker audio yields a CER of 1.66 %, and 56 % of generated samples were judged as natural as, or more natural than, real recordings.
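For reference, CER is character-level edit distance divided by reference length; here is a generic sketch (the evaluation's exact text normalization is an open detail):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length.

    A generic metric sketch; the paper's exact text normalization
    (punctuation, casing, etc.) is not reproduced here.
    """
    hyp = list(hypothesis)
    # Single-row dynamic-programming Levenshtein distance.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(reference, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,         # deletion
                dist[j - 1] + 1,     # insertion
                prev + (r != h),     # substitution (free if chars match)
            )
    return dist[-1] / max(len(reference), 1)

print(f"{cer('the cat sat', 'the cat sit'):.3f}")  # -> 0.091
```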
Conclusion
Discrete speech encoder: low‑frame‑rate, semantically rich, streaming‑ready.
Text‑to‑speech model: mixed text‑audio input, per‑sentence generation, double‑Transformer architecture, low latency.
Outperforms existing systems on all evaluated metrics, offering an industrial‑grade solution for AI‑driven podcast and dialogue synthesis.