How We Built a Live‑Streaming TTS Engine: From Data Pipelines to AI Voice Generation
This article summarizes our experience building an intelligent digital-human system across six core modules: LLM content generation, LLM interaction, TTS synthesis, visual driving, audio-video engineering, and backend services. It details data collection, signal processing, ASR annotation, speaker clustering, model optimization (V1-V4), evaluation metrics, and future research directions.
Introduction
This series shares part of our practice in creating intelligent digital humans. We cover six core components: LLM-driven content generation gives the digital human a "brain"; LLM interaction focuses on dialogue logic and human-like communication, the key to natural interaction; TTS (text-to-speech) converts text into emotionally rich, personalized voice; visual driving synchronizes voice with facial expressions, lip movements, and body actions to create a realistic avatar; audio-video engineering solves real-time rendering, low-latency transmission, and high-quality video output; and the backend provides a stable, elastic, high-concurrency platform so digital-human services run efficiently and reliably.
Data Processing Pipeline
Voice signal processing: normalize audio, separate voice from background music, apply VAD, remove noise, and truncate pauses, while filtering low‑quality audio based on quality scores.
Text annotation (speech understanding): use ASR to transcribe audio, correct punctuation, add prosody markers, and filter low‑confidence transcripts.
Speaker clustering: extract speaker embeddings, cluster them using unsupervised methods, filter short segments, and select high‑quality data for each speaker.
Voice Signal Processing
Live‑stream audio often contains background music, multiple speakers, and noise, which affect model training. We use UVR‑MDXNET for voice‑music separation, Resemble Enhance for further denoising, and speaker diarization to distinguish speakers. After separation, we apply VAD and fine‑grained silence detection to cut overly long pauses, then filter segments by duration and quality scores.
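The pause-truncation step can be illustrated with a minimal energy-based sketch. The production pipeline uses UVR-MDXNET for separation and a trained VAD; this toy version only thresholds frame energy, and the function name `trim_long_pauses` and all thresholds are illustrative.

```python
# Minimal energy-based pause truncation, illustrating the VAD +
# silence-trimming step. A real VAD is model-based; this sketch
# simply drops silent frames beyond a bounded run length.

def frame_energy(samples, frame_len):
    """Yield (start_index, mean squared energy) per frame."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        yield start, sum(x * x for x in frame) / frame_len

def trim_long_pauses(samples, frame_len=160, energy_thresh=1e-4,
                     max_pause_frames=25):
    """Keep speech frames; cap consecutive silent frames at max_pause_frames."""
    out, silent_run = [], 0
    for start, energy in frame_energy(samples, frame_len):
        frame = samples[start:start + frame_len]
        if energy < energy_thresh:
            silent_run += 1
            if silent_run <= max_pause_frames:
                out.extend(frame)   # keep a bounded amount of pause
        else:
            silent_run = 0
            out.extend(frame)
    return out
```

At a 16 kHz sample rate, `frame_len=160` corresponds to 10 ms frames, so `max_pause_frames=25` caps pauses at roughly a quarter second.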
Text Annotation
ASR provides the textual input for TTS. To improve punctuation accuracy, we align punctuation with actual speech pauses and correct over‑ or under‑punctuated marks. We also enhance ASR with domain‑specific hot‑words (e.g., product names, units) and use a two‑stage model (Seaco‑Paraformer and Whisper‑large‑v3‑turbo) with cross‑validation to boost confidence and reduce errors.
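The cross-validation idea can be sketched as an agreement check between the two models' hypotheses: when two independent systems (here, Seaco-Paraformer and Whisper-large-v3-turbo) transcribe a segment nearly identically, confidence is high; otherwise the segment is filtered. The normalized-distance rule and the 0.1 threshold below are assumptions, not the article's exact criterion.

```python
# Cross-validate two ASR hypotheses: keep a segment only when two
# independent models agree closely (measured by character-level
# Levenshtein distance normalized by the longer hypothesis).

def edit_distance(a, b):
    """Character-level Levenshtein distance via rolling DP rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def keep_transcript(hyp_a, hyp_b, max_disagreement=0.1):
    """True if the two models' outputs nearly agree."""
    if not hyp_a and not hyp_b:
        return True
    dist = edit_distance(hyp_a, hyp_b)
    return dist / max(len(hyp_a), len(hyp_b)) <= max_disagreement
```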
Speaker Clustering
Historical live‑stream recordings often contain multiple hosts. We extract voice embeddings, compute cosine similarity, and perform unsupervised clustering to ensure each cluster corresponds to a single voice. Short‑duration clusters are discarded, and high‑quality segments are selected for training distinct speaker models.
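A minimal sketch of cosine-similarity clustering: each embedding joins the first cluster whose centroid is similar enough, otherwise it starts a new one. The production pipeline uses learned voice embeddings and a proper unsupervised clusterer; the greedy assignment, running-mean centroid update, and 0.8 threshold here are illustrative.

```python
# Greedy threshold clustering of speaker embeddings by cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cluster_speakers(embeddings, threshold=0.8):
    """Return a cluster id per embedding."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            centroids.append(list(emb))        # start a new cluster
            labels.append(len(centroids) - 1)
        else:
            # running-mean update of the matched centroid
            centroids[best] = [(a + b) / 2 for a, b in zip(centroids[best], emb)]
            labels.append(best)
    return labels
```

Clusters whose total audio duration falls below a cutoff would then be discarded, as described above.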
TTS Front‑End and Back‑End
The TTS system consists of a front‑end (text normalization, phoneme conversion) and a back‑end (acoustic model and vocoder). The front‑end handles abbreviation expansion, number conversion, and special token handling. The back‑end predicts mel‑spectrograms from intermediate representations and synthesizes waveforms with a vocoder.
Front‑End Optimization
We combine rule‑based processing with LLM rewriting to handle numbers, model numbers, and brand names. Rules handle deterministic cases (e.g., units, symbols), while LLMs rewrite ambiguous cases (e.g., "58W" → "fifty‑eight watts").
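The deterministic rule path can be sketched as a regex pass over number+unit tokens. The unit map and helper names are illustrative; only the "58W" → "fifty-eight watts" behavior comes from the article, and ambiguous tokens (model numbers, brand names) would fall through to the LLM rewriter instead.

```python
# Rule-based normalization for deterministic number+unit tokens.
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six",
        "seven", "eight", "nine", "ten", "eleven", "twelve",
        "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
        "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
UNITS = {"W": "watts", "V": "volts", "kg": "kilograms", "mm": "millimeters"}

def number_to_words(n):
    """Spell out 0..99 (enough for this sketch)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize_units(text):
    """Expand '<number><unit>' tokens, e.g. '58W' -> 'fifty-eight watts'."""
    pattern = re.compile(r"\b(\d{1,2})(" + "|".join(UNITS) + r")\b")
    return pattern.sub(
        lambda m: number_to_words(int(m.group(1))) + " " + UNITS[m.group(2)],
        text)
```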
Back‑End Optimization
Version V1 uses a two‑stage architecture with discrete audio tokens (e.g., Encodec, HuBERT) and a language model to predict token sequences. V2 improves pronunciation accuracy by refining ASR, adding multilingual data, and training a base model on large live‑stream datasets. V3 adds prosody and emotion modeling by tagging pauses ("@") and elongations ("→") and using reference audio for pitch variation. V4 integrates the CosyVoice 2.0 architecture, leveraging a larger Qwen2.5‑0.5B backbone, enhanced tokenizer, and multi‑phoneme features to further improve pronunciation, especially for polyphones and rare characters.
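The V3 prosody tags can be sketched as a post-processing pass over aligned word timings: mark words followed by long pauses with "@" and stretched words with "→". Only the tag symbols come from the article; the input format `(word, start_s, end_s)` and both thresholds are assumptions.

```python
# Sketch of V3-style prosody tagging from forced-alignment timestamps.

def tag_prosody(words, pause_thresh=0.3, stretch_rate=0.5):
    """words: list of (text, start_s, end_s). Returns a tagged string.
    stretch_rate: seconds per character above which a word counts
    as elongated."""
    out = []
    for i, (text, start, end) in enumerate(words):
        token = text
        if (end - start) / max(len(text), 1) > stretch_rate:
            token += "→"                      # elongation
        if i + 1 < len(words) and words[i + 1][1] - end > pause_thresh:
            token += "@"                      # long pause follows
        out.append(token)
    return " ".join(out)
```

Tagged text like this would then be paired with reference audio during training so the model learns to realize the marked pauses and elongations.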
Evaluation Metrics
We report character error rate (CER), similarity, and DNS‑MOS for several model versions. Across V1‑V4, CER decreased from 0.0542 to 0.0228, similarity increased from 0.8195 to 0.9284, and DNS‑MOS improved from 3.2209 to 3.3626, demonstrating progressive gains in accuracy and audio quality.
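CER here follows the standard definition: the character-level edit distance between reference and hypothesis, divided by the reference length. The evaluation set itself is not public, so the strings in the usage example are made up.

```python
# Character error rate: Levenshtein distance over reference length.

def cer(reference, hypothesis):
    ref, hyp = list(reference), list(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1] / len(ref)
```

For example, `cer("hello", "hallo")` is 0.2: one substitution over a five-character reference.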
Audio Examples
Examples include generic prosody, speaker‑specific prosody, and emotion‑enhanced synthesis. The tables show before/after text, generated audio, and metric improvements.
Future Outlook
• Use pretrained LLMs and reinforcement learning to further improve prosody replication.
• Develop end‑to‑end speech understanding and generation models.
• Decouple prosody and timbre features for more flexible control.
• Explore new scenarios such as background music, dialects, and multilingual synthesis.
Team Introduction
The authors (Pingjiang, Longyu, Cangting) belong to the Live‑AIGC team of Taobao Live. The team focuses on AI‑native technologies for e‑commerce live streaming, covering large language models, multimodal understanding, speech synthesis, digital‑human rendering, AI engineering, and audio‑video processing. Their solutions have been commercialized, serving thousands of merchants.