Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations
This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.
Overview
We present a comprehensive practice summary of building intelligent digital humans for live streaming, focusing on six core components: LLM‑driven content generation, interactive dialogue, Text‑to‑Speech (TTS), visual driving, real‑time audio‑video rendering, and a stable backend service platform.
Data Processing Pipeline
To construct high‑quality training data from massive live‑stream recordings, we designed a three‑stage pipeline:
1. Audio signal processing – normalization, voice separation, denoising, VAD, and pause trimming.
2. Text annotation – ASR transcription, punctuation restoration, and rhythm labeling.
3. Speaker clustering – unsupervised embedding clustering to isolate distinct voice timbres.
The pipeline progressively filters low‑quality segments using DNS‑MOS scores, duration thresholds, and confidence metrics, yielding a clean corpus for TTS training.
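The gating logic can be sketched as a simple conjunction of per-segment checks. The threshold values below are illustrative placeholders, not the team's production settings, and the metadata fields (`dnsmos`, `duration`, `asr_confidence`) are assumed to have been computed by the upstream stages:

```python
def passes_quality_gate(segment,
                        min_dnsmos=3.0,
                        min_duration=1.0,
                        max_duration=20.0,
                        min_confidence=0.85):
    """Return True if a segment survives all three filters.

    Thresholds are illustrative; real values are tuned per corpus.
    """
    return (segment["dnsmos"] >= min_dnsmos
            and min_duration <= segment["duration"] <= max_duration
            and segment["asr_confidence"] >= min_confidence)

segments = [
    {"dnsmos": 3.4, "duration": 5.2, "asr_confidence": 0.93},
    {"dnsmos": 2.1, "duration": 4.0, "asr_confidence": 0.95},  # too noisy: dropped
    {"dnsmos": 3.6, "duration": 0.4, "asr_confidence": 0.90},  # too short: dropped
]
clean = [s for s in segments if passes_quality_gate(s)]
```

Because each filter is independent, segments can be scored once and re-gated cheaply whenever thresholds are retuned.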
Signal Processing Details
Normalization aligns sampling rates and loudness across diverse recordings. Voice‑separation (UVR_MDXNET) and Resemble Enhance remove background music and noise. VAD and fine‑grained silence detection prevent over‑long pauses, while DNS‑MOS filtering discards low‑quality audio.
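As a minimal sketch of the pause-trimming step, the energy-based routine below caps any silent stretch at a maximum duration. It stands in for the VAD plus fine-grained silence detection described above; the frame size, the −40 dBFS threshold, and the 0.5 s cap are assumptions for the example, not production parameters:

```python
import math

def trim_long_pauses(samples, sample_rate, frame_ms=30,
                     silence_db=-40.0, max_pause_s=0.5):
    """Collapse silent stretches longer than max_pause_s down to max_pause_s.

    Frames whose RMS level falls below silence_db are treated as silence.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_silent_frames = max(1, int(max_pause_s * 1000 / frame_ms))
    out, silent_run = [], []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        db = 20 * math.log10(rms) if rms > 0 else -120.0
        if db < silence_db:
            silent_run.extend(frame)
        else:
            # Keep at most max_pause_s worth of the preceding silence.
            out.extend(silent_run[:max_silent_frames * frame_len])
            silent_run = []
            out.extend(frame)
    return out

# 1 s tone, 2 s silence, 1 s tone at 8 kHz: the middle pause shrinks.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
audio = tone + [0.0] * (2 * sr) + tone
trimmed = trim_long_pauses(audio, sr)
```

Production systems would use a trained VAD rather than a fixed energy threshold, but the shape of the operation (detect silent frames, cap the run length) is the same.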
ASR and Text Normalization
Automatic speech recognition (Seaco‑Paraformer and Whisper‑large‑v3‑turbo) provides initial transcripts. We then apply rule‑based and LLM‑based regularization to handle numbers, units, brand names, and special symbols, followed by punctuation repair using audio‑energy cues.
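The rule-based half of the regularization can be illustrated with the unit-handling case. The digit-to-Chinese conversion below covers only the simple readings needed for the example (no "两" substitution or elision rules), so it is a sketch of the idea rather than a full normalizer:

```python
import re

DIGITS = "零一二三四五六七八九"

def number_to_zh(n):
    """Convert 0-9999 to a Chinese reading (simplified rules)."""
    units = ["", "十", "百", "千"]
    if n == 0:
        return DIGITS[0]
    parts, s = [], str(n)
    for pos, ch in enumerate(s):
        d = int(ch)
        unit = units[len(s) - pos - 1]
        parts.append(DIGITS[0] if d == 0 else DIGITS[d] + unit)
    # Collapse runs of 零 and strip any trailing 零.
    return re.sub("零+", "零", "".join(parts)).rstrip("零")

def normalize_units(text):
    """Expand '<digits>mAh' into its spoken Chinese form."""
    return re.sub(r"(\d+)mAh",
                  lambda m: number_to_zh(int(m.group(1))) + "毫安时",
                  text)

normalize_units("电池容量5800mAh")  # → "电池容量五千八百毫安时"
```

Cases the rules cannot cover deterministically (ambiguous brand names, mixed symbols) are the ones handed off to LLM rewriting.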
Speaker Clustering
Embedding‑based cosine similarity clustering groups utterances by speaker identity. Short‑duration clusters are removed, and high‑quality segments are selected for each speaker to build personalized voice bases.
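A minimal sketch of the clustering step, assuming each utterance already has a speaker embedding from an upstream encoder. The greedy single-pass strategy and the 0.8 similarity threshold are illustrative; production pipelines typically use trained speaker encoders and more robust clustering:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_by_speaker(embeddings, threshold=0.8):
    """Join the first cluster whose centroid is similar enough,
    else start a new cluster."""
    clusters = []  # each: {"centroid": vec, "members": [utterance indices]}
    for idx, emb in enumerate(embeddings):
        for c in clusters:
            if cosine(emb, c["centroid"]) >= threshold:
                c["members"].append(idx)
                n = len(c["members"])
                # Running mean keeps the centroid up to date incrementally.
                c["centroid"] = [(v * (n - 1) + e) / n
                                 for v, e in zip(c["centroid"], emb)]
                break
        else:
            clusters.append({"centroid": list(emb), "members": [idx]})
    return clusters

embs = [[1.0, 0.0], [0.98, 0.1], [0.0, 1.0]]
clusters = cluster_by_speaker(embs)  # two speakers: {0, 1} and {2}
```

The short-duration-cluster removal described above then amounts to dropping any cluster whose total member duration falls below a cutoff.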
Model Architecture
The TTS system follows a two‑stage design: a language model predicts discrete audio tokens (e.g., using Encodec or HuBERT tokenizers), and an acoustic model converts tokens to mel‑spectrograms, which a neural vocoder renders into waveforms. Recent versions incorporate VALL‑E‑style token prediction for better zero‑shot capabilities.
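The dataflow of the two-stage design can be schematized with stub models. Everything below is a toy: the token vocabulary, frame counts, and upsampling factors are placeholders, and only the chain of calls mirrors the architecture described above:

```python
def token_lm(phonemes, speaker_prompt):
    # Stage 1 stand-in: "predict" one discrete audio token per phoneme.
    # A real LM decodes autoregressively conditioned on the speaker prompt.
    return [hash((p, speaker_prompt)) % 1024 for p in phonemes]

def acoustic_model(tokens):
    # Stage 2 stand-in: map each token to 2 mel frames of 4 bins.
    return [[float(t % 10)] * 4 for t in tokens for _ in range(2)]

def vocoder(mels):
    # Vocoder stand-in: render each mel frame to 64 waveform samples.
    return [frame[0] for frame in mels for _ in range(64)]

def tts(phonemes, speaker_prompt="ref_001"):
    """End-to-end: phonemes -> tokens -> mel frames -> waveform."""
    return vocoder(acoustic_model(token_lm(phonemes, speaker_prompt)))

wave = tts(["n", "i", "h", "a", "o"])
```

The practical benefit of the split is that the token LM carries the hard sequence-modeling problem (and the zero-shot speaker conditioning), while the acoustic model and vocoder stay comparatively simple and fast.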
Front‑End Optimizations
Regularization combines rule‑based mappings (e.g., "5800mAh" → "五千八百毫安时") with LLM rewriting for complex cases.
Multi‑pronunciation handling improves G2P accuracy using 200 M open‑domain examples and 1.6 M manually annotated samples, reducing error rate from 5.81 % to 3.25 %.
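Polyphone (多音字) disambiguation can be illustrated with a context-lookup toy. The real system is a model trained on the annotated data described above; the dictionary, pinyin readings, and rules here are purely for illustration:

```python
# Context patterns that select a reading for a polyphonic character.
POLYPHONE_RULES = {
    "行": [("银行", "hang2"), ("行走", "xing2")],
}
DEFAULT_READING = {"行": "xing2"}

def g2p_char(char, context):
    """Pick a reading for `char` based on surrounding text.

    Falls back to the character's default reading when no rule matches.
    """
    for pattern, reading in POLYPHONE_RULES.get(char, []):
        if pattern in context:
            return reading
    return DEFAULT_READING.get(char, "")

g2p_char("行", "去银行取钱")  # → "hang2" (bank context)
```

A learned G2P model generalizes the same idea beyond explicit patterns, which is why scaling the training data (the 200 M open-domain plus 1.6 M annotated examples above) moves the error rate.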
Back‑End Optimizations
Version V1 uses a two‑stage encoder‑decoder with discrete token prediction (Encodec/HuBERT).
Version V2 focuses on pronunciation accuracy, integrating refined ASR and multilingual data, achieving CER 0.0380 and similarity 0.8650.
Version V3 adds rhythm and emotion modeling via explicit pause/drag‑phoneme tags and reference audio for prosody control.
Version V4 merges CosyVoice 2.0 (Qwen2.5‑0.5B backbone) with custom tokenizers and feature fusion, improving similarity to 0.9284 and DNS‑MOS to 3.3626.
Evaluation
| Model | CER ↓ | Similarity ↑ | DNS‑MOS ↑ |
|-------|--------|--------------|-----------|
| V1    | 0.0542 | 0.8195       | 3.2209    |
| V2    | 0.0380 | 0.8650       | 3.0653    |
| V3    | 0.0228 | 0.8505       | 3.2517    |
| V4    | 0.0269 | 0.9284       | 3.3626    |
Across versions, the character error rate falls from 0.0542 to the 0.02–0.03 range, and V4 attains the best similarity (0.9284) and DNS‑MOS (3.3626) at the cost of a slight CER regression relative to V3 — a trade the team accepted for markedly better voice similarity in diverse e‑commerce scenarios.
Future Work
Leverage reinforcement learning to further improve rhythm replication.
Develop end‑to‑end speech understanding‑generation models.
Decouple rhythm and timbre for finer control.
Explore BGM, dialects, and multilingual extensions.
Team Introduction
The authors (Pingjiang, Longyu, Cangting) belong to the Taobao Live AIGC team, which builds a full‑stack AI solution for live‑stream e‑commerce, covering large‑language‑model research, multimodal semantics, speech synthesis, digital‑human rendering, and production‑grade deployment.
