Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations

This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.

DaTaobao Tech

Overview

We present a comprehensive practice summary of building intelligent digital humans for live streaming, focusing on six core components: LLM‑driven content generation, interactive dialogue, Text‑to‑Speech (TTS), visual driving, real‑time audio‑video rendering, and a stable backend service platform.

Data Processing Pipeline

To construct high‑quality training data from massive live‑stream recordings, we designed a three‑stage pipeline:

Audio signal processing – normalization, voice separation, denoising, VAD, and pause trimming.

Text annotation – ASR transcription, punctuation restoration, and rhythm labeling.

Speaker clustering – unsupervised embedding clustering to isolate distinct voice timbres.

The pipeline progressively filters low‑quality segments using DNS‑MOS scores, duration thresholds, and confidence metrics, yielding a clean corpus for TTS training.
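
The filtering logic can be summarized as a per‑segment gate. The sketch below is a minimal illustration; the field names and thresholds (DNS‑MOS, duration, ASR confidence) are assumptions, as the article does not publish the production cut‑offs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    path: str              # audio file for this utterance
    dnsmos: float          # DNS-MOS quality score (~1-5)
    duration: float        # seconds
    asr_confidence: float  # ASR confidence for the transcript

def filter_segments(segments: List[Segment],
                    min_mos: float = 3.0,
                    min_dur: float = 1.0,
                    max_dur: float = 20.0,
                    min_conf: float = 0.85) -> List[Segment]:
    """Keep only segments that pass all quality gates (illustrative thresholds)."""
    kept = []
    for seg in segments:
        if seg.dnsmos < min_mos:
            continue                            # too noisy / low perceptual quality
        if not (min_dur <= seg.duration <= max_dur):
            continue                            # too short or too long for TTS training
        if seg.asr_confidence < min_conf:
            continue                            # transcript likely unreliable
        kept.append(seg)
    return kept
```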

Figure: Data processing pipeline diagram

Signal Processing Details

Normalization aligns sampling rates and loudness across diverse recordings. Voice separation (UVR_MDXNET) and speech enhancement (Resemble Enhance) remove background music and noise. VAD and fine‑grained silence detection prevent over‑long pauses, while DNS‑MOS filtering discards low‑quality audio.
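
A minimal sketch of the lighter signal‑processing steps, assuming librosa and soundfile for I/O and a simple energy‑based VAD; the heavier steps (UVR_MDXNET separation, Resemble Enhance, DNS‑MOS scoring) are not reproduced here, and the target sampling rate and loudness are illustrative values.

```python
import numpy as np
import librosa
import soundfile as sf

TARGET_SR = 24000     # assumed target sampling rate; the article does not state one
TARGET_RMS = 0.05     # assumed loudness target

def normalize_audio(in_path: str, out_path: str, max_pause: float = 0.5) -> None:
    """Resample, loudness-normalize, and trim over-long pauses in one recording."""
    y, sr = librosa.load(in_path, sr=None, mono=True)
    if sr != TARGET_SR:
        y = librosa.resample(y, orig_sr=sr, target_sr=TARGET_SR)

    # Simple RMS loudness alignment across heterogeneous recordings.
    rms = np.sqrt(np.mean(y ** 2)) + 1e-8
    y = y * (TARGET_RMS / rms)

    # Energy-based VAD: keep voiced intervals, cap silence between them.
    intervals = librosa.effects.split(y, top_db=35)
    gap = np.zeros(int(max_pause * TARGET_SR), dtype=y.dtype)
    pieces = []
    for i, (start, end) in enumerate(intervals):
        if i > 0:
            pieces.append(gap)        # replace long pauses with a fixed short gap
        pieces.append(y[start:end])
    sf.write(out_path, np.concatenate(pieces) if pieces else y, TARGET_SR)
```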

ASR and Text Normalization

Automatic speech recognition (Seaco‑Paraformer and Whisper‑large‑v3‑turbo) provides initial transcripts. We then apply rule‑based and LLM‑based regularization to handle numbers, units, brand names, and special symbols, followed by punctuation repair using audio‑energy cues.
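
The energy‑driven punctuation repair can be illustrated with word‑level ASR timestamps: the longer the silence after a token, the stronger the punctuation inserted. The gap thresholds below are assumptions, not the production values.

```python
from typing import List, Tuple

# Each token carries ASR word timing: (text, start_sec, end_sec). Timestamps are
# assumed to come from the ASR system; the thresholds below are illustrative.
def repunctuate(tokens: List[Tuple[str, float, float]],
                comma_gap: float = 0.30,
                period_gap: float = 0.70) -> str:
    out = []
    for i, (text, _start, end) in enumerate(tokens):
        out.append(text)
        if i + 1 < len(tokens):
            gap = tokens[i + 1][1] - end        # silence before the next token
            if gap >= period_gap:
                out.append("。")
            elif gap >= comma_gap:
                out.append("，")
    out.append("。")
    return "".join(out)

print(repunctuate([("今天", 0.0, 0.4), ("上新", 0.45, 0.9),
                   ("大家", 1.8, 2.2), ("看过来", 2.25, 2.9)]))
# -> 今天上新。大家看过来。
```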

Speaker Clustering

Embedding‑based cosine similarity clustering groups utterances by speaker identity. Short‑duration clusters are removed, and high‑quality segments are selected for each speaker to build personalized voice bases.
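
A minimal sketch of this step using scikit-learn's agglomerative clustering over cosine distance; the embedding extractor, distance threshold, and minimum per‑speaker duration are assumptions rather than values from the article.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(embeddings: np.ndarray,
                     durations: np.ndarray,
                     distance_threshold: float = 0.4,
                     min_cluster_seconds: float = 300.0) -> np.ndarray:
    """Group utterance embeddings by speaker; drop clusters with too little audio.

    embeddings: (N, D) speaker vectors (e.g. from an x-vector style model).
    durations:  (N,) utterance lengths in seconds.
    Returns a label per utterance; -1 marks utterances in discarded clusters.
    """
    # Cosine distance = 1 - cosine similarity on unit-normalized vectors.
    x = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    labels = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(x)

    # Remove short-duration clusters (not enough data to build a voice base).
    for label in np.unique(labels):
        if durations[labels == label].sum() < min_cluster_seconds:
            labels[labels == label] = -1
    return labels
```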

Model Architecture

The TTS system follows a two‑stage design: a language model predicts discrete audio tokens (e.g., from Encodec or HuBERT tokenizers), and an acoustic model converts the tokens to mel‑spectrograms, which a neural vocoder renders into waveforms. Recent versions incorporate VALL‑E‑style token prediction for better zero‑shot capabilities.
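
The data flow of the two‑stage design can be sketched at the shape level with stub modules; the classes below are placeholders standing in for the trained token LM, acoustic model, and vocoder, not the production implementations.

```python
import torch
import torch.nn as nn

class TokenLM(nn.Module):
    """Stage 1 stand-in: autoregressively predicts discrete audio tokens from text tokens."""
    def __init__(self, vocab: int = 1024, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, text_tokens: torch.Tensor, steps: int = 200) -> torch.Tensor:
        h, _ = self.rnn(self.embed(text_tokens))
        tok = self.head(h[:, -1:]).argmax(-1)          # greedy decoding, one token at a time
        out, state = [], None
        for _ in range(steps):
            out.append(tok)
            h, state = self.rnn(self.embed(tok), state)
            tok = self.head(h).argmax(-1)
        return torch.cat(out, dim=1)                   # (B, steps) discrete audio tokens

class AcousticModel(nn.Module):
    """Stage 2 stand-in: maps discrete tokens to an 80-bin mel-spectrogram."""
    def __init__(self, vocab: int = 1024, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, audio_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(audio_tokens))     # (B, T, 80)

class Vocoder(nn.Module):
    """Vocoder stand-in: renders mel frames to waveform samples (256 samples per frame)."""
    def __init__(self, n_mels: int = 80, hop: int = 256):
        super().__init__()
        self.up = nn.Linear(n_mels, hop)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.up(mel).flatten(1)                 # (B, T * hop) waveform

text_tokens = torch.randint(0, 1024, (1, 32))          # placeholder tokenized text
tokens = TokenLM()(text_tokens)
mel = AcousticModel()(tokens)
wav = Vocoder()(mel)
print(tokens.shape, mel.shape, wav.shape)
```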

Figure: Two‑stage TTS architecture

Front‑End Optimizations

Regularization combines rule‑based mappings (e.g., "5800mAh" → "五千八百毫安时") with LLM rewriting for complex cases.
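A minimal sketch of the rule‑based half of this step, covering only small integers and a tiny unit table (enough for the "5800mAh" case above); production rules are far broader, and genuinely ambiguous cases are handed to the LLM rewriter.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]
UNIT_WORDS = {"mAh": "毫安时", "mm": "毫米", "kg": "千克"}  # assumed, abbreviated rule table

def int_to_zh(n: int) -> str:
    """Read a positive integer below 10000 in Chinese (enough for this illustration)."""
    assert 0 < n < 10000
    digits = str(n)
    out, zero_pending = [], False
    for i, d in enumerate(digits):
        unit = UNITS[len(digits) - 1 - i]
        if d == "0":
            zero_pending = True
            continue
        if zero_pending and out:
            out.append("零")
        zero_pending = False
        out.append(DIGITS[int(d)] + unit)
    return "".join(out)

def normalize(text: str) -> str:
    """Rewrite number+unit patterns into readable Chinese before G2P."""
    def repl(m: re.Match) -> str:
        return int_to_zh(int(m.group(1))) + UNIT_WORDS[m.group(2)]
    pattern = r"(\d+)(" + "|".join(map(re.escape, UNIT_WORDS)) + r")"
    return re.sub(pattern, repl, text)

print(normalize("这款手机配备5800mAh电池"))  # -> 这款手机配备五千八百毫安时电池
```
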

Multi‑pronunciation (polyphone) handling improves G2P accuracy using 200M open‑domain examples and 1.6M manually annotated samples, reducing the error rate from 5.81% to 3.25%.
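
For intuition, polyphone disambiguation can be illustrated with a toy context‑rule table; the production system learns these decisions from the annotated data rather than from hand‑written rules, and the pinyin values here are for illustration only.

```python
# Minimal illustration of context-based polyphone disambiguation for 长
# (chang2 "long" vs. zhang3 "to grow"); rules and defaults are illustrative.
POLYPHONE_RULES = {
    "长": [({"长城", "长沙", "很长"}, "chang2"),
           ({"长大", "成长", "店长"}, "zhang3")],
}

def g2p_char(char: str, context: str, default: str = "chang2") -> str:
    """Pick a pronunciation for a polyphonic character from surrounding words."""
    for words, pinyin in POLYPHONE_RULES.get(char, []):
        if any(w in context for w in words):
            return pinyin
    return default

print(g2p_char("长", "这条裙子很长"))   # chang2
print(g2p_char("长", "孩子慢慢长大"))   # zhang3
```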

Back‑End Optimizations

Version V1 uses a two‑stage encoder‑decoder with discrete token prediction (Encodec/HuBERT).

Version V2 focuses on pronunciation accuracy, integrating refined ASR and multilingual data, achieving CER 0.0380 and similarity 0.8650.

Version V3 adds rhythm and emotion modeling via explicit pause/drag‑phoneme tags and reference audio for prosody control.
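
One plausible way to realize the explicit pause tags is to measure inter‑word silence in the reference recording and inject tags into the training text; the tag names and thresholds below are assumptions, not the article's actual tag set.

```python
from typing import List, Tuple

# Hypothetical tag names; the article only states that explicit pause/drag tags are used.
def tag_pauses(words: List[Tuple[str, float, float]],
               short_pause: float = 0.25,
               long_pause: float = 0.60) -> str:
    """Insert pause tags between words based on silence measured in the audio."""
    out = []
    for i, (text, _start, end) in enumerate(words):
        out.append(text)
        if i + 1 < len(words):
            gap = words[i + 1][1] - end
            if gap >= long_pause:
                out.append("<pause_long>")
            elif gap >= short_pause:
                out.append("<pause_short>")
    return "".join(out)

print(tag_pauses([("宝子们", 0.0, 0.6), ("看这里", 1.3, 1.9),
                  ("三", 2.2, 2.4), ("二", 2.5, 2.7), ("一", 2.8, 3.0)]))
# gaps of 0.7s, 0.3s, 0.1s, 0.1s -> 宝子们<pause_long>看这里<pause_short>三二一
```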

Version V4 merges CosyVoice 2.0 (Qwen2.5‑0.5B backbone) with custom tokenizers and feature fusion, improving similarity to 0.9284 and DNS‑MOS to 3.3626.

Evaluation

Model    CER       Similarity    DNS‑MOS
V1       0.0542    0.8195        3.2209
V2       0.0380    0.8650        3.0653
V3       0.0228    0.8505        3.2517
V4       0.0269    0.9284        3.3626

Relative to V1, every later version substantially lowers the character error rate, and V4 achieves the best speaker similarity and perceptual quality (DNS‑MOS), while maintaining or enhancing voice similarity across diverse e‑commerce scenarios.
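
For reference, CER and speaker similarity are typically computed as sketched below; the exact evaluation protocol and embedding model are not specified in the article, and DNS‑MOS comes from the pretrained DNS‑MOS predictor rather than being reimplemented here.

```python
import numpy as np

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance (subs + ins + dels) / reference length."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(ref), len(hyp)] / max(len(ref), 1)

def speaker_similarity(emb_ref: np.ndarray, emb_syn: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of reference and synthesized audio."""
    return float(np.dot(emb_ref, emb_syn) /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_syn) + 1e-8))

print(cer("五千八百毫安时", "五千八毫安时"))  # one deletion over 7 chars ≈ 0.143
```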

Future Work

Leverage reinforcement learning to further improve rhythm replication.

Develop end‑to‑end speech understanding‑generation models.

Decouple rhythm and timbre for finer control.

Explore BGM, dialects, and multilingual extensions.

Team Introduction

The authors (Pingjiang, Longyu, Cangting) belong to the Taobao Live AIGC team, which builds a full‑stack AI solution for live‑stream e‑commerce, covering large‑language‑model research, multimodal semantics, speech synthesis, digital‑human rendering, and production‑grade deployment.

Tags: live streaming, AI, data processing, TTS, digital human, speech synthesis