Building a High‑Quality Live‑Streaming Digital Human: TTS Pipeline, Data Processing, and Model Optimizations
This article details the end‑to‑end workflow for creating intelligent digital humans for live streaming, covering large‑language‑model‑driven content generation, multi‑stage TTS architecture, extensive audio‑signal processing, speaker clustering, front‑end text normalization, back‑end acoustic modeling, and quantitative evaluation of model improvements.
Overview
We present a comprehensive practice summary of building intelligent digital humans for live streaming, focusing on six core components: LLM‑driven content generation, interactive dialogue, Text‑to‑Speech (TTS), visual driving, real‑time audio‑video rendering, and a stable backend service platform.
Data Processing Pipeline
To construct high‑quality training data from massive live‑stream recordings, we designed a three‑stage pipeline:
1. Audio signal processing – normalization, voice separation, denoising, VAD, and pause trimming.
2. Text annotation – ASR transcription, punctuation restoration, and rhythm labeling.
3. Speaker clustering – unsupervised embedding clustering to isolate distinct voice timbres.
The pipeline progressively filters low‑quality segments using DNS‑MOS scores, duration thresholds, and confidence metrics, yielding a clean corpus for TTS training.
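The gating logic can be sketched as a simple conjunction of per-segment checks. The threshold values below are illustrative placeholders, not the team's production settings, and the metadata fields (`dnsmos`, `duration`, `asr_confidence`) are assumed to have been computed by the upstream stages:

```python
def passes_quality_gate(segment,
                        min_dnsmos=3.0,
                        min_duration=1.0,
                        max_duration=20.0,
                        min_confidence=0.85):
    """Return True if a segment survives all three filters.

    Thresholds are illustrative; real values are tuned per corpus.
    """
    return (segment["dnsmos"] >= min_dnsmos
            and min_duration <= segment["duration"] <= max_duration
            and segment["asr_confidence"] >= min_confidence)

segments = [
    {"dnsmos": 3.4, "duration": 5.2, "asr_confidence": 0.93},
    {"dnsmos": 2.1, "duration": 4.0, "asr_confidence": 0.95},  # too noisy: dropped
    {"dnsmos": 3.6, "duration": 0.4, "asr_confidence": 0.90},  # too short: dropped
]
clean = [s for s in segments if passes_quality_gate(s)]
```

Because each filter is independent, segments can be scored once and re-gated cheaply whenever thresholds are retuned.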
Signal Processing Details
Normalization aligns sampling rates and loudness across diverse recordings. Voice‑separation (UVR_MDXNET) and Resemble Enhance remove background music and noise. VAD and fine‑grained silence detection prevent over‑long pauses, while DNS‑MOS filtering discards low‑quality audio.
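As a minimal sketch of the pause-trimming step, the energy-based routine below caps any silent stretch at a maximum duration. It stands in for the VAD plus fine-grained silence detection described above; the frame size, the −40 dBFS threshold, and the 0.5 s cap are assumptions for the example, not production parameters:

```python
import math

def trim_long_pauses(samples, sample_rate, frame_ms=30,
                     silence_db=-40.0, max_pause_s=0.5):
    """Collapse silent stretches longer than max_pause_s down to max_pause_s.

    Frames whose RMS level falls below silence_db are treated as silence.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    max_silent_frames = max(1, int(max_pause_s * 1000 / frame_ms))
    out, silent_run = [], []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        db = 20 * math.log10(rms) if rms > 0 else -120.0
        if db < silence_db:
            silent_run.extend(frame)
        else:
            # Keep at most max_pause_s worth of the preceding silence.
            out.extend(silent_run[:max_silent_frames * frame_len])
            silent_run = []
            out.extend(frame)
    return out

# 1 s tone, 2 s silence, 1 s tone at 8 kHz: the middle pause shrinks.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
audio = tone + [0.0] * (2 * sr) + tone
trimmed = trim_long_pauses(audio, sr)
```

Production systems would use a trained VAD rather than a fixed energy threshold, but the shape of the operation (detect silent frames, cap the run length) is the same.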
ASR and Text Normalization
Automatic speech recognition (Seaco‑Paraformer and Whisper‑large‑v3‑turbo) provides initial transcripts. We then apply rule‑based and LLM‑based regularization to handle numbers, units, brand names, and special symbols, followed by punctuation repair using audio‑energy cues.
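The rule-based half of the regularization can be illustrated with the unit-handling case. The digit-to-Chinese conversion below covers only the simple readings needed for the example (no "两" substitution or elision rules), so it is a sketch of the idea rather than a full normalizer:

```python
import re

DIGITS = "零一二三四五六七八九"

def number_to_zh(n):
    """Convert 0-9999 to a Chinese reading (simplified rules)."""
    units = ["", "十", "百", "千"]
    if n == 0:
        return DIGITS[0]
    parts, s = [], str(n)
    for pos, ch in enumerate(s):
        d = int(ch)
        unit = units[len(s) - pos - 1]
        parts.append(DIGITS[0] if d == 0 else DIGITS[d] + unit)
    # Collapse runs of 零 and strip any trailing 零.
    return re.sub("零+", "零", "".join(parts)).rstrip("零")

def normalize_units(text):
    """Expand '<digits>mAh' into its spoken Chinese form."""
    return re.sub(r"(\d+)mAh",
                  lambda m: number_to_zh(int(m.group(1))) + "毫安时",
                  text)

normalize_units("电池容量5800mAh")  # → "电池容量五千八百毫安时"
```

Cases the rules cannot cover deterministically (ambiguous brand names, mixed symbols) are the ones handed off to LLM rewriting.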
Speaker Clustering
Embedding‑based cosine similarity clustering groups utterances by speaker identity. Short‑duration clusters are removed, and high‑quality segments are selected for each speaker to build personalized voice bases.
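A minimal sketch of the clustering step, assuming each utterance already has a speaker embedding from an upstream encoder. The greedy single-pass strategy and the 0.8 similarity threshold are illustrative; production pipelines typically use trained speaker encoders and more robust clustering:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_by_speaker(embeddings, threshold=0.8):
    """Join the first cluster whose centroid is similar enough,
    else start a new cluster."""
    clusters = []  # each: {"centroid": vec, "members": [utterance indices]}
    for idx, emb in enumerate(embeddings):
        for c in clusters:
            if cosine(emb, c["centroid"]) >= threshold:
                c["members"].append(idx)
                n = len(c["members"])
                # Running mean keeps the centroid up to date incrementally.
                c["centroid"] = [(v * (n - 1) + e) / n
                                 for v, e in zip(c["centroid"], emb)]
                break
        else:
            clusters.append({"centroid": list(emb), "members": [idx]})
    return clusters

embs = [[1.0, 0.0], [0.98, 0.1], [0.0, 1.0]]
clusters = cluster_by_speaker(embs)  # two speakers: {0, 1} and {2}
```

The short-duration-cluster removal described above then amounts to dropping any cluster whose total member duration falls below a cutoff.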
Model Architecture
The TTS system follows a two‑stage design: a language model predicts discrete audio tokens (e.g., using Encodec or HuBERT tokenizers), and an acoustic model converts tokens to mel‑spectrograms, which a neural vocoder renders into waveforms. Recent versions incorporate VALL‑E‑style token prediction for better zero‑shot capabilities.
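The dataflow of the two-stage design can be schematized with stub models. Everything below is a toy: the token vocabulary, frame counts, and upsampling factors are placeholders, and only the chain of calls mirrors the architecture described above:

```python
def token_lm(phonemes, speaker_prompt):
    # Stage 1 stand-in: "predict" one discrete audio token per phoneme.
    # A real LM decodes autoregressively conditioned on the speaker prompt.
    return [hash((p, speaker_prompt)) % 1024 for p in phonemes]

def acoustic_model(tokens):
    # Stage 2 stand-in: map each token to 2 mel frames of 4 bins.
    return [[float(t % 10)] * 4 for t in tokens for _ in range(2)]

def vocoder(mels):
    # Vocoder stand-in: render each mel frame to 64 waveform samples.
    return [frame[0] for frame in mels for _ in range(64)]

def tts(phonemes, speaker_prompt="ref_001"):
    """End-to-end: phonemes -> tokens -> mel frames -> waveform."""
    return vocoder(acoustic_model(token_lm(phonemes, speaker_prompt)))

wave = tts(["n", "i", "h", "a", "o"])
```

The practical benefit of the split is that the token LM carries the hard sequence-modeling problem (and the zero-shot speaker conditioning), while the acoustic model and vocoder stay comparatively simple and fast.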
Front‑End Optimizations
Regularization combines rule‑based mappings (e.g., "5800mAh" → "五千八百毫安时") with LLM rewriting for complex cases.
Multi‑pronunciation handling improves G2P accuracy using 200 M open‑domain examples and 1.6 M manually annotated samples, reducing error rate from 5.81 % to 3.25 %.
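Polyphone (多音字) disambiguation can be illustrated with a context-lookup toy. The real system is a model trained on the annotated data described above; the dictionary, pinyin readings, and rules here are purely for illustration:

```python
# Context patterns that select a reading for a polyphonic character.
POLYPHONE_RULES = {
    "行": [("银行", "hang2"), ("行走", "xing2")],
}
DEFAULT_READING = {"行": "xing2"}

def g2p_char(char, context):
    """Pick a reading for `char` based on surrounding text.

    Falls back to the character's default reading when no rule matches.
    """
    for pattern, reading in POLYPHONE_RULES.get(char, []):
        if pattern in context:
            return reading
    return DEFAULT_READING.get(char, "")

g2p_char("行", "去银行取钱")  # → "hang2" (bank context)
```

A learned G2P model generalizes the same idea beyond explicit patterns, which is why scaling the training data (the 200 M open-domain plus 1.6 M annotated examples above) moves the error rate.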
Back‑End Optimizations
Version V1 uses a two‑stage encoder‑decoder with discrete token prediction (Encodec/HuBERT).
Version V2 focuses on pronunciation accuracy, integrating refined ASR and multilingual data, achieving CER 0.0380 and similarity 0.8650.
Version V3 adds rhythm and emotion modeling via explicit pause/drag‑phoneme tags and reference audio for prosody control.
Version V4 merges CosyVoice 2.0 (Qwen2.5‑0.5B backbone) with custom tokenizers and feature fusion, improving similarity to 0.9284 and DNS‑MOS to 3.3626.
Evaluation
| Model | CER ↓ | Similarity ↑ | DNS‑MOS ↑ |
|-------|--------|--------------|-----------|
| V1    | 0.0542 | 0.8195       | 3.2209    |
| V2    | 0.0380 | 0.8650       | 3.0653    |
| V3    | 0.0228 | 0.8505       | 3.2517    |
| V4    | 0.0269 | 0.9284       | 3.3626    |
Across versions, the character error rate falls from 0.0542 to the 0.02–0.03 range, and V4 attains the best similarity (0.9284) and DNS‑MOS (3.3626) at the cost of a slight CER regression relative to V3 — a trade the team accepted for markedly better voice similarity in diverse e‑commerce scenarios.
Future Work
Leverage reinforcement learning to further improve rhythm replication.
Develop end‑to‑end speech understanding‑generation models.
Decouple rhythm and timbre for finer control.
Explore BGM, dialects, and multilingual extensions.
Team Introduction
The authors (Pingjiang, Longyu, Cangting) belong to the Taobao Live AIGC team, which builds a full‑stack AI solution for live‑stream e‑commerce, covering large‑language‑model research, multimodal semantics, speech synthesis, digital‑human rendering, and production‑grade deployment.
