How We Built a High‑Accuracy AI‑Powered Digital Human Script Engine for Live Commerce

This article details the end‑to‑end AI pipeline for creating intelligent digital humans in live streaming, covering LLM‑driven script generation, multimodal data integration, error‑prone number handling, DPO fine‑tuning, experimental results, and future directions for more human‑like presentations.

DaTaobao Tech

Overview

This article summarizes our practice in building intelligent digital humans, which rests on six core components: LLM‑driven script generation (the "brain"), LLM interaction logic, Text‑to‑Speech (TTS) for expressive voice, visual driving for synchronized facial expressions and gestures, audio‑video engineering for real‑time rendering, and a stable backend service platform.

Challenges in Script Generation

In live commerce, numbers, symbols, and English units must be converted into their correct spoken forms. Errors such as misreading "88VIP", which should be spoken digit by digit as "八八VIP", or mispronouncing prices lead to audience disengagement and TTS failures.

Semantic‑aware rewriting: We add a preprocessing step that provides the correct reading based on context, reducing reliance on post‑processing rules.

Removing mechanical tone: By collecting real‑host ASR data and augmenting prompts, we enhance the naturalness of generated text.
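
To make the input/output contract of the semantic‑aware rewriting step concrete, here is a minimal sketch. It is a rule‑based stand‑in, not the LLM‑based step described above: the function names and the single context rule (a number directly followed by Latin letters is read digit by digit) are assumptions made for illustration, and the cardinal reader only covers 0 to 99.

```python
import re

DIGITS = "零一二三四五六七八九"


def read_digit_by_digit(number: str) -> str:
    """Read each digit separately, e.g. '88' -> '八八'."""
    return "".join(DIGITS[int(ch)] for ch in number)


def read_as_cardinal(number: str) -> str:
    """Read an integer from 0 to 99 as a quantity, e.g. '88' -> '八十八'."""
    n = int(number)
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    head = "十" if tens == 1 else DIGITS[tens] + "十"
    return head + (DIGITS[ones] if ones else "")


# Hypothetical context rule: digits glued to Latin letters form a name.
NAME_LIKE = re.compile(r"(\d+)(?=[A-Za-z])")
PLAIN = re.compile(r"\d+")


def rewrite_numbers(text: str) -> str:
    """Replace digits with a spoken form chosen from the local context."""
    text = NAME_LIKE.sub(lambda m: read_digit_by_digit(m.group(1)), text)

    def cardinal_or_keep(m: re.Match) -> str:
        s = m.group(0)
        # Longer numbers are left untouched in this sketch.
        return read_as_cardinal(s) if len(s) <= 2 else s

    return PLAIN.sub(cardinal_or_keep, text)


print(rewrite_numbers("88VIP"))  # name-like context
print(rewrite_numbers("88元"))   # quantity context
```

A fixed rule like this breaks quickly (a unit such as "500ml" also has digits followed by letters), which illustrates why the reading has to be chosen from context rather than from surface patterns.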

Data Construction and Model Training

We created a high‑quality dataset by manually labeling difficult cases and augmenting them with GPT‑4 and DeepSeek‑R1. Since standard supervised fine‑tuning (SFT) struggled to capture subtle reading differences, we applied Direct Preference Optimization (DPO) to let the model learn nuanced distinctions, achieving 97% accuracy on the test set.
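
For readers unfamiliar with DPO, the sketch below computes the standard per‑pair DPO loss from sequence log‑probabilities. The preference pair shown, the log‑probability values, and `beta=0.1` are illustrative assumptions, not the team's data or hyperparameters; in real training the log‑probabilities come from the policy model and a frozen reference model.

```python
import math

# Hypothetical preference pair for a reading task (illustration only).
pair = {
    "prompt": "88VIP",
    "chosen": "八八VIP",     # preferred spoken form
    "rejected": "八十八VIP",  # plausible but dispreferred spoken form
}


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Policy favors the chosen reading more than the reference does: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Because the loss depends only on how much more the policy prefers the chosen output than the reference does, it directly targets the small distinctions between near‑identical readings that a plain SFT objective treats as separate, unrelated targets.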

Experiments

Four experiments were conducted:

SFT on Qwen2.5‑7B with manually labeled data (92% accuracy).

SFT with generic data augmentation (88% accuracy).

SFT with difficult‑sample augmentation (95% accuracy).

SFT + DPO on difficult samples (97% accuracy).

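The results show that generic augmentation reduced accuracy relative to the manually labeled baseline (88% vs. 92%), whereas augmentation targeted at difficult samples raised it to 95% and adding DPO raised it further to 97%.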

Multi‑Source Information Integration

Beyond product details, we incorporate user reviews (问评买, literally "ask, review, buy"), real‑time promotional information, and visual asset understanding. A knowledge graph (iGraph) stores offline data for millisecond‑level queries, while online services provide up‑to‑date discounts.
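
The split between offline data and online promotions can be sketched as follows. This is a minimal illustration under stated assumptions: a plain dictionary stands in for the offline store (it does not reflect iGraph's actual API), and `ScriptContext`, `build_context`, and `fetch_promotion` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScriptContext:
    item_id: str
    product_facts: dict
    reviews: list
    promotion: Optional[str]  # None when no live promotion could be fetched


def build_context(item_id: str,
                  offline_store: dict,
                  fetch_promotion: Callable[[str], Optional[str]]) -> ScriptContext:
    """Merge precomputed offline data with a live promotion lookup."""
    record = offline_store.get(item_id, {})
    try:
        promotion = fetch_promotion(item_id)
    except Exception:
        # Assumed policy: omit the promotion rather than let the script
        # quote a stale or guessed discount.
        promotion = None
    return ScriptContext(
        item_id=item_id,
        product_facts=record.get("facts", {}),
        reviews=record.get("reviews", []),
        promotion=promotion,
    )


store = {"sku1": {"facts": {"material": "cotton"}, "reviews": ["very soft"]}}
ctx = build_context("sku1", store, lambda _id: "10% off today")
print(ctx)
```

Keeping the promotion lookup separate from the offline record means a failure in the online service degrades the script (no discount mentioned) instead of blocking generation.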

Material Understanding

We extract text from product images using OCR, then summarize and classify the content. Challenges include inaccurate OCR summaries, unordered text, and redundant information. To filter images, we cluster them by their extracted text to remove near‑duplicates, discard images whose text is overly long or set in very small fonts, and prioritize the remaining categories by business relevance.
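
The filtering steps above can be sketched as a small pipeline. All thresholds, category names, and the `OcrImage` fields are assumptions chosen for illustration, and greedy grouping with `difflib.SequenceMatcher` stands in for whatever clustering method is actually used.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class OcrImage:
    image_id: str
    text: str           # text extracted by OCR
    min_font_px: float  # smallest detected font height
    category: str       # label assigned by the classification step


# Hypothetical thresholds and priorities (lower number = more relevant).
MAX_CHARS = 200
MIN_FONT_PX = 12.0
SIMILARITY = 0.8
CATEGORY_PRIORITY = {"selling_point": 0, "spec": 1, "other": 2}


def select_images(images: list) -> list:
    """Drop unusable images, collapse near-duplicates, sort by priority."""
    usable = [img for img in images
              if len(img.text) <= MAX_CHARS and img.min_font_px >= MIN_FONT_PX]

    # Greedy clustering: keep an image only if its text is not too similar
    # to any representative already kept.
    representatives = []
    for img in usable:
        if all(SequenceMatcher(None, img.text, rep.text).ratio() < SIMILARITY
               for rep in representatives):
            representatives.append(img)

    return sorted(
        representatives,
        key=lambda img: CATEGORY_PRIORITY.get(img.category, len(CATEGORY_PRIORITY)),
    )


images = [
    OcrImage("e", "尺码 S M L XL", 18, "spec"),
    OcrImage("a", "纯棉面料 透气舒适", 20, "selling_point"),
    OcrImage("b", "纯棉面料 透气舒适!", 20, "selling_point"),  # near-duplicate of a
    OcrImage("c", "长" * 300, 20, "other"),                    # too much text
    OcrImage("d", "小字说明", 6, "other"),                      # font too small
]
print([img.image_id for img in select_images(images)])
```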

Evaluation Metrics

We define a multi‑dimensional scoring system covering formatting, oral style, credibility, safety, richness, targeted selling points, and persuasive closing. Metrics combine rule‑based checks, LLM judgments, and statistical analyses such as distinct‑k diversity and sentence similarity.
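
As one concrete example of the statistical checks, distinct‑k is commonly computed as the number of unique k‑grams divided by the total number of k‑grams across the generated text. The sketch below follows that common definition; character‑level tokenization of Chinese text is an assumption here, and the article does not say how the team tokenizes.

```python
def distinct_k(sentences: list, k: int) -> float:
    """Ratio of unique k-grams to total k-grams over tokenized sentences."""
    if k < 1:
        raise ValueError("k must be at least 1")
    unique = set()
    total = 0
    for tokens in sentences:
        for i in range(len(tokens) - k + 1):
            unique.add(tuple(tokens[i:i + k]))
            total += 1
    return len(unique) / total if total else 0.0


# Character-level tokens (assumed); repeated phrasing lowers the score.
script = [list("这款面料很舒服"), list("这款面料很透气")]
print(distinct_k(script, 1))
print(distinct_k(script, 2))
```

A low distinct‑k value signals that a script keeps reusing the same phrases, which is one measurable proxy for the "mechanical tone" the pipeline tries to avoid.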

Future Directions

Planned work includes deeper multimodal understanding (MLLM for video), real‑time visual overlays matching script content, and richer evaluation criteria that capture human‑like storytelling and audience engagement.

Team Introduction

The authors, from the Taobao Live AIGC team, specialize in large language models, multimodal perception, speech synthesis, digital human modeling, and end‑to‑end AI deployment for e‑commerce live streaming.
