How We Built a High‑Accuracy AI‑Powered Digital Human Script Engine for Live Commerce
This article details the end‑to‑end AI pipeline for creating intelligent digital humans in live streaming, covering LLM‑driven script generation, multimodal data integration, robust handling of error‑prone numbers and units, DPO fine‑tuning, experimental results, and future directions for more human‑like presentations.
Overview
We present a comprehensive summary of our practice in building intelligent digital humans, focusing on six core components: LLM‑driven script generation (the "brain"), LLM interaction logic, Text‑to‑Speech (TTS) for expressive voice, visual driving for synchronized facial expressions and gestures, audio‑video engineering for real‑time rendering, and a stable backend service platform.
Challenges in Script Generation
In live commerce, converting numbers, symbols, and English units into correct spoken forms is critical. Errors such as reading "88VIP" digit by digit as "八十八VIP" ("eighty‑eight VIP") instead of the correct brand reading "八八VIP" ("eight‑eight VIP"), or mispronouncing prices, lead to audience disengagement and TTS failures.
Semantic‑aware rewriting: we add a preprocessing step that supplies the correct reading based on context, reducing reliance on post‑processing rules (a minimal sketch follows this list).
Removing mechanical tone: by collecting ASR transcripts of real hosts and using them to augment prompts, we make the generated text sound more natural.
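As an illustration of the rewriting step, the sketch below prompts a general‑purpose LLM to expand digits, symbols, and units into TTS‑ready spoken form. The OpenAI‑style client, model name, and prompt wording are our assumptions for the sake of a runnable example, not the production pipeline:

```python
# Minimal sketch of semantic-aware rewriting before TTS. The client, model
# name, and prompt wording are illustrative assumptions, not the production setup.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "You convert live-commerce scripts into TTS-ready spoken Chinese. "
    "Expand digits, symbols, and English units according to context: brand "
    "terms like 88VIP are read digit by digit (八八VIP), while prices are "
    "read as amounts (¥88 -> 八十八元)."
)

def rewrite_for_tts(script: str) -> str:
    """Return a spoken-form rewrite of a raw script line."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": script},
        ],
    )
    return resp.choices[0].message.content
```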
Data Construction and Model Training
We created a high‑quality dataset by manually labeling difficult cases and augmenting them with GPT‑4 and DeepSeek‑R1. Since standard supervised fine‑tuning (SFT) struggled to capture subtle reading differences, we applied Direct Preference Optimization (DPO), pairing correct readings with near‑miss errors so the model learns the nuanced distinctions, achieving 97% accuracy on the test set.
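One common way to run this kind of preference training is TRL's DPOTrainer; the sketch below is a minimal illustration under that assumption, with placeholder preference pairs and hyperparameters rather than our actual training configuration:

```python
# Minimal DPO sketch with the TRL library; dataset rows and hyperparameters
# are placeholders, not the team's actual configuration.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # base model family used in the experiments
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: "chosen" is the correct reading, "rejected" a near-miss error.
pairs = Dataset.from_list([
    {
        "prompt": "Rewrite for TTS: 88VIP会员专享价",
        "chosen": "八八VIP会员专享价",        # correct brand reading
        "rejected": "八十八VIP会员专享价",    # mechanical digit-by-digit error
    },
    # ... more hard cases mined from the manually labeled set
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-script-rewriter", beta=0.1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```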
Experiments
Four experiments were conducted:
1. SFT on Qwen2.5‑7B with manually labeled data: 92% accuracy.
2. SFT with generic data augmentation: 88% accuracy.
3. SFT with difficult‑sample augmentation: 95% accuracy.
4. SFT + DPO on difficult samples: 97% accuracy.
Results show that targeted data augmentation and DPO significantly improve performance over generic augmentation.
Multi‑Source Information Integration
Beyond product details, we incorporate user Q&A and review data (问评买, roughly "ask, review, buy"), real‑time promotional information, and visual asset understanding. A knowledge graph (iGraph) stores offline data for millisecond‑level queries, while online services provide up‑to‑date discounts.
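A minimal sketch of how these sources might be merged per product at script‑generation time; the client objects and field names here are hypothetical stand‑ins for the internal iGraph and promotion services:

```python
# Hypothetical merge of offline knowledge-graph facts with live promotion data;
# graph_client and promo_client stand in for internal services.
from dataclasses import dataclass, field

@dataclass
class ProductContext:
    item_id: str
    selling_points: list[str] = field(default_factory=list)  # offline: mined reviews/Q&A
    current_price: float = 0.0                               # online: refreshed per query
    promotions: list[str] = field(default_factory=list)      # online: coupons, flash sales

def build_product_context(item_id: str, graph_client, promo_client) -> ProductContext:
    facts = graph_client.lookup(item_id)               # millisecond-level offline lookup
    promo = promo_client.get_live_promotions(item_id)  # real-time discounts
    return ProductContext(
        item_id=item_id,
        selling_points=facts.get("selling_points", []),
        current_price=promo["price"],
        promotions=promo.get("coupons", []),
    )
```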
Material Understanding
We extract text from product images using OCR, then summarize and classify the content. Challenges include inaccurate OCR summaries, unordered text, and redundant information. To filter images, we use clustering on extracted text, remove overly long or tiny‑font images, and prioritize categories based on business relevance.
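The sketch below illustrates one plausible version of this filtering pass, clustering images by their OCR text and keeping one representative per cluster; TF‑IDF over character n‑grams and k‑means are our assumed stand‑ins, and the length threshold is a placeholder:

```python
# Illustrative image filtering over OCR output; vectorizer, clustering choice,
# and thresholds are assumptions, not the production pipeline.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def filter_ocr_images(ocr_texts: dict[str, str], n_clusters: int = 8,
                      max_chars: int = 400) -> list[str]:
    # Drop images whose OCR text is empty or too long (dense fine print).
    keep = {img: t for img, t in ocr_texts.items() if 0 < len(t) <= max_chars}
    if not keep:
        return []
    images, texts = list(keep), list(keep.values())
    # Character n-grams avoid the need for word segmentation on Chinese text.
    vecs = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(texts)
    labels = KMeans(n_clusters=min(n_clusters, len(images)), n_init="auto").fit_predict(vecs)
    # Keep one representative image per cluster to remove redundancy.
    seen, selected = set(), []
    for img, label in zip(images, labels):
        if label not in seen:
            seen.add(label)
            selected.append(img)
    return selected
```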
Evaluation Metrics
We define a multi‑dimensional scoring system covering formatting, oral style, credibility, safety, richness, targeted selling points, and persuasive closing. Metrics combine rule‑based checks, LLM judgments, and statistical analyses such as distinct‑k diversity and sentence similarity.
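Distinct‑k is straightforward to compute; the character‑level version below (our own minimal implementation, not necessarily the team's exact metric) returns the ratio of unique k‑grams to total k‑grams, so higher values indicate less repetitive scripts:

```python
# Character-level distinct-k: unique k-grams / total k-grams across scripts.
def distinct_k(texts: list[str], k: int = 2) -> float:
    ngrams = [t[i:i + k] for t in texts for i in range(len(t) - k + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# A repeated script scores low; varied phrasing scores higher.
print(distinct_k(["今天这款宝贝超值", "今天这款宝贝超值"]))
```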
Future Directions
Planned work includes deeper multimodal understanding (MLLM for video), real‑time visual overlays matching script content, and richer evaluation criteria that capture human‑like storytelling and audience engagement.
Team Introduction
The authors, from the Taobao Live AIGC team, specialize in large language models, multimodal perception, speech synthesis, digital human modeling, and end‑to‑end AI deployment for e‑commerce live streaming.