How AI Powers 24/7 Digital Human Live Streams: Architecture, Challenges, and Innovations
This article presents a comprehensive overview of the AI‑driven digital‑human live‑streaming solution used by Taobao, detailing its six core components: LLM‑based copy generation, LLM‑driven interaction, TTS, visual driving, audio‑video engineering, and backend services. It also shares architectural diagrams, cost‑reduction strategies, productization insights, and future directions.
Overview
The project builds an intelligent digital‑human system for live streaming around six core stages:
LLM‑based copy generation: the "brain" that writes product scripts.
LLM‑driven interaction: natural dialogue with viewers.
TTS: emotional voice synthesis.
Visual driving: synchronizing voice with facial expression, lip sync, and body motion.
Audio‑video engineering: real‑time rendering, low‑latency transmission, high‑quality output.
Backend services: a stable, elastic, high‑concurrency platform.
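The six stages above form one generation loop per explanation segment: script, then voice, then visuals. The sketch below models that loop in miniature; every name (`ProductInfo`, `generate_copy`, `synthesize_speech`, `drive_avatar`) is a hypothetical stand‑in, not Taobao's actual API.

```python
from dataclasses import dataclass

@dataclass
class ProductInfo:
    # Minimal product facts fed to the LLM "brain" (illustrative schema).
    title: str
    selling_points: list[str]

def generate_copy(product: ProductInfo) -> str:
    """LLM copy generation, stubbed: turn product facts into a spoken script."""
    points = "; ".join(product.selling_points)
    return f"Introducing {product.title}: {points}."

def synthesize_speech(text: str) -> bytes:
    """TTS stage, stubbed: a real system would return encoded audio."""
    return text.encode("utf-8")

def drive_avatar(audio: bytes) -> dict:
    """Visual driving, stubbed: derive lip-sync frames from audio length."""
    return {"frames": max(1, len(audio) // 320), "lip_sync": True}

def run_segment(product: ProductInfo) -> dict:
    """One explanation segment: copy -> voice -> visuals."""
    script = generate_copy(product)
    audio = synthesize_speech(script)
    visuals = drive_avatar(audio)
    return {"script": script, "audio_len": len(audio), **visuals}

segment = run_segment(ProductInfo("a thermos", ["keeps drinks hot 12h", "leak-proof"]))
print(segment["lip_sync"])  # True
```

The point of the shape, not the stubs: each stage consumes the previous stage's output, so latency and cost optimizations (discussed later in the article) can target individual stages independently.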
Key Articles
Taobao Live Digital Human LLM Inference Optimization: Model Distillation and Path Compression
Taobao Live Digital Human: LLM Copy Generation Technology
Taobao Live Digital Human: LLM Danmaku Interaction Technology
Taobao Live Digital Human: TTS Voice Synthesis Technology
Taobao Live Digital Human: Visual Driving Technology
Business Value and Pain Points
Non‑broadcast time slots: enable 24‑hour autonomous streaming using cloned avatars.
High cost of host explanations: reduce by auto‑generating product copy with LLM.
Inability to reply to massive comments: achieve real‑time danmaku interaction with LLM dialogue.
Complex product display operations: automate visual material, product cards, and effects.
Core Chain Overview
The live‑streaming pipeline spans audio/video capture, rendering/mixing, encoding, transmission over GRTN (Alibaba's real‑time transport network), and playback. Diagrams illustrate each stage and data flow, down to a byte‑level view of audio and video processing.
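The push‑side chain can be pictured as a sequence of per‑frame transforms. This is a toy model under stated assumptions: the stage functions and frame fields are invented for illustration, and GRTN is reduced to a pass‑through hop.

```python
import functools

def capture(frame_id: int) -> dict:
    # Camera/compositor output (here: a bare frame record).
    return {"id": frame_id, "raw": True}

def render_mix(frame: dict) -> dict:
    # Rendering/mixing: composite overlays such as product cards.
    return {**frame, "overlays": ["product_card"], "raw": False}

def encode(frame: dict) -> dict:
    # Encoding: compress for transport (codec name is illustrative).
    return {**frame, "codec": "h264"}

def transmit(frame: dict) -> dict:
    # Transmission: modeled as a single GRTN hop.
    return {**frame, "hop": "GRTN"}

def play(frame: dict) -> dict:
    # Player-side decode and display.
    return {**frame, "played": True}

PIPELINE = [capture, render_mix, encode, transmit, play]

def run_frame(frame_id: int) -> dict:
    """Push one frame through every stage in order."""
    return functools.reduce(lambda f, stage: stage(f), PIPELINE[1:], capture(frame_id))

print(run_frame(0)["played"])  # True
```

Chaining stages this way mirrors why the article measures latency per stage: end‑to‑end delay is the sum of the per‑stage costs, so each link is a separate optimization target.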
LiveCopilot Architecture
LiveCopilot integrates rendering, audio‑video, and AI engineering, delivering LLM, TTS, and lip‑driving capabilities in live scenarios. The architecture consists of AI engineering, audio‑video rendering, and live/short‑video modules.
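One way to read that decomposition is as three modules behind a single facade: an AI engine produces utterances, an audio‑video renderer turns them into frames, and a streaming module pushes them out. The interfaces below are assumptions for illustration, not LiveCopilot's real SDK surface.

```python
from typing import Protocol

class AIEngine(Protocol):
    # AI engineering module: LLM + TTS producing the next thing to say.
    def next_utterance(self, context: str) -> str: ...

class AVRenderer(Protocol):
    # Audio-video rendering module: utterance -> driven, rendered frame data.
    def render(self, utterance: str) -> bytes: ...

class StreamModule(Protocol):
    # Live/short-video module: push frames to the stream.
    def push(self, frame: bytes) -> bool: ...

class LiveCopilot:
    """Facade wiring the three modules into one loop (hypothetical design)."""
    def __init__(self, ai: AIEngine, renderer: AVRenderer, stream: StreamModule):
        self.ai, self.renderer, self.stream = ai, renderer, stream

    def tick(self, context: str) -> bool:
        """One loop iteration: generate, render, push."""
        return self.stream.push(self.renderer.render(self.ai.next_utterance(context)))

# Minimal stubs to exercise the loop.
class EchoAI:
    def next_utterance(self, context: str) -> str:
        return f"Answering: {context}"

class ByteRenderer:
    def render(self, utterance: str) -> bytes:
        return utterance.encode("utf-8")

class NullStream:
    def push(self, frame: bytes) -> bool:
        return len(frame) > 0

copilot = LiveCopilot(EchoAI(), ByteRenderer(), NullStream())
print(copilot.tick("Is it in stock?"))  # True
```

Separating the modules behind narrow interfaces is what lets the same AI stack serve both live and short‑video scenarios, as the architecture section describes.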
Cost Reduction & Innovation
Device‑cloud hybrid execution lowers the overall cost of running the digital human.
TTS splitting improves online quality and reduces compute cost.
Material‑copy integration enriches live explanations by pulling product assets and merging with foreground video.
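The "TTS splitting" idea can be sketched as chunking long copy at sentence boundaries so synthesis can start streaming before the whole script is generated. The punctuation set and per‑chunk budget below are assumptions, not the article's actual parameters.

```python
import re

# Split after Chinese or Western sentence-ending punctuation.
SENT_BOUNDARY = re.compile(r"(?<=[。！？.!?])\s*")
MAX_CHARS = 60  # assumed per-request budget for the TTS backend

def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars."""
    sentences = [s for s in SENT_BOUNDARY.split(text) if s]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current += s
    if current:
        chunks.append(current)
    return chunks

copy_text = ("This blender crushes ice in seconds. "
             "It has five speeds. Cleanup takes one minute.")
chunks = split_for_tts(copy_text, max_chars=40)
print(len(chunks))  # 3
```

Smaller synthesis units cut time‑to‑first‑audio and let failed or low‑quality chunks be re‑synthesized individually, which is one plausible reading of how splitting improves online quality while reducing compute cost.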
Productization Thoughts
Focus on user pain points, simplify steps, and minimize documentation.
Iterate quickly with weekly demos.
Engage seed users, build trust, and collect feedback.
Future Directions
Digital‑human assistants and customer service avatars.
Assistive streaming for people with disabilities.
Personalized digital assistants for every user.
Education‑wide digital teachers.
Digital memory: cloning voices and personas for lasting presence.
Team Introduction
The author, Jing Jiang, is from the Taobao Group Live AIGC team, which pioneers AI‑native technologies for e‑commerce live streaming, covering large language models, multimodal understanding, speech synthesis, digital‑human modeling, AI deployment, and audio‑video processing. The team has built an end‑to‑end AI stack and commercialized the digital‑human live solution for thousands of merchants.