How Baidu’s HuiBosheng AI Live Platform Generates Super‑Human Scripts and Real‑Time Interaction

The article details Baidu HuiBosheng's end‑to‑end AI live‑streaming platform, covering merchant workflow, multimodal product understanding, style‑aware script generation, reinforcement‑learning‑driven smart control, voice and avatar cloning, and a data‑flywheel that continuously improves model performance, illustrated with real‑world GMV results.

Baidu Tech Salon

Platform Overview

Since its launch in 2023, Baidu HuiBosheng has evolved into a full‑stack AI live‑streaming platform that integrates script generation, real‑time Q&A, intelligent control, and voice‑avatar cloning. It serves over 20,000 live rooms daily across e‑commerce, education, health, finance and other sectors.

Merchant Workflow

Product selection – Merchants choose items from Baidu’s own stores, third‑party platforms (Taobao, JD, Pinduoduo) or local services.

Avatar selection or customization – Choose from a public library of 7,800+ avatars or upload a 5‑minute video to create a private avatar.

Live‑room design – Pick from 3,600+ templates or let AI generate background graphics and marketing widgets.

Script generation – Select a public script style or provide a custom brief (minimum 400 words) to generate a product‑focused, conversational script.

Voice selection or creation – Choose from 3,200+ public voice tones or record a short clip to obtain a private voice model within three days.

Interactive configuration – Enable one‑click Q&A takeover or manually configure preset Q&A pairs to enrich the knowledge base.
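The six setup steps above map naturally onto a single configuration object. Below is a minimal Python sketch of such a payload with a sanity check for the 400‑word brief rule; every field name here is hypothetical, since HuiBosheng's actual merchant API is not public.

```python
# Hypothetical live-room configuration mirroring the six merchant-side steps.
# All field names are illustrative, not HuiBosheng's real schema.
live_room_config = {
    "products": [
        {"source": "baidu_store", "sku_id": "SKU-001"},
        {"source": "third_party", "platform": "JD", "sku_id": "SKU-002"},
    ],
    "avatar": {"mode": "public_library", "avatar_id": "AV-1234"},  # or "custom_upload"
    "room_design": {"mode": "template", "template_id": "T-88"},    # or "ai_generated"
    "script": {"style": "energetic_sales", "custom_brief": None},  # brief >= 400 words if set
    "voice": {"mode": "public", "voice_id": "V-56"},               # or "custom_recording"
    "qa": {"one_click_takeover": True, "preset_pairs": []},
}

def validate(config: dict) -> list[str]:
    """Minimal sanity checks mirroring the workflow rules described above."""
    errors = []
    if not config["products"]:
        errors.append("at least one product is required")
    brief = config["script"].get("custom_brief")
    if brief is not None and len(brief.split()) < 400:
        errors.append("custom script brief must be at least 400 words")
    return errors

print(validate(live_room_config))  # -> []
```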

Technical Architecture

The system consists of merchant‑side services, multimodal visual‑speech‑text models, a real‑time rendering engine, and internal/external distribution layers.

Core Modules

Product understanding – Multimodal OCR and layout analysis extract key selling points, target audience, usage scenarios, and other structured knowledge from product images and text.

Script generation – A large‑language model pre‑trained on e‑commerce live‑stream corpora, fine‑tuned with expert‑labeled data (SFT) and reinforced with RLHF to produce style‑consistent, high‑conversion scripts.

Intelligent Q&A – A Retrieval‑Augmented Generation (RAG) pipeline retrieves relevant product knowledge and generates precise answers for both chat and spoken replies.

Smart control – A reinforcement‑learning agent decides optimal actions (e.g., invite comments, push sales, switch product focus) based on live‑room state (viewers, comments, product, etc.) and receives rewards from order volume, comment growth, and watch time.

Live‑room rendering – AI synthesizes background images and marketing components to create a cohesive visual experience.
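The intelligent Q&A module above follows a standard RAG pattern: retrieve the most relevant product facts, then condition the generator on them. The sketch below illustrates that retrieve-then-prompt step with a tiny in-memory knowledge base; bag‑of‑words cosine similarity stands in for the production dense retriever, which Baidu has not published.

```python
import math
import re
from collections import Counter

# Toy product knowledge base; in production this would hold structured
# facts extracted by the product-understanding module.
KNOWLEDGE = [
    "This blender has a 1200W motor and six speed settings.",
    "The warranty covers parts and labor for two years.",
    "Shipping is free for orders placed during the live stream.",
]

def _vec(text: str) -> Counter:
    # Bag-of-words term counts; stands in for an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k knowledge snippets most similar to the question."""
    q = _vec(question)
    ranked = sorted(KNOWLEDGE, key=lambda d: _cosine(q, _vec(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Assemble the grounded prompt an LLM would answer from."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this product knowledge:\n{context}\nQ: {question}\nA:"

print(build_prompt("How long is the warranty?"))
```

The same prompt can feed both the chat reply and the spoken reply; only the rendering differs.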

Style‑Aware Script Generation

Merchants provide a product and a brief marketing cue; the system analyses the desired style (pace, emotion, storytelling technique) and generates a script that mirrors the chosen influencer’s tone while embedding product‑specific selling points.

Adoption rate of generated scripts reaches 92%, with a 67% live‑room penetration and a 14% conversion uplift over manually written scripts.

The script engine is also offered as a standalone tool for non‑live‑stream use cases.
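One plausible way to combine a style analysis (pace, emotion, storytelling technique) with product selling points is simple prompt assembly, sketched below. The template and attribute names are assumptions for illustration, not HuiBosheng's actual prompt format.

```python
# Hypothetical style-aware prompt assembly: style attributes come from the
# description above; the template wording itself is an assumption.
def build_script_prompt(product: dict, style: dict) -> str:
    selling_points = "; ".join(product["selling_points"])
    return (
        f"Write a live-stream sales script for {product['name']}.\n"
        f"Pace: {style['pace']}. Emotion: {style['emotion']}. "
        f"Storytelling: {style['storytelling']}.\n"
        f"Weave in these selling points naturally: {selling_points}.\n"
        "Keep the tone conversational and address viewers directly."
    )

prompt = build_script_prompt(
    {"name": "ceramic tea set", "selling_points": ["hand-glazed", "dishwasher safe"]},
    {"pace": "fast", "emotion": "enthusiastic", "storytelling": "personal anecdote"},
)
print(prompt)
```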

Data Flywheel

Two complementary loops continuously improve the models: a “prior” alignment loop uses multi‑model preference voting to create high‑quality reward data without heavy human labeling; a “posterior” loop collects real‑time user feedback (engagement, conversion) to fine‑tune the models via uplift modeling and causal inference (S‑Learner/T‑Learner).
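The posterior loop's S‑Learner idea can be shown in a few lines: one model sees the treatment flag as an ordinary feature, and the uplift estimate is the difference between its predictions with the flag on and off. The data below is synthetic and the linear model is a stand‑in; it only illustrates the mechanism, not Baidu's production setup.

```python
import numpy as np

# Toy S-Learner: synthetic live-room data with a known heterogeneous effect.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)          # a live-room feature (e.g. scaled viewer count)
t = rng.integers(0, 2, size=n)  # 1 = intervention applied (e.g. push sales)
# True outcome: baseline 0.5*x, plus treatment effect 0.8 + 0.3*x when treated.
y = 0.5 * x + 0.8 * t + 0.3 * t * x + rng.normal(scale=0.1, size=n)

# Single model over [1, x, t, t*x]; the interaction term lets the fitted
# model express uplift that varies with x.
X = np.column_stack([np.ones(n), x, t, t * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def uplift(x_new: float) -> float:
    """S-Learner uplift: prediction with t=1 minus prediction with t=0."""
    f1 = beta @ np.array([1.0, x_new, 1.0, x_new])
    f0 = beta @ np.array([1.0, x_new, 0.0, 0.0])
    return f1 - f0

print(round(uplift(1.0), 2))  # recovers roughly 0.8 + 0.3*1.0 = 1.1
```

A T‑Learner would instead fit two separate models on the treated and untreated rooms and subtract their predictions.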

Reinforcement‑Learning Control Agent

The agent observes live‑room state S_t (viewers, comments, current product, etc.), selects an action A_t (e.g., invite comments, push sales, change product focus), and receives reward R_t based on KPI changes. Iterative trial‑and‑error yields policies that maximize long‑term metrics such as total orders and average watch time.
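This state–action–reward loop can be sketched with tabular Q‑learning. The two‑state environment below (a "quiet" vs. "hot" room where inviting comments warms the room up and pushing sales pays off only when it is hot) is an invented toy; the production agent's state space and reward shaping are far richer.

```python
import random

# Tabular Q-learning over a toy live-room environment. States, actions,
# and rewards are simplified stand-ins for the real agent's inputs.
ACTIONS = ["invite_comments", "push_sales", "switch_product"]
random.seed(0)

def simulate_reward(state: str, action: str) -> float:
    """Toy reward: pushing sales pays only when the room is already hot."""
    if state == "hot" and action == "push_sales":
        return 1.0
    if state == "quiet" and action == "invite_comments":
        return 0.5
    return 0.0

def next_state(state: str, action: str) -> str:
    """Inviting comments warms the room; anything else cools it down."""
    return "hot" if action == "invite_comments" else "quiet"

Q = {(s, a): 0.0 for s in ("hot", "quiet") for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration
state = "quiet"
for _ in range(5000):
    if random.random() < epsilon:      # explore
        action = random.choice(ACTIONS)
    else:                              # exploit the current estimate
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    reward = simulate_reward(state, action)
    nxt = next_state(state, action)
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = nxt

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ("hot", "quiet")}
print(policy)
```

After training, the greedy policy alternates between warming the room and monetizing it, which mirrors the long‑horizon trade‑off the article describes.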

Voice Cloning & Synthesis

Using style‑transfer TTS, the platform supports two voice modes – natural and “passionate sales” – with adoption rising from 30.3% to 92.8% and voice‑model creation time cut from one month to one minute. Over 120,000 public voices and 27,000 custom voices are available.

Avatar Cloning & Synthesis

Avatar creation progressed through four stages, moving from closed‑mouth, no‑obstruction recordings to full‑face, multi‑person, action‑driven avatars. Currently, more than 320,000 public avatars and 80,000 custom avatars have been generated, with a 95% online availability rate.

Conclusion

After two years of development, HuiBosheng has become a multimodal AI live‑streaming platform that not only replicates human hosts’ behavior but also surpasses it through product‑aware generation, RL‑driven decision making, and large‑scale data flywheels. Ongoing work focuses on further improving script precision, interaction naturalness, visual realism, voice expressiveness, and smarter decision policies.

Tags: live streaming, AI, multimodal, reinforcement learning, script generation, voice cloning, data flywheel
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
