Taobao AIGC Content Generation: Short Video Production Techniques
Taobao’s Content AI team uses a proprietary multimodal Mixture‑of‑Experts model to generate short‑form videos automatically, both by extracting highlights from live streams and by creating customized product explainers. The approach combines two‑stage CLIP/VideoBLIP training, character‑level timestamps, LLM re‑segmentation, and OCR masking. It now produces over 100 k videos per day, with a 12 % approval boost and notable conversion gains.
Taobao’s Content AI team presents a series of technical articles summarizing their recent advances in AIGC (AI‑generated content) across the platform, including video generation, image‑text joint generation, and multimodal large‑model research.
The first solution, Highlight Clipping, extracts key moments from massive live‑stream recordings and automatically creates 30‑60 second short videos that capture product highlights while respecting tight user attention spans. The pipeline includes live‑stream parsing, multimodal relevance scoring, and rapid short‑video synthesis.
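The clip-selection step can be illustrated with a minimal sketch. It assumes the relevance-scoring stage has already produced one score per second of the recording, and it picks the contiguous 30-60 second window with the highest mean score. The function name, the per-second granularity, and the mean-score criterion are illustrative assumptions, not the team's published method.

```python
from itertools import accumulate


def best_highlight_window(scores, min_len=30, max_len=60):
    """Return (start, end, mean_score) for the best contiguous window.

    `scores` holds one hypothetical relevance score per second.
    `end` is exclusive. Ties keep the earliest, shortest window.
    """
    n = len(scores)
    if n < min_len:
        raise ValueError("recording is shorter than the minimum clip length")

    # prefix[i] is the sum of scores[0:i], so any window sum is O(1).
    prefix = [0.0] + list(accumulate(scores))

    best = None
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(0, n - length + 1):
            mean = (prefix[start + length] - prefix[start]) / length
            if best is None or mean > best[2]:
                best = (start, start + length, mean)
    return best


if __name__ == "__main__":
    # Toy recording: 40 seconds of high relevance inside 240 seconds.
    toy_scores = [0.0] * 100 + [1.0] * 40 + [0.0] * 100
    print(best_highlight_window(toy_scores))
```

A production pipeline would likely also snap the window edges to sentence or shot boundaries so that a clip does not start mid-utterance.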
The second solution, Mixed‑Video Production, uses original product assets (images, videos, text) to generate customized explanatory videos. Scripts are tailored per product category (e.g., apparel, food) to showcase the visual cues that suit each one, such as try‑on clips for clothing or texture shots for food.
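One way to picture category-specific scripting is a lookup from product category to an ordered shot list, filled from whatever assets a product actually has. The sketch below is hypothetical throughout: the `Asset` type, the shot-kind labels, and the `SCRIPT_TEMPLATES` table are invented for illustration and are not taken from Taobao's system.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    path: str
    kind: str  # hypothetical shot label, e.g. "try_on", "texture", "packshot"


# Hypothetical per-category shot order; real templates are not published.
SCRIPT_TEMPLATES = {
    "apparel": ["packshot", "try_on", "detail"],
    "food": ["packshot", "texture", "serving"],
}


def plan_shots(category, assets):
    """Return assets ordered by the category template, skipping missing kinds."""
    template = SCRIPT_TEMPLATES.get(category, ["packshot"])

    # Group assets by kind, preserving their original order.
    by_kind = {}
    for asset in assets:
        by_kind.setdefault(asset.kind, []).append(asset)

    # Take the first available asset for each slot in the template.
    return [by_kind[kind][0] for kind in template if by_kind.get(kind)]


if __name__ == "__main__":
    demo = [
        Asset("a.mp4", "try_on"),
        Asset("b.jpg", "packshot"),
        Asset("c.jpg", "texture"),
    ]
    print([a.path for a in plan_shots("apparel", demo)])
```

Keeping the templates in data rather than code makes it straightforward to add a category without touching the planning logic.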
Both solutions rely on a proprietary multimodal large model built on a Mixture‑of‑Experts (MoE) architecture. Training follows a two‑stage strategy: (1) single‑modal CLIP/FLIP pre‑training on billions of e‑commerce image‑text pairs, and (2) VideoBLIP/VideoCoCa adaptation that aligns video frames with textual prompts using tasks such as VTC, VTM, VTTG, and VideoFLIP. Knowledge distillation (simulation and preference distillation) then compresses the large model into efficient, inference‑ready versions.
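The alignment objectives named above are not spelled out in the article. As a rough illustration, the sketch below assumes VTC denotes a CLIP-style video-text contrastive loss and implements the standard symmetric InfoNCE form in plain Python. The temperature value and the tiny two-dimensional embeddings are assumptions chosen for readability; a real model would compute this on batched tensors.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows i, using a stable log-sum-exp."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i of videos should match pair i of texts."""
    videos = [_normalize(v) for v in video_embs]
    texts = [_normalize(t) for t in text_embs]

    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    logits_t = [list(col) for col in zip(*logits)]

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


if __name__ == "__main__":
    aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
    shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
    print(aligned < shuffled)
```

The loss is near zero when each video embedding is closest to its own caption and grows when the pairing is scrambled, which is the property contrastive pre-training exploits.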
To improve temporal precision, the team replaces coarse sentence‑level ASR with character‑level timestamps, then uses a large language model to re‑segment sentences and align them with visual changes. Multimodal matching further refines segment selection, while OCR‑based sensitive‑information masking ensures compliance.
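The benefit of character-level timestamps can be sketched as follows: once the language model has re-split the transcript into sentences, each sentence's time span falls out of the first and last character it covers. The `CharStamp` type and `align_sentences` function are hypothetical names, and the sketch assumes the re-segmented sentences concatenate back to the original transcript exactly (no characters added, dropped, or reordered by the model).

```python
from dataclasses import dataclass


@dataclass
class CharStamp:
    char: str
    start: float  # seconds
    end: float    # seconds


def align_sentences(chars, sentences):
    """Map re-segmented sentences onto character timestamps.

    Returns a list of (sentence, start_time, end_time) tuples.
    """
    spans = []
    pos = 0
    for sentence in sentences:
        if not sentence:
            continue
        chunk = chars[pos:pos + len(sentence)]
        # Guard against a segmenter that altered the transcript text.
        if "".join(c.char for c in chunk) != sentence:
            raise ValueError(f"sentence does not match transcript at offset {pos}")
        spans.append((sentence, chunk[0].start, chunk[-1].end))
        pos += len(sentence)
    return spans


if __name__ == "__main__":
    # Toy transcript: five characters, half a second each.
    stamps = [CharStamp(c, i * 0.5, (i + 1) * 0.5) for i, c in enumerate("abcde")]
    print(align_sentences(stamps, ["ab", "cde"]))
```

With sentence-level ASR output, the same boundaries could only be placed where the recognizer happened to split, which is why finer timestamps allow cuts that follow the re-segmented text.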
Business results show daily production of over 100 k short videos, a 12‑point increase in first‑pass approval rates, and significant uplift in conversion metrics during large‑scale campaigns such as Double‑11.
Future work will deepen AIGC integration, explore more expressive generative models, and continue to expand the platform’s content ecosystem. The team welcomes collaboration and is recruiting.
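Do you ever have mornings when you just don't feel like cooking? I recently discovered these dried noodles from 金沙河 (Jinshahe): the original flavor is chewy and smooth, the egg noodles are especially fragrant, and the dragon-beard noodles are extra thin and thoughtfully made. They are quick to prepare and tasty, and best of all, the 5-jin (2.5 kg) pack lasts a long time! Every bowl warms the heart and the stomach. Highly recommended to everyone!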
DaTaobao Tech
Official account of DaTaobao Technology