How Taobao’s “Faxiang” AI Model Revolutionizes E‑Commerce Video Generation
Taobao’s AIGC video generation platform, built on the large‑scale “Faxiang” model that evolved from a UNet to a DiT architecture, draws on more than 2 billion curated e‑commerce videos, expert alignment, LoRA fine‑tuning, and multi‑control generation to deliver diverse, high‑quality product videos that measurably lift conversion metrics across the marketplace.
Introduction
Taobao has integrated AI‑generated content (AIGC) throughout its user journey, from discovery feeds to product detail pages, to reduce content production costs and accelerate content supply across its consumer ecosystem. Over the past year the team has advanced video generation, multimodal text‑image synthesis, personalized copy, and persona agents, culminating in a series of technical papers.
Model Evolution and Data Foundation
The core video‑generation model, named Faxiang, transitioned from a UNet architecture to a Diffusion Transformer (DiT) architecture after a year and a half of research, iteration, and data accumulation. The team built a pipeline that cleans and annotates more than 2 billion high‑quality e‑commerce videos, focusing on apparel categories.
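Faxiang’s exact architecture is not public, but the practical difference in the UNet‑to‑DiT shift is that a DiT tokenizes the video latent into spatio‑temporal patches and processes them with a standard transformer, so capacity scales with token count rather than convolutional depth. A minimal, dependency‑free sketch of that patchification arithmetic (all shapes and patch sizes here are illustrative, not Faxiang’s real configuration):

```python
def dit_token_count(frames, height, width, channels, t_patch=1, s_patch=2):
    """Number of transformer tokens for a video latent of shape
    (frames, channels, height, width), split into temporal patches of
    size t_patch and spatial patches of size s_patch x s_patch.
    Hypothetical shapes for illustration only."""
    assert frames % t_patch == 0, "frames must divide evenly into temporal patches"
    assert height % s_patch == 0 and width % s_patch == 0, "spatial dims must divide evenly"
    tokens = (frames // t_patch) * (height // s_patch) * (width // s_patch)
    token_dim = channels * t_patch * s_patch * s_patch  # flattened patch length
    return tokens, token_dim

# e.g. a 16-frame, 32x32, 4-channel latent with 2x2 spatial patches
tokens, dim = dit_token_count(16, 32, 32, 4)
print(tokens, dim)  # 4096 tokens, each a 16-dimensional vector before projection
```

The point of the sketch: longer or higher‑resolution videos simply produce more tokens, which is what makes transformer‑style scaling (and the large curated dataset described above) pay off.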
Technical Advantages
Massive domain‑specific data: Continuous collection, cleaning, and labeling of e‑commerce marketing videos ensure rich training material.
Expert alignment: Human e‑commerce experts score generated outputs to correct hand distortions, unnatural expressions, and style mismatches, feeding preference data back into model alignment.
LoRA fine‑tuning system: Modular LoRA adapters add marketing‑copy, camera‑motion, lighting, and scene‑change capabilities while keeping base‑model updates inexpensive.
Rich control interface: Text prompts, motion amplitude, and camera‑movement controls let a single image produce multiple video styles.
Derived model matrix: Includes video try‑on, background replacement, video‑to‑video, video extension, action‑driven video, voice‑driven video, and virtual‑human generation, all combinable for product‑level solutions.
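The LoRA scheme above is what makes this model matrix cheap to maintain: the base weights stay frozen, and each capability (copy, camera motion, lighting, scene change) is a small low‑rank adapter trained on top. A minimal pure‑Python sketch of the LoRA forward pass, y = Wx + (α/r)·B(Ax), with toy dimensions (nothing here reflects Faxiang’s actual adapter configuration):

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """LoRA inference: y = W x + (alpha / r) * B (A x).
    W is the frozen d_out x d_in base weight; A (r x d_in) and B (d_out x r)
    are the only trained parameters. Because W never changes, many adapters
    can share one base model and be swapped per capability."""
    r = len(A)                          # adapter rank
    base = matvec(W, x)                 # frozen base path
    low_rank = matvec(B, matvec(A, x))  # rank-r update path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]

# Toy example: d_in = d_out = 2, rank r = 1
W = [[1.0, 0.0], [0.0, 1.0]]  # identity base weight (frozen)
A = [[1.0, 1.0]]              # 1 x 2 down-projection
B = [[0.5], [0.5]]            # 2 x 1 up-projection
print(lora_forward(W, A, B, [2.0, 3.0]))  # → [4.5, 5.5]
```

Swapping in a different (A, B) pair changes the model’s behavior without touching W, which is why per‑capability adapters keep base‑model updates inexpensive.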
Model Characteristics
Versatile fashion presentation: Generates videos for child wear, professional attire, and casual styles, with adjustable resolution and duration.
High success rate: Low rates of hand distortion, body clipping, and unrealistic poses, ranking among the best in industry benchmarks.
Deep e‑commerce understanding: Training data and expert alignment give the model strong awareness of fashion context, producing appropriate expressions and motions for each clothing type.
Strong generalization: Works well on synthetic models, real‑world photos, studio shots, and swapped‑clothing images.
Business Impact
The video generation suite now accounts for over 50% of Taobao’s total video volume. AI‑generated videos achieve a 70% higher click‑through rate (CTR) and a 50% higher click‑through conversion rate (CTCVR) than non‑AI videos. Cumulative exposure exceeds 4.5 billion views, driving 30% of total purchase conversions and 50% of GMV, with a conversion efficiency 2.7× that of traditional videos.
Application Scenarios
Scenario 1 – Flat‑lay image to product video: Merchants upload a flat clothing image and receive a 5‑15 s video with selling points, already live in the “QianNiu‑Business Manager” tool.
Scenario 2 – Model photo to product and seed videos: A set of model photos generates individual video clips that are stitched into a longer video, with optional content‑seed videos preserving existing marketing copy.
Scenario 3 – Virtual‑human mixed‑clip video: Combines generated product explanations with a virtual avatar that can virtually “try on” the clothing.
Scenario 4 – Video try‑on: Replaces clothing in an existing model video with new items, expanding video assets for other products.
Scenario 5 – Video‑to‑video: Modifies motion and scene in an existing video to create new, copyright‑clear content.
Scenario 6 – Background replacement: Swaps the background of a video to fit different marketing contexts.
Scenario 7 – Action‑driven video: Replicates typical fashion‑show movements to produce ready‑to‑publish clips.
Scenario 8 – Video outpainting: Extends a video to multiple resolutions and aspect ratios for varied platforms.
Scenario 9 – Virtual‑human narration: Integrates face generation, video try‑on, and lip‑sync to create diverse virtual‑human presenters.
Conclusion
The “Faxiang” model demonstrates how large‑scale, domain‑specific AI can deliver controllable, high‑quality video synthesis that directly fuels e‑commerce growth, reduces production costs, and opens new creative possibilities for merchants.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.