How Taobao’s AI Turns Static Clothing Images into Seamless Virtual Try‑On Videos
This article analyzes Taobao’s AIGC video virtual try‑on pipeline, detailing the challenges of frame‑level realism and continuity, the upgraded DiT‑based model, 3D‑VAE compression, large‑scale data collection, template‑matching mechanisms, and the resulting product capabilities for automated marketing and personalized shopper experiences.
Background
Taobao’s content AI team identified a need for low‑cost, high‑timeliness AIGC content across the entire user journey, from feed to search and detail pages. Traditional image‑based virtual try‑on was limited in dynamism and physical realism, prompting a shift toward video‑level virtual fitting.
Problem Definition
The task is to place a specified garment onto a person in a video, ensuring per‑frame realism (skin tone, texture, shape) and temporal coherence (smooth motion, consistent clothing appearance). Challenges include higher data requirements, larger model parameters, and increased computational cost compared to image‑only methods.
Technical Approach
We decompose the problem into two aspects:
Achieve realistic and natural single‑frame results while preserving consistency between the person and the garment.
Maintain smooth motion and consistent clothing attributes across frames.
To address these, we built a high‑resolution, high‑frame‑rate video virtual try‑on system that produces HD, coherent videos.
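The two aspects above can be expressed as two complementary objectives: a per‑frame term that scores single‑frame realism against a reference, and a temporal term that penalizes abrupt frame‑to‑frame changes. The sketch below is a minimal numpy illustration of this decomposition; the loss forms and the weighting factor `lam` are assumptions for clarity, not the system's actual training objective.

```python
import numpy as np

def per_frame_loss(pred, target):
    # Per-frame realism: mean squared error of each generated frame
    # against the reference (skin tone, texture, shape).
    return ((pred - target) ** 2).mean()

def temporal_loss(pred):
    # Temporal coherence: penalize large differences between
    # consecutive frames so motion and garment appearance stay smooth.
    deltas = pred[1:] - pred[:-1]
    return (deltas ** 2).mean()

def tryon_objective(pred, target, lam=0.5):
    # Combined objective: single-frame realism + weighted coherence.
    # Videos are arrays of shape (frames, height, width, channels).
    return per_frame_loss(pred, target) + lam * temporal_loss(pred)
```

A video that matches the reference exactly and changes smoothly over time scores low on both terms; flicker or garment drift raises the temporal term even when individual frames look plausible.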
Model Improvements
Iterated on a DiT‑based image‑to‑video backbone as the pre‑trained base model, improving generalization for the fashion e‑commerce domain.
Introduced a 3D‑VAE to compress spatio‑temporal video data, boosting input resolution and frame rate.
Established a high‑quality video‑level data collection pipeline, continuously expanding diverse training data and designing optimized training/inference schemes.
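To make the 3D‑VAE's role concrete: it compresses the video jointly along time and space, so the DiT backbone operates on a much smaller latent grid and can afford higher input resolution and frame rate. The toy sketch below stands in for the encoder with simple average pooling; a real 3D‑VAE learns this compression with strided 3D convolutions, and the stride factors here are illustrative, not Taobao's actual configuration.

```python
import numpy as np

def spatiotemporal_downsample(video, t_stride=4, s_stride=8):
    # Toy stand-in for a 3D-VAE encoder: average-pool the video
    # jointly over time and space. A learned 3D-VAE replaces this
    # pooling with strided 3D convolutions.
    T, H, W, C = video.shape
    T2, H2, W2 = T // t_stride, H // s_stride, W // s_stride
    v = video[:T2 * t_stride, :H2 * s_stride, :W2 * s_stride]
    v = v.reshape(T2, t_stride, H2, s_stride, W2, s_stride, C)
    return v.mean(axis=(1, 3, 5))

# A 16-frame 256x256 clip compresses to a 4x32x32 latent grid,
# a 256x reduction in positions the backbone must attend over.
video = np.random.rand(16, 256, 256, 3)
latent = spatiotemporal_downsample(video)
```

The key point is that compression happens along the time axis as well as the spatial axes, which is what makes high‑frame‑rate inputs tractable.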
Evaluation
A comparative analysis between image‑based and video‑based virtual try‑on shows that video can display garments from multiple angles and motions, reveal physical properties (texture, drape), and deliver higher user engagement and information value, albeit with higher difficulty, cost, and computational load.
Product‑Level Capabilities
Three major product capabilities were built:
Automated marketing video generation: The system automatically selects categories lacking marketing videos, generates videos with selling points, and supports batch deployment.
Model‑template generation for merchants: Merchants can upload garment images or IDs, receive matched high‑quality template videos, and produce rich marketing assets.
Buyer‑focused try‑on videos: Users can upload personal videos or images; the system matches suitable templates and generates personalized try‑on videos, enhancing purchase confidence.
Key components include a template library with diverse, high‑quality fashion video templates, a tag‑based matching engine (up to 30 tags per garment/template) powered by large‑scale vision‑language models, and post‑processing steps such as intelligent clipping, copy generation, music, and TTS.
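The tag‑based matching step can be sketched as a simple set‑overlap ranking: each garment and template carries a tag set (up to 30 tags, produced by vision‑language models), and the engine returns the template whose tags best overlap the garment's. The Jaccard scoring rule and the data layout below are simplifying assumptions for illustration; the production engine is not described at this level of detail.

```python
def match_template(garment_tags, templates):
    # Rank candidate templates by tag overlap with the garment and
    # return the best match. Jaccard similarity is an illustrative
    # scoring rule, not the documented production metric.
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0
    return max(templates, key=lambda t: jaccard(garment_tags, t["tags"]))

# Hypothetical template library entries with their tag sets.
templates = [
    {"id": "tmpl_01", "tags": ["dress", "summer", "outdoor"]},
    {"id": "tmpl_02", "tags": ["coat", "winter", "studio"]},
]
best = match_template(["dress", "summer", "runway"], templates)
# best["id"] -> "tmpl_01"
```

A set‑based score like this degrades gracefully as tag vocabularies grow, which matters when both sides of the match carry dozens of machine‑generated tags.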
Future Directions
Ongoing work focuses on further improving the base model’s fidelity, expanding the template library, refining the matching algorithm, and extending the solution to more business scenarios to better serve the platform, merchants, and end‑users.