
Taobao Content AI: Summary of AIGC Content Generation and Multimodal Model Techniques

Taobao’s AIGC pipeline combines a human‑feedback multimodal reward model, audio‑visual joint pre‑training, and Mixture‑of‑Experts distillation to clean data, align outputs with user preferences, and achieve state‑of‑the‑art multimodal LLM performance that drives content cold‑start and conversion gains in e‑commerce.

DaTaobao Tech

As a new form of product presentation, AIGC content appears throughout the Taobao user journey—from feed recommendations to search decisions and detail‑page promotions. Over the past year, continuous breakthroughs in video generation and image‑text joint generation have enabled large‑scale deployment of AIGC across multiple Taobao scenarios.

Project Background: Content interaction tasks involve diverse data modalities. To boost performance, we need multimodal large language models (MLLMs) that can jointly process text, images, video, and audio.

Data Objectives: Build an automated pipeline for high‑quality multimodal data optimization, improving consistency, alignment, and training efficiency.

Model Objectives: Enhance model performance and training efficiency by (1) fusing multiple modalities and (2) aligning model outputs with human preferences.

Solution Overview

Human‑feedback multimodal reward model: We train a reward model on a human‑annotated dataset (HF‑dataset) that captures image–caption alignment. Pairwise preference comparisons are modeled with a Bradley‑Terry loss, yielding a fine‑grained reward signal.
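As a minimal sketch (not Taobao's actual implementation), the Bradley‑Terry pairwise objective rewards the model for scoring the human‑preferred sample above the rejected one:

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the reward model scores the human-preferred
    image-caption pair above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correct ranking yields a small loss; an inverted ranking a large one.
print(bradley_terry_loss(2.0, 0.0))  # ~0.127
print(bradley_terry_loss(0.0, 2.0))  # ~2.127
```

In practice the two reward scores come from a shared multimodal encoder head evaluated on both candidates, and the loss is averaged over a batch of annotated comparisons.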

High‑quality data cleaning: The reward model filters noisy and redundant multimodal samples, yielding a cleaner dataset that significantly improves downstream task performance.
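A hedged sketch of reward‑based filtering, where `reward_fn` is a hypothetical stand‑in for the trained reward model:

```python
def filter_by_reward(samples, reward_fn, threshold):
    """Keep only multimodal samples whose reward score meets the threshold;
    noisy or poorly aligned image-caption pairs fall below it."""
    return [s for s in samples if reward_fn(s) >= threshold]

# Toy example: score an (image_id, caption) pair by caption length,
# so the sample with an empty caption is dropped.
toy = [("img1", "a red dress"), ("img2", ""), ("img3", "blue running shoes")]
clean = filter_by_reward(toy, lambda s: len(s[1]), threshold=1)
print(clean)  # [('img1', 'a red dress'), ('img3', 'blue running shoes')]
```

The threshold is a tuning knob: too low and noise leaks through, too high and training data becomes scarce.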

Algorithm design: The cleaned data are used to train a multimodal reward model that outperforms CLIP and BLIP baselines. The model is then used for (a) reward‑based multimodal data selection and (b) supervised fine‑tuning (SFT) on instruction‑following data, followed by RLHF to align outputs with user preferences.
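RLHF typically maximizes the reward while penalizing drift from the SFT reference policy. A minimal per‑sample sketch of that KL‑penalized objective (the coefficient β and the log‑probabilities here are illustrative, not values from the article):

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """KL-penalized RLHF objective: reward minus a penalty proportional to
    how far the policy's log-probability drifts from the SFT reference."""
    kl_penalty = logp_policy - logp_ref
    return reward - beta * kl_penalty

print(rlhf_objective(1.0, -2.0, -2.0))  # 1.0: no drift, full reward kept
print(rlhf_objective(1.0, -1.0, -2.0))  # 0.9: drift from the reference is penalized
```

The penalty keeps the aligned model from collapsing onto reward‑hacking outputs far from the SFT distribution.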

Audio‑Visual Joint Pre‑training

We propose an audio‑visual MLLM architecture that aligns visual and auditory signals for richer video understanding. The pipeline includes (1) dense caption generation for video frames and audio, (2) GPT‑4‑driven creation of multi‑turn QA pairs, and (3) multimodal alignment using the reward model. This approach achieves state‑of‑the‑art results on video QA benchmarks (e.g., MSR‑VTT‑QA, ActivityNet‑QA, MUSIC‑AVQA).
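A sketch of how the dense captions from step (1) might be assembled into a prompt for the GPT‑4 QA‑generation step (2); the prompt wording is an assumption, not Taobao's actual template:

```python
def build_qa_prompt(frame_captions, audio_caption, n_turns=3):
    """Combine per-frame visual captions and an audio caption into one
    prompt asking for multi-turn QA pairs grounded in both modalities."""
    lines = ["Video frame captions:"]
    lines += [f"- {c}" for c in frame_captions]
    lines.append(f"Audio caption: {audio_caption}")
    lines.append(
        f"Write {n_turns} multi-turn question-answer pairs that require "
        "combining the visual and audio descriptions above."
    )
    return "\n".join(lines)

prompt = build_qa_prompt(
    ["a chef slices vegetables", "the chef plates the dish"],
    "knife tapping on a cutting board",
)
print(prompt)
```

Pairing visual and audio descriptions in one prompt forces the generated QA data to exercise cross‑modal reasoning rather than single‑modality recall.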

Multimodal Expert Model Distillation

To balance performance and efficiency, we employ a Mixture‑of‑Experts (MoE) small model guided by knowledge distillation from a large teacher. Two stages are used: (1) general‑to‑specialized imitation distillation to transfer complex knowledge, and (2) preference distillation to reduce hallucinations by teaching the MoE model what constitutes “good” versus “bad” outputs.
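The first (imitation) stage can be read as minimizing the divergence between the teacher's and student's output distributions; a minimal pure‑Python sketch under that reading:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def imitation_loss(teacher_probs, student_probs):
    # The MoE student is trained to match the large teacher's distribution
    # over outputs; zero loss means a perfect imitation.
    return kl_divergence(teacher_probs, student_probs)

teacher = [0.7, 0.2, 0.1]
print(imitation_loss(teacher, teacher))          # 0.0: perfect match
print(imitation_loss(teacher, [0.4, 0.3, 0.3]))  # > 0: imperfect student
```

The second (preference) stage reuses the pairwise idea from the reward model: the student is penalized when it scores a hallucinated ("bad") output above a faithful ("good") one.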

The resulting MoE model demonstrates superior complex‑understanding and hallucination‑mitigation capabilities compared to similarly sized baselines, while using less than 1% of the original training data.

Conclusion

In e‑commerce, multimodal LLMs have already delivered measurable gains in content cold‑start and conversion uplift. Future work will focus on tighter integration of business goals, user profiling, and real‑time feedback to generate highly personalized interactive content.

Tags: large language model, multimodal, AIGC, content generation, data optimization, reward model
Written by DaTaobao Tech, official account of DaTaobao Technology
