Artificial Intelligence 17 min read

Multimodal Large‑Model Cover Generation AI Agent for Taobao Video and Live Streams

Taobao’s new multimodal AI Agent automatically creates high‑quality static and dynamic video covers by planning tasks, consulting a memory of quality criteria, executing frame selection with ReKV streaming and dual‑stage evaluation, generating marketing copy via fine‑tuned Qwen2.5‑7B, and refining layout, resulting in significantly higher click‑through rates, lower latency, and reduced manual effort.

DaTaobao Tech
DaTaobao Tech
DaTaobao Tech
Multimodal Large‑Model Cover Generation AI Agent for Taobao Video and Live Streams

Taobao is shifting from traditional e‑commerce to content‑driven commerce. Low‑quality video or live‑stream covers reduce click‑through rates, especially when users view content on mobile data (static covers) versus Wi‑Fi (dynamic covers).

To address this, the team built a modular AI Agent that automatically generates high‑quality static and dynamic covers using multimodal large models. The system consists of four core modules:

Planning : a large‑language‑model planner parses business requirements, splits them into visual, textual, and layout tasks, and classifies them as mandatory (HavetoHave) or optional (BettertoHave).

Memory : a knowledge base stores cover quality criteria derived from consumption data and guides frame‑selection and evaluation.

Action : executes concrete operations such as long‑video streaming inference (ReKV architecture), dual‑stage intelligent frame selection, marketing copy generation (Qwen2.5‑7B fine‑tuned), and automatic text layout.

Reflection : a quality‑assessment model reviews the generated cover, compares it against Memory criteria, and iteratively refines the result.

The ReKV streaming video engine reduces computation by using sliding‑window attention and KV‑cache, enabling efficient processing of long videos and improving frame‑selection latency.

In the dual‑stage frame selection , a first stage uses Mantis‑8B‑Idefics2 to perform a global scan of the video, while a second stage applies a high‑resolution image quality evaluator with chain‑of‑thought prompting and model ensemble to ensure fine‑grained visual quality.

For marketing copy , a data‑generation pipeline creates content‑title pairs, which are used to fine‑tune Qwen2.5‑7B. The generated titles are concise, highlight core product selling points, and are fed to the layout module.

The layout module predicts optimal text placement by avoiding important visual elements (faces, products) and selects font style and color that harmonize with the background.

Experimental results show significant click‑through improvements over baseline dynamic‑cover solutions and lower GPU memory/latency for video QA tasks (benchmarked against ICLR‑2025 submissions). The system has been deployed across multiple Taobao scenarios, consistently boosting user engagement and reducing manual cover‑creation costs.

Overall, the multimodal large‑model AI Agent provides a flexible, white‑box, and scalable solution for automated cover generation, supporting diverse business requirements while maintaining high visual quality.

AILarge ModelsmultimodalContent AIcover generationvideo processing
DaTaobao Tech
Written by

DaTaobao Tech

Official account of DaTaobao Technology

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.