
Multimodal Content Understanding in Baidu Commercial Systems: The ViCAN Model and Its Applications

This article presents Baidu's exploration of multimodal content understanding for commercial advertising: the ViCAN pre‑training model, its contrastive and masked‑language learning tasks, its integration across the recall, ranking, and risk‑control pipelines, embedding quantization with MMDict, and future AIGC‑driven generation, all backed by extensive experiments and a Q&A.

DataFunSummit

The presentation introduces the background and challenges of multimodal content understanding in Baidu's commercial advertising system, where rich media such as images and videos have replaced pure text and require models that capture both scene‑specific differences and common user intents.

To address these issues, Baidu built a unified multimodal pre‑training model called ViCAN. The model leverages large‑scale image‑text pairs (≈100 billion) collected from Baidu Image Search and commercial ad data, and is trained with both coarse‑grained and fine‑grained contrastive learning as well as a multimodal masked language modeling task.

ViCAN’s architecture consists of dual 48‑layer Transformers for text and vision, with contrastive loss that aligns CLS tokens across modalities and additional token‑level alignment between image patches and text tokens. The masked language task uses cross‑attention to predict masked text tokens with visual context, improving cross‑modal reasoning.
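The CLS‑level alignment described above is typically trained with a symmetric contrastive (InfoNCE) objective of the kind popularized by CLIP. The sketch below is a minimal NumPy illustration of that objective, not ViCAN's actual implementation; the function name, temperature value, and the convention that matched pairs share a row index are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_style_contrastive_loss(text_cls, image_cls, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired [CLS] embeddings.

    text_cls, image_cls: (batch, dim) arrays; matched text/image pairs
    share the same row index, all other rows serve as in-batch negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    t = text_cls / np.linalg.norm(text_cls, axis=1, keepdims=True)
    v = image_cls / np.linalg.norm(image_cls, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # (batch, batch) similarity matrix
    n = logits.shape[0]
    # Cross-entropy with the diagonal as the positive class, both directions.
    p_t2v = softmax(logits, axis=1)         # text -> image
    p_v2t = softmax(logits, axis=0)         # image -> text
    diag = np.arange(n)
    loss_t2v = -np.log(p_t2v[diag, diag]).mean()
    loss_v2t = -np.log(p_v2t[diag, diag]).mean()
    return (loss_t2v + loss_v2t) / 2
```

Minimizing this loss pulls matched text/image CLS embeddings together while pushing apart the in‑batch negatives, which is what makes the two towers share one retrieval space.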

Integration of ViCAN into Baidu's ad pipeline enhances three key stages: recall, creative selection, and ranking. In recall, multimodal triggers replace text‑only triggers and multimodal features are added to user‑behavior graphs using a two‑layer domain‑aware aggregation. In the creative stage, ViCAN upgrades image‑text relevance filtering and replaces material‑ID features with continuous multimodal embeddings, improving material freshness and quality. In ranking, dense multimodal features are discretized via the MMDict method, mapping them to sparse IDs that can be combined with other categorical features, yielding higher AUC and better long‑tail performance.
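As a rough illustration of the recall‑stage change: replacing text‑only triggers with multimodal triggers amounts to nearest‑neighbor retrieval in the shared embedding space. The brute‑force sketch below (function name and shapes are hypothetical, and a production system would use an approximate nearest‑neighbor index rather than a full scan) shows the idea.

```python
import numpy as np

def multimodal_trigger_recall(query_emb, ad_embs, k=3):
    """Retrieve the top-k ad materials whose multimodal embeddings are
    closest, by cosine similarity, to the query embedding.

    query_emb: (dim,) embedding of the user query/context.
    ad_embs:   (num_ads, dim) embeddings of candidate ad materials.
    Returns the indices of the k best-matching ads, best first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    a = ad_embs / np.linalg.norm(ad_embs, axis=1, keepdims=True)
    scores = a @ q                      # cosine similarity per candidate
    return np.argsort(-scores)[:k]      # descending order of similarity
```

Because queries and ad images/videos live in the same ViCAN embedding space, the same scan works whether the query side is text, an image, or a fused user representation.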

The article also describes the MMDict quantization pipeline, which uses a shared transformer encoder followed by multi‑stage quantization (coarse, residual, product quantization) to produce hierarchical discrete IDs (10⁴‑10⁶ scale) that preserve semantic granularity while remaining efficient for large‑scale recommendation models.
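The residual stage of such a pipeline can be sketched as follows: each codebook quantizes whatever the previous stages failed to reconstruct, so the concatenated per‑stage IDs form a coarse‑to‑fine hierarchy. This is a generic residual‑quantization sketch under assumed shapes, not the MMDict implementation itself.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Multi-stage residual quantization of one dense embedding.

    x:         (dim,) dense embedding.
    codebooks: list of (num_codes, dim) arrays, ordered coarse to fine.
    Returns (ids, reconstruction): one discrete ID per stage, plus the
    sum of the selected codewords.
    """
    ids = []
    recon = np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:
        # Pick the codeword nearest (L2) to the current residual.
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        ids.append(idx)
        recon = recon + cb[idx]
        residual = x - recon  # next stage quantizes what is still missing
    return ids, recon
```

With, say, three stages of 100 codes each, the ID tuple spans 10⁶ distinct cells while each table stays tiny, which is how a hierarchical scheme reaches the 10⁴–10⁶ vocabulary scale cheaply; the sparse ID tuple can then be fed to embedding tables alongside ordinary categorical features.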

Finally, the authors discuss future directions with AIGC: using ViCAN for image‑to‑text generation to produce fine‑grained descriptions that guide text‑to‑image diffusion models, creating a data flywheel that continuously improves both generation and understanding. A Q&A section clarifies model availability, evaluation metrics, and data‑cleaning practices.

Tags: advertising, AI, Multimodal, AIGC, pretraining, Large-Scale Models
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
