Mixture of Multi‑Modal Experts for Advertising Recall
The Mixed‑Modal Expert Model combines ID features with image and text embeddings through optimized representations and conditional output fusion, dramatically improving advertising recall—especially for long‑tail items—and delivering measurable gains in click‑recall, revenue, CTR, and page views in large‑scale online tests.
Background
The advertising pipeline consists of recall, coarse ranking, and fine ranking. Recall, as the front‑end, determines the ceiling of business performance. Traditional recall relies on discrete ID features, which are efficient but suffer from incomplete information and poor generalization, especially for long‑tail or cold‑start items.
Content modalities (image, text) provide richer, more generalizable signals but lack the personalization power of IDs. This work explores how to fuse ID and content modalities in a recall model and proposes a Mixed‑Modal Expert Model.
Formal Objective and Retrieval Method
User interest recall models aim to select a top‑k candidate set from the whole inventory based on the probability of a user clicking an item. This probability is modeled as a dot product between user and item interest vectors. To reduce online inference cost, a two‑tower (dual‑encoder) retrieval framework is adopted, progressively narrowing candidates from millions to thousands while preserving high recall.
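The dot‑product scoring above can be sketched in a few lines. This is a minimal illustration, not the production system: the dimensions, item count, and random vectors are all hypothetical, and the `argpartition` trick simply stands in for the approximate nearest‑neighbor index a real deployment would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 64-dim interest vectors for one user and 1,000 items.
user_vec = rng.standard_normal(64)
item_vecs = rng.standard_normal((1000, 64))

def top_k_by_dot_product(user_vec, item_vecs, k):
    """Score every item by its dot product with the user vector; return top-k indices."""
    scores = item_vecs @ user_vec
    # argpartition avoids fully sorting the whole inventory.
    top_k = np.argpartition(-scores, k)[:k]
    return top_k[np.argsort(-scores[top_k])]

candidates = top_k_by_dot_product(user_vec, item_vecs, k=10)
```

In production the exhaustive scan is replaced by an ANN index, which is what makes the dot‑product formulation attractive for online serving.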
Mixed‑Modal Expert Model
The model addresses three key questions: modality selection, modality representation optimization, and modality fusion.
3.1 Modality Selection
ID features (user gender, age, location, item ID, shop ID) are widely used. Content features (image, text) are abundant on the item side and can be derived from user behavior sequences.
3.2 Representation Optimization
Image and text embeddings are obtained via ViT and BERT, respectively, followed by a contrastive learning stage. ID embeddings are shallow and updated online. The two representations are combined by first pre‑training the image‑text embeddings (then freezing them) and then optimizing the ID embeddings with a Sampled Softmax loss.
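A minimal sketch of this training setup, under stated assumptions: the item vector is taken as the concatenation of a frozen content embedding and a trainable ID embedding (the post does not specify the exact combination), and the Sampled Softmax is shown as a softmax over one positive item plus a handful of sampled negatives. All sizes and data are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, id_dim, content_dim = 100, 16, 16

# Frozen content embeddings (stand-in for pretrained ViT/BERT + contrastive stage).
content_emb = rng.standard_normal((n_items, content_dim))
content_emb /= np.linalg.norm(content_emb, axis=1, keepdims=True)

# Shallow ID embeddings, the only part updated during this stage.
id_emb = rng.standard_normal((n_items, id_dim)) * 0.1

def item_vector(i):
    # Assumed combination: frozen content part concatenated with trainable ID part.
    return np.concatenate([content_emb[i], id_emb[i]])

def sampled_softmax_loss(user_vec, pos_item, neg_items):
    """Cross-entropy over the positive item plus a sampled set of negatives."""
    logits = np.array([item_vector(i) @ user_vec for i in [pos_item, *neg_items]])
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0

user_vec = rng.standard_normal(id_dim + content_dim)
neg = [i for i in rng.choice(n_items, size=10, replace=False) if i != 0]
loss = sampled_softmax_loss(user_vec, pos_item=0, neg_items=neg)
```

The key property is that gradients flow only into `id_emb`, so the ID space aligns itself to the fixed content space rather than the reverse.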
3.3 Modality Fusion
Feature Fusion: Cosine similarities between user and item image‑text features are histogram‑pooled and used as additional features. This yields +1.6 pt click‑recall and +2.2 pt long‑tail click‑recall.
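One way to read "histogram‑pooled" is sketched below: compute cosine similarities between the content embeddings of the items in a user's behavior sequence and the candidate item, then bucket them into a fixed‑size, order‑invariant histogram. The bin count, range, and normalization are assumptions for illustration, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: content embeddings for 50 behavior-sequence items and one
# candidate item, L2-normalized so that dot product equals cosine similarity.
seq_emb = rng.standard_normal((50, 32))
seq_emb /= np.linalg.norm(seq_emb, axis=1, keepdims=True)
cand_emb = rng.standard_normal(32)
cand_emb /= np.linalg.norm(cand_emb)

def histogram_pool(seq_emb, cand_emb, n_bins=8):
    """Bucket the cosine similarities into a fixed-size histogram feature."""
    sims = seq_emb @ cand_emb                              # values in [-1, 1]
    hist, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
    return hist / len(sims)                                # normalized counts

feat = histogram_pool(seq_emb, cand_emb)
```

The resulting fixed‑length vector can be fed to the model alongside the ID features regardless of how long the behavior sequence is.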
Output Fusion: Separate experts produce interest scores for the ID and content modalities; a learnable weighted sum produces the final score. This improves click‑recall by +0.5 pt and long‑tail click‑recall by +3.7 pt.
Conditional Output Fusion: Gate weights are conditioned on the log‑bucketed item click volume, allowing the model to adaptively rely more on content features for cold‑start items and on ID features for hot items. This adds +1.9 pt click‑recall and +2.2 pt long‑tail click‑recall.
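The conditional gate can be sketched as follows. Everything here is an assumption chosen to illustrate the mechanism: base‑2 log bucketing, ten buckets, and hand‑set gate weights standing in for what would be learned parameters; in the real model the gate is trained end‑to‑end with the two experts.

```python
import numpy as np

def log_bucket(clicks, n_buckets=10):
    """Bucket item click volume on a log scale (assumption: base-2 buckets)."""
    return min(int(np.log2(clicks + 1)), n_buckets - 1)

# One gate weight per click-volume bucket. Hand-set for illustration (in training
# these are learnable): low-traffic buckets lean on the content expert, high-traffic
# buckets lean on the ID expert.
content_weight = np.linspace(0.8, 0.2, 10)  # hypothetical learned values

def fused_score(id_score, content_score, clicks):
    w = content_weight[log_bucket(clicks)]
    return w * content_score + (1.0 - w) * id_score

cold = fused_score(id_score=0.1, content_score=0.9, clicks=3)       # content-heavy
hot = fused_score(id_score=0.9, content_score=0.1, clicks=100000)   # ID-heavy
```

Conditioning the gate on click volume is what lets a single model serve both regimes: cold‑start items get scored mostly by their generalizable content signal, while well‑observed items keep the precision of their ID embeddings.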
Experimental Results
Offline metrics show a +4.0 pt overall click‑recall and +8.1 pt long‑tail click‑recall after adding multimodal features. Online A/B tests on the display platform report +2.33 % revenue, +0.82 % CTR, and +5.24 % PV for long‑tail ads.
About the Team
The Alibaba Mama Match team focuses on large‑scale retrieval, user interest modeling, and cold‑start solutions for display advertising. They welcome talented engineers to join.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.