Mixture of Multi‑Modal Experts for Advertising Recall
The Mixed‑Modal Expert Model combines ID features with image and text embeddings through optimized representations and conditional output fusion, dramatically improving advertising recall—especially for long‑tail items—and delivering measurable gains in click‑recall, revenue, CTR, and page views in large‑scale online tests.
Background
The advertising pipeline consists of recall, coarse ranking, and fine ranking. Recall, as the front‑end, determines the ceiling of business performance. Traditional recall relies on discrete ID features, which are efficient but suffer from incomplete information and poor generalization, especially for long‑tail or cold‑start items.
Content modalities (image, text) provide richer, more generalizable signals but lack the personalization power of IDs. This work explores how to fuse ID and content modalities in a recall model and proposes a Mixed‑Modal Expert Model.
Formal Objective and Retrieval Method
User interest recall models aim to select a top‑k candidate set from the whole inventory based on the probability of a user clicking an item. This probability is modeled as a dot product between user and item interest vectors. To reduce online inference cost, a two‑tower (dual‑encoder) retrieval framework is adopted, progressively narrowing candidates from millions to thousands while preserving high recall.
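The dot‑product scoring above can be sketched in a few lines. This is a minimal illustration, not the production system: the dimensions, item count, and random vectors are all hypothetical, and the `argpartition` trick simply stands in for the approximate nearest‑neighbor index a real deployment would use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 64-dim interest vectors for one user and 1,000 items.
user_vec = rng.standard_normal(64)
item_vecs = rng.standard_normal((1000, 64))

def top_k_by_dot_product(user_vec, item_vecs, k):
    """Score every item by its dot product with the user vector; return top-k indices."""
    scores = item_vecs @ user_vec
    # argpartition avoids fully sorting the whole inventory.
    top_k = np.argpartition(-scores, k)[:k]
    return top_k[np.argsort(-scores[top_k])]

candidates = top_k_by_dot_product(user_vec, item_vecs, k=10)
```

In production the exhaustive scan is replaced by an ANN index, which is what makes the dot‑product formulation attractive for online serving.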
Mixed‑Modal Expert Model
The model addresses three key questions: modality selection, modality representation optimization, and modality fusion.
3.1 Modality Selection
ID features (user gender, age, location, item ID, shop ID) are widely used. Content features (image, text) are abundant on the item side and can be derived from user behavior sequences.
3.2 Representation Optimization
Image and text embeddings are obtained via ViT and BERT, respectively, followed by a contrastive learning stage. ID embeddings are shallow and updated online. The two representations are combined by first pre‑training the image‑text embeddings (then freezing them) and then optimizing the ID embeddings with a Sampled Softmax loss.
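A minimal sketch of this training setup, under stated assumptions: the item vector is taken as the concatenation of a frozen content embedding and a trainable ID embedding (the post does not specify the exact combination), and the Sampled Softmax is shown as a softmax over one positive item plus a handful of sampled negatives. All sizes and data are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, id_dim, content_dim = 100, 16, 16

# Frozen content embeddings (stand-in for pretrained ViT/BERT + contrastive stage).
content_emb = rng.standard_normal((n_items, content_dim))
content_emb /= np.linalg.norm(content_emb, axis=1, keepdims=True)

# Shallow ID embeddings, the only part updated during this stage.
id_emb = rng.standard_normal((n_items, id_dim)) * 0.1

def item_vector(i):
    # Assumed combination: frozen content part concatenated with trainable ID part.
    return np.concatenate([content_emb[i], id_emb[i]])

def sampled_softmax_loss(user_vec, pos_item, neg_items):
    """Cross-entropy over the positive item plus a sampled set of negatives."""
    logits = np.array([item_vector(i) @ user_vec for i in [pos_item, *neg_items]])
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # positive sits at index 0

user_vec = rng.standard_normal(id_dim + content_dim)
neg = [i for i in rng.choice(n_items, size=10, replace=False) if i != 0]
loss = sampled_softmax_loss(user_vec, pos_item=0, neg_items=neg)
```

The key property is that gradients flow only into `id_emb`, so the ID space aligns itself to the fixed content space rather than the reverse.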
3.3 Modality Fusion
Feature Fusion: Cosine similarities between user and item image‑text features are histogram‑pooled and used as additional features. This yields +1.6 pt click‑recall and +2.2 pt long‑tail click‑recall.
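One way to read "histogram‑pooled" is sketched below: compute cosine similarities between the content embeddings of the items in a user's behavior sequence and the candidate item, then bucket them into a fixed‑size, order‑invariant histogram. The bin count, range, and normalization are assumptions for illustration, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: content embeddings for 50 behavior-sequence items and one
# candidate item, L2-normalized so that dot product equals cosine similarity.
seq_emb = rng.standard_normal((50, 32))
seq_emb /= np.linalg.norm(seq_emb, axis=1, keepdims=True)
cand_emb = rng.standard_normal(32)
cand_emb /= np.linalg.norm(cand_emb)

def histogram_pool(seq_emb, cand_emb, n_bins=8):
    """Bucket the cosine similarities into a fixed-size histogram feature."""
    sims = seq_emb @ cand_emb                              # values in [-1, 1]
    hist, _ = np.histogram(sims, bins=n_bins, range=(-1.0, 1.0))
    return hist / len(sims)                                # normalized counts

feat = histogram_pool(seq_emb, cand_emb)
```

The resulting fixed‑length vector can be fed to the model alongside the ID features regardless of how long the behavior sequence is.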
Output Fusion: Separate experts produce interest scores for the ID and content modalities; a learnable weighted sum produces the final score. This improves click‑recall by +0.5 pt and long‑tail click‑recall by +3.7 pt.
Conditional Output Fusion: Gate weights are conditioned on the log‑bucketed item click volume, allowing the model to adaptively rely more on content features for cold‑start items and on ID features for hot items. This adds +1.9 pt click‑recall and +2.2 pt long‑tail click‑recall.
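The conditional gate can be sketched as follows. Everything here is an assumption chosen to illustrate the mechanism: base‑2 log bucketing, ten buckets, and hand‑set gate weights standing in for what would be learned parameters; in the real model the gate is trained end‑to‑end with the two experts.

```python
import numpy as np

def log_bucket(clicks, n_buckets=10):
    """Bucket item click volume on a log scale (assumption: base-2 buckets)."""
    return min(int(np.log2(clicks + 1)), n_buckets - 1)

# One gate weight per click-volume bucket. Hand-set for illustration (in training
# these are learnable): low-traffic buckets lean on the content expert, high-traffic
# buckets lean on the ID expert.
content_weight = np.linspace(0.8, 0.2, 10)  # hypothetical learned values

def fused_score(id_score, content_score, clicks):
    w = content_weight[log_bucket(clicks)]
    return w * content_score + (1.0 - w) * id_score

cold = fused_score(id_score=0.1, content_score=0.9, clicks=3)       # content-heavy
hot = fused_score(id_score=0.9, content_score=0.1, clicks=100000)   # ID-heavy
```

Conditioning the gate on click volume is what lets a single model serve both regimes: cold‑start items get scored mostly by their generalizable content signal, while well‑observed items keep the precision of their ID embeddings.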
Experimental Results
Offline metrics show a +4.0 pt overall click‑recall and +8.1 pt long‑tail click‑recall after adding multimodal features. Online A/B tests on the display platform report +2.33 % revenue, +0.82 % CTR, and +5.24 % PV for long‑tail ads.
About the Team
The Alibaba Mama Match team focuses on large‑scale retrieval, user interest modeling, and cold‑start solutions for display advertising. They welcome talented engineers to join.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.