
Multimodal Pretraining for Search Recall in E-commerce

The paper proposes a multimodal pre‑training framework that jointly encodes query text and item titles with images via shared and single‑stream towers, using MLM, MPM, QIC, and matching tasks. It demonstrates substantial Recall@K gains on a billion‑item e‑commerce catalog by leveraging visual cues to bridge the semantic gap.

DaTaobao Tech

Search recall is the foundation of a search system: it determines the upper bound on what downstream ranking can achieve. In e‑commerce, queries and items naturally call for cross‑modal retrieval, yet traditional recall relies mainly on textual features such as titles and ignores visual information.

We explore multimodal pre‑training combined with recall to bring incremental value. The proposed approach introduces a text‑image pre‑training model where a Query tower and an Item tower (containing both title and image) are encoded separately and then fused by a cross‑modal encoder. The Item tower uses a single‑stream design for title‑image fusion, while the Query tower shares the encoder with the Item tower to bridge the semantic gap between short queries and long, SEO‑optimized titles.
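To make the tower layout concrete, here is a minimal numpy sketch, not the paper's implementation: a single self‑attention layer stands in for the shared encoder, the Item tower feeds title tokens and image patches through it as one concatenated sequence (the single‑stream design), and the Query tower reuses the very same weights. All names, dimensions, and the mean‑pooling step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding width (illustrative)

def attention_layer(x, Wq, Wk, Wv):
    """Single-head self-attention, standing in for the shared encoder."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# One set of weights serves both towers (the "shared encoder").
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.05 for _ in range(3))

title_tokens  = rng.normal(size=(12, D))  # title token embeddings
image_patches = rng.normal(size=(9, D))   # image patch embeddings
query_tokens  = rng.normal(size=(4, D))   # query token embeddings

# Item tower: single-stream -- title tokens and image patches form
# ONE sequence, so attention fuses the two modalities directly.
item_seq = np.concatenate([title_tokens, image_patches], axis=0)
item_vec = attention_layer(item_seq, Wq, Wk, Wv).mean(axis=0)

# Query tower: the same encoder weights, which is what bridges the
# gap between short queries and long, SEO-optimized titles.
query_vec = attention_layer(query_tokens, Wq, Wk, Wv).mean(axis=0)

print(item_vec.shape, query_vec.shape)  # (64,) (64,)
```

In a real model the encoder would be a multi‑layer Transformer and pooling would likely use a [CLS] token, but the weight sharing and single‑stream concatenation shown here are the structural points the paper emphasizes.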

Four pre‑training tasks are designed to align with the downstream vector‑recall task:

Masked Language Modeling (MLM) on Query and Title tokens.

Masked Patch Modeling (MPM) on image patches.

Query‑Item Classification (QIC) using a dual‑tower inner‑product and AM‑Softmax loss.

Query‑Item Matching (QIM) and Query‑Image Matching (QIM2) to directly model query‑item and query‑image similarity.
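Of the four tasks, QIC is the one tied most directly to vector recall. The sketch below shows one common formulation of an AM‑Softmax loss over query–item cosine similarities with in‑batch negatives; the function name, scale `s`, and margin `m` are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def am_softmax_loss(query_vecs, item_vecs, labels, s=30.0, m=0.35):
    """AM-Softmax over query-item cosine similarities (QIC sketch).

    labels[i] is the index of the positive item for query i; every
    other item in the batch serves as a negative.
    """
    # L2-normalize so inner products become cosine similarities.
    q  = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    it = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    cos = q @ it.T                       # (B, B) cosine similarity matrix

    logits = s * cos
    rows = np.arange(len(labels))
    # Additive margin: penalize the positive logit, forcing a gap
    # between positives and negatives in cosine space.
    logits[rows, labels] = s * (cos[rows, labels] - m)

    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```

The margin `m` is what distinguishes AM‑Softmax from a plain softmax over cosines: the positive pair must beat every negative by a fixed cosine gap, which tightens the query–item embedding space before it is handed to the recall model.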

During downstream training, the pre‑trained vectors are projected by a fully‑connected layer and combined with ID embeddings. The combined representation is fed into a dual‑tower vector‑recall model that scores billions of candidates via inner‑product. Sampled Softmax with massive negative sampling is used to approximate full‑softmax probabilities.
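A rough numpy sketch of this downstream step, under the assumption of a simple dense projection plus additive ID‑embedding fusion (the paper's exact fusion layer is not specified here); all names and sizes are hypothetical, and the sampling‑bias correction of full Sampled Softmax is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(7)
B, D, N_NEG = 8, 32, 64  # batch size, vector dim, sampled negatives (illustrative)

def sampled_softmax_loss(q, pos, negs):
    """Softmax over one positive and a small sampled negative set,
    approximating the full softmax over a billion-item pool."""
    pos_logit  = (q * pos).sum(axis=1, keepdims=True)  # (B, 1) inner products
    neg_logits = q @ negs.T                            # (B, N_NEG)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()                     # positive sits at column 0

# Hypothetical fusion: pretrained multimodal vectors pass through a
# dense projection and are added to ID embeddings.
pretrained, id_emb = rng.normal(size=(B, D)), rng.normal(size=(B, D))
proj = rng.normal(size=(D, D)) * 0.1
query = pretrained @ proj + id_emb       # combined query representation

item      = rng.normal(size=(B, D))      # matching item-tower outputs
negatives = rng.normal(size=(N_NEG, D))  # negatives sampled from the catalog
loss = sampled_softmax_loss(query, item, negatives)
```

Because both towers reduce to a single vector scored by inner product, serving time reduces to approximate nearest‑neighbor search over precomputed item vectors, which is what makes scoring billions of candidates feasible.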

Extensive experiments on a 1‑billion‑item pool show that the dual‑tower multimodal model significantly outperforms single‑tower baselines on Recall@K and improves item relevance. Ablation studies confirm the importance of (1) placing MLM/MPM in the Item tower, (2) sharing encoders for Query and Item, and (3) using AM‑Softmax in QIC.

Further analysis demonstrates that the multimodal model captures visual cues (e.g., color, pattern) absent from titles, thereby reducing the semantic gap and providing valuable incremental recall results.

In summary, a text‑image pre‑training framework tightly coupled with vector recall yields notable gains in large‑scale e‑commerce search, and the methodology can be extended to other scenarios such as product understanding and ranking.

Tags: e-commerce, deep learning, multimodal, pretraining, search recall, vector retrieval
Written by DaTaobao Tech

Official account of DaTaobao Technology