Multimodal Recall Solution for KDD Cup 2020: ImageBERT and LXMERT Based Approach

The second‑place team tackled KDD Cup 2020’s Multimodal Recall challenge by fine‑tuning ImageBERT and LXMERT on query‑image pairs, generating negatives, applying AMSoftmax and multi‑similarity losses, ensembling weighted predictions, and using score‑based post‑processing, boosting NDCG@5 to 0.8352 and powering Meituan’s multimodal search pipeline.

Meituan Technology Team

Sep 24, 2020

The ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) is a top international venue for data mining. KDD Cup 2020 set up four tracks, including Debiasing, Multimodalities Recall, AutoGraph, and Adversarial Learning. Meituan Search’s advertising algorithm team won first place in the Debiasing track and also placed highly in the other tracks.

This article introduces the technical solution of the second‑place team in the Multimodalities Recall track and its practical application in Meituan Search.

1. Background

Meituan Search is a typical multimodal search engine where queries (text) need to retrieve results across multiple modalities such as images, videos, and POI data. Ensuring relevance between a query and multimodal results is challenging.

The Multimodalities Recall task asks participants to rank product images for a given query and return the top‑5 most relevant images.

2. Task Description

The competition provides a large e‑commerce dataset containing query‑image pairs. The training set contains only positive pairs, while the validation and test sets contain many candidate images per query. Evaluation uses Normalized Discounted Cumulative Gain at 5 (NDCG@5).
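
To make the metric concrete, here is a minimal sketch of NDCG@5 for a single query. It assumes binary relevance labels and a linear gain, and it computes the ideal ranking from the labels of the ranked list itself. The article does not state the competition's exact gain formula, so treat `dcg_at_k` and `ndcg_at_k` as illustrative names rather than the official scorer.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k relevance labels."""
    # rank is 0-based, so the discount for the first position is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query, given labels in the order the model ranked them."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant image among the candidates
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # The only relevant image was ranked second instead of first.
    print(ndcg_at_k([0, 1, 0, 0, 0]))
```

The leaderboard score would then be the mean of this value over all queries.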

Example query: "leopard-print women's shoes". In the accompanying figure, the left image is relevant to the query, while the right image is not.

3. Classic Solutions

Two common multimodal matching paradigms are:

Map each modality to its own feature space and learn an implicit distance function via deep interaction.

Map modalities into a shared space and compute an explicit similarity score.

Recent vision‑language pre‑training (VLP) models such as ImageBERT and LXMERT follow the second paradigm.

4. Our Method: Transformer‑Based Ensembled Models (TBEM)

We built three base models: ImageBERT‑A, ImageBERT‑B, and LXMERT. All models use the 2048‑dim image features extracted by Faster‑RCNN, so no raw image processing is required.

4.1 Data Analysis & Processing

Construct negative samples by replacing the query in positive pairs.

Set the maximum number of object boxes to 10 and the maximum query length to 20 tokens.

Apply post‑processing strategies based on the distribution of query‑image scores.
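
The negative-sampling step can be sketched as follows. This is a minimal illustration that assumes uniform random replacement of the query. The article does not say how replacement queries were chosen, and `build_training_pairs`, `negatives_per_positive`, and `seed` are hypothetical names introduced here.

```python
import random


def build_training_pairs(positive_pairs, negatives_per_positive=1, seed=0):
    """Return (query, image_id, label) triples: label 1 for positives, 0 for negatives.

    Negatives keep the image and swap in a different query, skipping any
    replacement that would recreate a known positive pair.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    queries = sorted({query for query, _ in positive_pairs})
    samples = [(query, image_id, 1) for query, image_id in positive_pairs]
    for query, image_id in positive_pairs:
        # Queries that are not already matched with this image.
        candidates = [q for q in queries if q != query and (q, image_id) not in positives]
        count = min(negatives_per_positive, len(candidates))
        for other_query in rng.sample(candidates, count):
            samples.append((other_query, image_id, 0))
    return samples


if __name__ == "__main__":
    pairs = [("red dress", "img1"), ("leopard-print shoes", "img2"), ("laptop bag", "img3")]
    for sample in build_training_pairs(pairs):
        print(sample)
```

Scanning every query for each pair is fine for a sketch. At the scale of the competition data, sampling with rejection would be the more practical choice.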

4.2 Model Construction & Training

We adopted the latest VLP models:

LXMERT : added target‑box class text features to the visual stream and used a two‑layer fully connected head for binary relevance classification (GeLU + LayerNorm + Cross‑Entropy).

ImageBERT‑A : trained only on the relevance task (no MLM), used a unified segment embedding (0), and optimized with AMSoftmax loss.

ImageBERT‑B : removed the position embeddings for image boxes, used separate segment embeddings (0 for text, 1 for image), and trained with the same loss as LXMERT.

All models were initialized from BERT‑Base weights and fine‑tuned on the generated training data.

4.3 Loss Functions & Fine‑Tuning

AMSoftmax for ImageBERT‑A.

Multi‑Similarity Loss combined with Cross‑Entropy for LXMERT.

Same loss setup for ImageBERT‑B as LXMERT.
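
For readers unfamiliar with AMSoftmax, the sketch below computes the loss for a single example in plain Python. It follows the standard additive-margin formulation: cosine logits, a margin subtracted from the target class only, then a scaled softmax cross-entropy. The `margin=0.35` and `scale=30.0` defaults are illustrative assumptions, not the team's settings, which the article does not report.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def am_softmax_loss(feature, class_weights, target, margin=0.35, scale=30.0):
    """Additive-margin softmax loss for one example.

    feature: embedding of the query-image pair.
    class_weights: one weight vector per class (e.g. irrelevant, relevant).
    target: index of the true class.
    """
    f = l2_normalize(feature)
    cosines = [sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale.
    logits = [
        scale * (cos - margin) if j == target else scale * cos
        for j, cos in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(am_softmax_loss([0.2, 0.9], weights, target=1))
```

With `margin=0.0` this reduces to ordinary softmax cross-entropy on scaled cosine similarities. The margin forces the target class to win by a gap rather than by any positive amount.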

After fine‑tuning, we performed data oversampling based on query similarity between the training set and test‑B set, and further fine‑tuned each model.

4.4 Ensemble & Post‑Processing

We ensembled the four prediction files (LXMERT, ImageBERT‑A, ImageBERT‑B, ImageBERT‑A′) using a weighted sum with weights 0.3:0.2:0.3:0.2, determined by grid search on the validation set.
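
A weighted-sum ensemble of this kind might look like the sketch below. It assumes each prediction file has been loaded into a dictionary keyed by `(query_id, image_id)` and that the models' scores are on comparable scales. Neither detail is given in the article, and `weighted_ensemble` and `top_k` are hypothetical helper names.

```python
def weighted_ensemble(model_scores, weights):
    """Fuse per-model scores with a weighted sum.

    model_scores: list of dicts mapping (query_id, image_id) -> score.
    weights: one float per model, in the same order.
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    keys = set(model_scores[0])
    if any(set(scores) != keys for scores in model_scores):
        raise ValueError("all models must score the same (query, image) pairs")
    return {
        key: sum(w * scores[key] for w, scores in zip(weights, model_scores))
        for key in keys
    }


def top_k(fused_scores, k=5):
    """Group fused scores by query and return the k best image ids per query."""
    by_query = {}
    for (query_id, image_id), score in fused_scores.items():
        by_query.setdefault(query_id, []).append((score, image_id))
    return {
        query_id: [img for _, img in sorted(items, key=lambda t: t[0], reverse=True)[:k]]
        for query_id, items in by_query.items()
    }


if __name__ == "__main__":
    # Toy scores for LXMERT, ImageBERT-A, ImageBERT-B, ImageBERT-A' (made-up values).
    models = [
        {("q1", "a"): 0.9, ("q1", "b"): 0.1},
        {("q1", "a"): 0.2, ("q1", "b"): 0.9},
        {("q1", "a"): 0.8, ("q1", "b"): 0.3},
        {("q1", "a"): 0.4, ("q1", "b"): 0.7},
    ]
    fused = weighted_ensemble(models, [0.3, 0.2, 0.3, 0.2])
    print(top_k(fused))
```

A grid search over the weights would wrap this in a loop and keep the combination with the best validation NDCG@5.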

Post‑processing steps:

If the score gap between the top‑1 and top‑2 images for a query exceeds a threshold, keep only the top‑1 pair.

Otherwise, remove all pairs containing that image to avoid duplicate image‑query assignments.
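
The sketch below shows one possible reading of these steps, not a confirmed reconstruction of the team's code. It assumes that an image counts as confidently assigned to a query when it is that query's top-1 and leads the top-2 by more than a threshold, and that it is then dropped from the candidate lists of other queries. The function name `drop_confident_duplicates` and the `gap_threshold` value are hypothetical.

```python
def drop_confident_duplicates(candidates, gap_threshold=0.5):
    """Remove confidently assigned images from other queries' candidate lists.

    candidates: dict mapping query_id -> list of (image_id, score).
    Returns a new dict with the same structure.
    """
    # Pass 1: record which queries "own" an image with a clear top-1 margin.
    owners = {}
    for query_id, scored in candidates.items():
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] > gap_threshold:
            owners.setdefault(ranked[0][0], set()).add(query_id)

    # Pass 2: keep a pair unless its image is owned by some other query only.
    cleaned = {}
    for query_id, scored in candidates.items():
        cleaned[query_id] = [
            (image_id, score)
            for image_id, score in scored
            if image_id not in owners or query_id in owners[image_id]
        ]
    return cleaned


if __name__ == "__main__":
    candidates = {
        "q1": [("a", 0.95), ("b", 0.20)],
        "q2": [("a", 0.60), ("c", 0.55)],
    }
    print(drop_confident_duplicates(candidates))
```

In the toy input, image `a` wins `q1` by a wide margin, so it is removed from `q2`, whose own top two scores are too close to claim anything.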

These strategies boosted NDCG@5 from 0.7098 to 0.8352 on the validation set.

5. Application in Meituan Search

The multimodal retrieval pipeline is deployed across Meituan’s five‑stage search architecture (data, recall, ranking, re‑ranking, presentation). Multimodal representations (ImageBERT) and fusion improve candidate recall, ranking quality, and user experience for both image and video search.

6. Conclusion

We presented a multimodal recall solution based on ImageBERT and LXMERT, enhanced by data preprocessing, model ensemble, and post‑processing. The approach achieved strong performance in KDD Cup 2020 and demonstrates the feasibility of transferring cutting‑edge vision‑language models to large‑scale e‑commerce search scenarios.

References

Ba et al., Layer Normalization, 2016.

Hendrycks & Gimpel, GeLU, 2016.

Qi et al., ImageBERT, 2020.

Tan & Bansal, LXMERT, 2019.

Wang et al., AMSoftmax, 2018.

Wang et al., Multi‑Similarity Loss, 2019.

Li et al., Oscar, 2020.

