How ConaCLIP Boosts Lightweight Text-Image Retrieval with Dual‑Encoder Distillation

ConaCLIP introduces a fully‑connected knowledge interaction graph for distilling large dual‑encoder models into compact ones, improving both the accuracy and the efficiency of text‑image retrieval on edge devices. Extensive experiments on supervision strategies demonstrate significant gains over existing baselines.


Background

Text‑image retrieval aims to return the most relevant images from a large collection given a textual query, and it is a key component of cross‑modal applications such as e‑commerce platforms. Existing models fall into two categories: cross‑encoders, which model deep interactions between text and image but are slow at inference, and dual‑encoders, which encode the two modalities separately, so that the image collection can be indexed offline and searched with fast Approximate Nearest Neighbor (ANN) lookups.
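To make the dual‑encoder workflow concrete, here is a minimal, self‑contained sketch in PyTorch. The random linear layers are hypothetical stand‑ins for the real text and image towers; the point is that the image side is embedded once offline, and each query needs only one text‑encoder forward pass plus a nearest‑neighbor lookup.

```python
import torch
import torch.nn.functional as F

# Placeholder towers: real systems use transformer encoders (e.g. CLIP's);
# random linear projections keep this sketch self-contained and runnable.
embed_dim = 64
text_encoder = torch.nn.Linear(128, embed_dim)
image_encoder = torch.nn.Linear(256, embed_dim)

# Offline: embed and index the whole image collection once.
image_feats = torch.randn(10_000, 256)               # fake raw image features
image_index = F.normalize(image_encoder(image_feats), dim=-1)

# Online: one forward pass per query, then a top-k similarity lookup.
# Exact search is shown here; production systems swap in an ANN library
# such as Faiss.
query_feats = torch.randn(1, 128)                    # fake raw text features
q = F.normalize(text_encoder(query_feats), dim=-1)
scores = q @ image_index.T                           # cosine similarity
top5 = scores.topk(k=5, dim=-1).indices              # ids of best matches
print(top5)
```

Because the two towers never interact until the final dot product, swapping in a smaller student encoder changes nothing about this serving path, which is why distillation is such a natural fit here.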

Although dual‑encoders are preferred for real‑world use, state‑of‑the‑art models such as CLIP are still too heavy for edge devices or dynamic indexing scenarios. To address this, the authors distill large pretrained dual‑encoder models into smaller, faster ones during the pre‑training stage.

Algorithm Overview

ConaCLIP proposes a fully‑connected knowledge interaction graph for pre‑training distillation. In addition to the usual intra‑modal teacher‑student interaction, it incorporates intra‑modal student‑student, inter‑modal teacher‑student, and inter‑modal student‑student interactions, as illustrated in Figure 1.

The fully‑connected graph acts as a multi‑view, multi‑task learning framework that strengthens the robustness and effectiveness of the pre‑trained student. Various supervision strategies for these interactions are explored and evaluated.
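As a minimal sketch of the idea, assume each graph edge that involves a student contributes one distillation term over a batch of embeddings. The FD loss (defined in the next section) is used on every edge purely for brevity; in ConaCLIP each edge gets its own best‑performing supervision, following the paper's Figure 1.

```python
import itertools
import torch
import torch.nn.functional as F

# Hypothetical L2-normalized batch embeddings from the four encoders
# (teacher/student x text/image); shapes are (batch, dim).
B, D = 32, 64
emb = {
    "teacher_text":  F.normalize(torch.randn(B, D), dim=-1),
    "teacher_image": F.normalize(torch.randn(B, D), dim=-1),
    "student_text":  F.normalize(torch.randn(B, D), dim=-1),
    "student_image": F.normalize(torch.randn(B, D), dim=-1),
}

def fd(a, b):
    # Feature-wise distance between paired embeddings (see the loss list below).
    return (a - b).pow(2).sum(-1).mean()

# Fully connected interaction graph: every encoder pair that includes a
# student adds a supervision edge; the frozen teacher-teacher edge carries
# no gradient and is skipped.
total_loss = sum(
    fd(emb[u], emb[v])
    for u, v in itertools.combinations(emb, 2)
    if "student" in u or "student" in v
)
print(total_loss)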

Supervision Strategies

The following loss functions are employed:

InfoNCE loss: the contrastive loss already used in MoTIS, which pulls matched text‑image pairs together and pushes apart in‑batch negatives.

Feature‑wise distance (FD) loss: minimizes the squared L2 distance between paired feature vectors.

Similarity‑wise distance (SD) loss: minimizes the distance between the similarity matrices produced by two encoder pairs.

KL‑Div loss: uses the Kullback–Leibler divergence to align a predicted probability distribution with a target distribution.

For the SD and KL‑Div losses, the standard approach uses the outputs of the two teacher encoders as targets for the two student encoders. The authors also experiment with symmetric versions (Sym) that use the paired arrows in Figure 1 as mutual learning targets, deepening the interaction among the four encoders. All four losses, together with one reading of the symmetric variant, are sketched below.
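Here is a sketch of the four losses over L2‑normalized embeddings, plus one plausible pairing for the Sym variant. The function names and the temperature value are illustrative; the exact Sym pairings follow the paper's Figure 1.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # Contrastive loss with in-batch negatives, as used in MoTIS/CLIP.
    logits = a @ b.T / tau
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

def fd_loss(student, teacher):
    # Feature-wise distance: squared L2 between paired embeddings.
    return (student - teacher).pow(2).sum(-1).mean()

def sd_loss(s_text, s_image, t_text, t_image):
    # Similarity-wise distance: the students' text-image similarity matrix
    # is pulled toward the teachers' similarity matrix.
    return ((s_text @ s_image.T) - (t_text @ t_image.T)).pow(2).mean()

def kl_div_loss(p_text, p_image, q_text, q_image, tau=0.07):
    # KL divergence between the softmax similarity rows of two encoder
    # pairs; the (q_text, q_image) pair provides the target distribution.
    log_p = F.log_softmax(p_text @ p_image.T / tau, dim=-1)
    q = F.softmax(q_text @ q_image.T / tau, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

def sym_kl_div(s_text, s_image, t_text, t_image, tau=0.07):
    # One plausible Sym pairing: mixed teacher/student pairs act as mutual
    # learning targets instead of teachers only supervising students.
    a = kl_div_loss(s_text, t_image, t_text, s_image, tau)
    b = kl_div_loss(t_text, s_image, s_text, t_image, tau)
    return 0.5 * (a + b)
```

In the standard setup the teacher outputs would be detached (they are frozen in any case); in the Sym variants both ends of a paired arrow receive gradients, which is what makes the learning mutual.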

Supervision Strategy Selection

Experiments show that, with an appropriate supervision strategy, each interaction type yields a significant improvement over the baseline, and the performance of each type depends heavily on the loss function chosen. The symmetric versions (Sym‑SD and Sym‑KL‑Div) consistently outperform their standard counterparts, and the final method integrates all of the effective combinations.

Algorithm Accuracy Evaluation

ConaCLIP was evaluated on several standard text‑image retrieval datasets, demonstrating notable improvements across all metrics compared to existing methods and baseline models.

When applied to an end‑to‑end cross‑modal retrieval scenario on Alibaba’s e‑commerce platform, ConaCLIP achieved higher retrieval performance while reducing model size and increasing computational efficiency.

The method will be contributed to the EasyNLP framework, and NLP researchers and practitioners are invited to adopt it.

EasyNLP repository: https://github.com/alibaba/EasyNLP

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
