Can Generative Models Boost Visual‑Text Retrieval? Introducing GXN

This paper presents GXN, a generative cross‑modal feature learning framework that enhances image‑text retrieval by incorporating both high‑level semantic similarity and fine‑grained local matching through a three‑step Look‑Imagine‑Match process, achieving state‑of‑the‑art results on MSCOCO and Flickr30K.

Alibaba Cloud Developer

Introduction

Heterogeneous data such as text and images is growing explosively, which poses new challenges for search. Conventional visual‑text cross‑modal methods encode each modality separately, map both into a shared embedding space, and optimize a ranking loss that scores matched image‑text pairs higher than mismatched ones.
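As a rough illustration of the conventional ranking objective described above, the sketch below computes a bidirectional hinge ranking loss over a batch of paired embeddings. The function names, the cosine similarity, the margin value of 0.2, and the use of every other item in the batch as a negative are assumptions made for this example; the article does not specify these details.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge ranking loss over a batch of matched pairs.

    Row i of img_emb and row i of txt_emb form a matched pair; every other
    row in the batch is treated as a negative (an assumption of this sketch).
    """
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    sim = img @ txt.T           # sim[i, j] = cos(image i, sentence j)
    pos = np.diag(sim)          # similarities of the matched pairs

    # image -> text: a wrong sentence j should score below the matched one
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # text -> image: a wrong image i should score below the matched one
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    # a matched pair is not its own negative
    np.fill_diagonal(cost_i2t, 0.0)
    np.fill_diagonal(cost_t2i, 0.0)
    return float(cost_i2t.sum() + cost_t2i.sum())

if __name__ == "__main__":
    aligned = np.eye(3)
    shuffled = np.eye(3)[[1, 2, 0]]
    print(ranking_loss(aligned, aligned))    # matched pairs already rank first
    print(ranking_loss(aligned, shuffled))   # mismatched pairs are penalized
```

The loss is zero only when every matched pair beats all mismatched pairs by at least the margin, which is the "closer than dissimilar ones" behavior the paragraph describes.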

Although this high‑level semantic alignment works well, it neglects local similarities such as color, texture, and layout in images, or sentence‑level detail in text. Inspired by how a skilled painter or writer can “imagine” the expected counterpart, we propose a generative cross‑modal feature learning framework (GXN) that adds an “Imagine” step between looking and matching.

GXN framework diagram

Method

GXN consists of three modules:

Multimodal feature representation (upper region): an image encoder and two sentence encoders map visual and textual data into a common space. The two sentence encoders learn features at different levels: one captures high‑level semantics, while the other captures local, sentence‑level details, which are learned with the help of a generative model.

Image‑to‑text generation (blue path): a visual encoder feeds a sentence decoder that generates textual descriptions from visual features. The loss uses reinforcement‑learning‑style rewards to maximize the similarity between generated and ground‑truth sentences.

Text‑to‑image generative adversarial learning (green path): a generator creates images from textual features, while a discriminator learns to distinguish generated images from real ones.
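The two generative objectives in the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: `caption_pg_loss` uses a plain REINFORCE‑style estimator with a scalar baseline, and `gan_losses` uses the standard unconditional GAN losses with a non‑saturating generator term. The function names, the baseline, and the exact loss forms are hypothetical choices; the article only says that rewards and an adversarial discriminator are used.

```python
import numpy as np

def caption_pg_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE-style loss for one sampled caption.

    token_log_probs: log-probability of each sampled word under the decoder.
    reward: similarity score between the sampled and ground-truth sentence.
    baseline: reference reward used to reduce variance (assumed here).
    """
    advantage = reward - baseline
    # Minimizing this raises the likelihood of captions that beat the baseline.
    return float(-advantage * np.sum(token_log_probs))

def gan_losses(d_real, d_fake, eps=1e-8):
    """Standard GAN losses from discriminator probabilities in (0, 1).

    d_real: discriminator outputs on real images.
    d_fake: discriminator outputs on generated images.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # Discriminator: push d_real toward 1 and d_fake toward 0.
    d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    # Generator (non-saturating form): push d_fake toward 1.
    g_loss = -np.mean(np.log(d_fake + eps))
    return float(d_loss), float(g_loss)

if __name__ == "__main__":
    log_probs = np.log([0.5, 0.5])
    print(caption_pg_loss(log_probs, reward=1.0, baseline=0.0))
    print(gan_losses([0.9], [0.1]))
```

In a real system the reward would come from a sentence‑similarity metric and the discriminator outputs from a trained network; both are replaced by plain numbers here to keep the sketch self‑contained.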

GXN method architecture

During inference, only the learned cross‑modal features are needed: retrieval ranks candidates by the similarity between image and text representations.
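A minimal sketch of this retrieval step, assuming the embeddings have already been computed and that cosine similarity is the scoring function (the article does not name the similarity measure). The function name `retrieve` and the `top_k` parameter are hypothetical.

```python
import numpy as np

def retrieve(query_emb, gallery_emb, top_k=5):
    """Rank gallery items by cosine similarity to a single query embedding.

    query_emb: 1-D embedding of the query (a sentence or an image).
    gallery_emb: 2-D array with one candidate embedding per row.
    Returns the indices of the top_k candidates and their scores.
    """
    q = np.asarray(query_emb, dtype=float)
    g = np.asarray(gallery_emb, dtype=float)
    q = q / np.linalg.norm(q)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    scores = g @ q                        # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

if __name__ == "__main__":
    gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    query = np.array([1.0, 0.1])
    order, scores = retrieve(query, gallery, top_k=3)
    print(order, scores)
```

The same function serves both directions: a sentence embedding can query a gallery of image embeddings, or an image embedding a gallery of sentence embeddings, because both live in the common space.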

Experiments

We evaluate GXN on the MSCOCO and Flickr30K benchmarks. Compared with current state‑of‑the‑art methods, GXN achieves superior retrieval performance, confirming the effectiveness of the generative “Imagine” step.
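The article does not state which metric is reported. Retrieval on these benchmarks is commonly scored with Recall@K, the fraction of queries whose correct match appears among the top K results, so the sketch below assumes that metric. It also assumes a simplified one‑to‑one ground truth in which query i matches gallery item i; the function name is hypothetical.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose matched item ranks within the top k.

    sim[i, j] is the similarity of query i to gallery item j, and the
    ground-truth match of query i is gallery item i (a simplification).
    """
    sim = np.asarray(sim, dtype=float)
    pos = np.diag(sim)
    # Zero-based rank: number of gallery items scored strictly above the match.
    ranks = (sim > pos[:, None]).sum(axis=1)
    return float(np.mean(ranks < k))

if __name__ == "__main__":
    sim = np.array([
        [0.9, 0.1, 0.2],   # query 0: match ranked first
        [0.8, 0.3, 0.1],   # query 1: match ranked second
        [0.2, 0.1, 0.7],   # query 2: match ranked first
    ])
    print(recall_at_k(sim, 1), recall_at_k(sim, 2))
```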

Experimental results

Conclusion

By integrating image‑to‑text and text‑to‑image generative models into traditional cross‑modal representation learning, GXN captures both high‑level abstract semantics and low‑level detailed features, substantially outperforming existing approaches.


Tags: artificial intelligence, Deep Learning, Generative Models, cross-modal retrieval, visual-text matching