Can Generative Models Boost Visual‑Text Retrieval? Introducing GXN

This paper presents GXN, a generative cross‑modal feature learning framework that enhances image‑text retrieval by incorporating both high‑level semantic similarity and fine‑grained local matching through a three‑step Look‑Imagine‑Match process, achieving state‑of‑the‑art results on MSCOCO and Flickr30K.

Alibaba Cloud Developer

Jul 19, 2018

Introduction

Heterogeneous data such as text and images is growing explosively, which poses new challenges for search. Conventional visual‑text cross‑modal methods encode each modality separately, map both into a shared space, and optimize a ranking loss that places matching image‑text pairs closer together than mismatched ones.
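To make the ranking objective concrete, here is a minimal sketch of a bidirectional hinge ranking loss over a small batch. It is an illustration under our own assumptions, not the exact loss used in GXN: the cosine similarity, the margin value of 0.2, and the function names `cosine` and `ranking_loss` are all hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(image_embs, text_embs, margin=0.2):
    """Bidirectional hinge ranking loss (assumed form, for illustration).

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(image_embs)
    sims = [[cosine(image_embs[i], text_embs[j]) for j in range(n)]
            for i in range(n)]
    loss = 0.0
    for i in range(n):
        pos = sims[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i as query: a wrong caption j should score below the right one.
            loss += max(0.0, margin - pos + sims[i][j])
            # Caption i as query: a wrong image j should score below the right one.
            loss += max(0.0, margin - pos + sims[j][i])
    return loss

# Perfectly aligned pairs incur no loss; swapped pairs are penalized.
aligned = ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is zero once every matching pair beats every mismatched pair by at least the margin, which is why it captures global alignment but says little about fine-grained detail.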

Although this high‑level semantic alignment works well, it neglects local similarities, such as color, texture, and layout details in images or sentence‑level nuances in text. Inspired by how a skilled painter or writer can “imagine” the expected counterpart of what they read or see, we propose a generative cross‑modal feature learning framework (GXN) that adds an “Imagine” step between looking and matching.

Method

GXN consists of three modules:

Multimodal feature representation (upper region): an image encoder and two sentence encoders map visual and textual data into a common space. The two sentence encoders learn features at different levels: one captures high‑level semantics, while the other captures local, sentence‑level details that are learned through the generative models.

Image‑to‑text generation (blue path): a visual encoder feeds a sentence decoder that generates textual descriptions from visual features. The training loss incorporates reinforcement‑learning‑style rewards that maximize the similarity between generated and ground‑truth sentences.

Text‑to‑image generative adversarial learning (green path): a generator creates images from textual features, while a discriminator distinguishes generated images from real ones.
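The two generative paths can be sketched with textbook loss terms. This is a rough illustration only: the article does not give the exact objectives, so we assume a REINFORCE-style reward-weighted log-likelihood for caption generation and the standard binary cross-entropy GAN losses for image generation. All function names, the scalar `reward`, and the `baseline` argument are hypothetical.

```python
import math

def reward_weighted_nll(token_log_probs, reward, baseline=0.0):
    """REINFORCE-style loss for one sampled caption (assumed form).

    token_log_probs: log-probabilities the decoder gave to the sampled tokens.
    reward: similarity between the sampled and ground-truth caption.
    baseline: reference reward subtracted to reduce gradient variance.
    """
    return -(reward - baseline) * sum(token_log_probs)

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real images should score near 1, generated near 0."""
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: generated images should score near 1."""
    return -math.log(d_fake + eps)

# A caption with above-baseline reward yields a positive loss to minimize,
# which raises the probability of its sampled tokens.
caption_loss = reward_weighted_nll([-0.5, -0.5], reward=1.0)
```

In a setup like this, the gradients of both generative losses flow back into the shared encoders, which is one plausible way the detailed features described above could be shaped by generation.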

During inference, only the learned cross‑modal features are needed; similarity between image and text representations is computed to perform retrieval.
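As a minimal sketch of this inference step, the snippet below ranks candidate embeddings against a query embedding. We assume cosine similarity as the scoring function and non-zero vectors; the article does not specify the similarity measure, and the names `l2_normalize` and `retrieve` are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query_emb, candidate_embs, top_k=5):
    """Return indices of the top_k candidates ranked by cosine similarity."""
    q = l2_normalize(query_emb)
    scores = []
    for idx, cand in enumerate(candidate_embs):
        c = l2_normalize(cand)
        scores.append((sum(a * b for a, b in zip(q, c)), idx))
    # Highest similarity first; ties keep their original order.
    scores.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scores[:top_k]]

# A text query embedding matched against three image embeddings.
ranked = retrieve([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]], top_k=2)
```

The same function serves both directions: pass an image embedding as the query and sentence embeddings as candidates for image-to-text retrieval, or the reverse for text-to-image retrieval.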

Experiments

We evaluate GXN on the MSCOCO and Flickr30K benchmarks. Compared with current state‑of‑the‑art methods, GXN achieves superior retrieval performance, confirming the effectiveness of the generative “Imagine” step.

Conclusion

By integrating image‑to‑text and text‑to‑image generative models into traditional cross‑modal representation learning, GXN captures both high‑level abstract semantics and low‑level detailed features, substantially outperforming existing approaches.
