How FashionKLIP Boosts E‑Commerce Image‑Text Retrieval with a Multimodal Knowledge Graph

This ACL 2023 paper introduces FashionKLIP, an e‑commerce vision‑language model enhanced by a multimodal concept knowledge graph. It describes the automated construction of that graph and the dual‑stream training strategy, and reports better results than state‑of‑the‑art methods on the FashionGen retrieval benchmark.

Alibaba Cloud Big Data AI Platform

Jul 11, 2023

Alibaba Cloud's Machine Learning Platform PAI, together with Prof. Xiao Yanghua's team at Fudan University and the Alibaba International Trade Division, presented FashionKLIP at ACL 2023. FashionKLIP is a vision‑language model that uses a multimodal e‑commerce concept knowledge graph to improve image‑text retrieval.

Background

Image‑text retrieval is a popular cross‑modal task with strong industrial value. While vision‑language pre‑training (VLP) models have advanced representation learning, e‑commerce data pose unique challenges: textual descriptions often consist of short attribute phrases, and product images typically contain a single item with minimal background.

Model Design

FashionKLIP consists of two main components: (1) an automated pipeline that builds the FashionMMKG, a multimodal concept knowledge graph extracted from large‑scale e‑commerce image‑text data; and (2) a training strategy that injects this knowledge into a dual‑stream VLP model to align image and text representations at the concept level.

FashionMMKG Construction

The knowledge graph is built automatically and includes both textual and visual modalities.

Text modality: Massive fashion texts are mined to determine a set of concepts. Using spaCy for syntactic parsing and POS tagging, multi‑granular concept phrases are extracted. Hierarchical “is‑a” triples are formed (e.g., <"short sleeve t‑shirt in white", is‑a, "short sleeve t‑shirt">) and organized into a dynamic tree that can be expanded with new concepts.
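The hierarchy step can be illustrated with a minimal sketch. It assumes the concept phrases have already been extracted (the spaCy parsing step is not shown), and the rule used to find a parent concept is a hypothetical simplification chosen for illustration, not the paper's exact procedure.

```python
# Hypothetical parent rule, for illustration only:
#   1. drop a trailing "in ..." / "with ..." modifier, otherwise
#   2. fall back to the head noun (the last word of the phrase).
TRAILING_PREPOSITIONS = (" in ", " with ")


def parent_of(concept):
    """Return a more general concept phrase, or None for a single-word root."""
    for prep in TRAILING_PREPOSITIONS:
        idx = concept.rfind(prep)
        if idx > 0:
            return concept[:idx]
    words = concept.split()
    if len(words) > 1:
        return words[-1]
    return None


def add_concept(parents, concept):
    """Insert a concept and any missing ancestors; return the new is-a triples.

    `parents` maps each known concept to its parent (None marks a root),
    so the tree grows whenever an unseen concept arrives.
    """
    triples = []
    while concept not in parents:
        parent = parent_of(concept)
        parents[concept] = parent
        if parent is None:
            break
        triples.append((concept, "is-a", parent))
        concept = parent
    return triples


parents = {}
new_triples = add_concept(parents, "short sleeve t-shirt in white")
for triple in new_triples:
    print(triple)

# A later concept that shares ancestors only adds the missing edge.
more_triples = add_concept(parents, "short sleeve t-shirt in black")
print(more_triples)
```

Storing the tree as a child-to-parent map keeps insertion cheap and makes the "dynamic" expansion explicit: only concepts that are not yet in the map produce new triples.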

Visual modality: For each concept, a prompt‑based image retrieval step selects the top‑k images with the highest cosine similarity to the concept's textual embedding. A Maximal Marginal Relevance (MMR) algorithm then keeps the chosen visual prototypes diverse, and the prototypes are updated iteratively during training.
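The prototype selection can be sketched as a greedy MMR loop over candidate image embeddings. This is a generic MMR implementation in plain Python, not the authors' code; the trade-off parameter `lam`, the toy two-dimensional embeddings, and the function names are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mmr_select(query, candidates, k, lam=0.7):
    """Greedy MMR: return indices of k candidates, trading relevance to the
    query against redundancy with the candidates already selected."""
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the concept's text embedding and three candidate image embeddings.
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but distinct.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]

diverse = mmr_select(query, candidates, k=2, lam=0.5)      # [0, 2]
by_relevance = mmr_select(query, candidates, k=2, lam=1.0)  # [0, 1]
print(diverse, by_relevance)
```

With `lam=1.0` the selection reduces to plain top‑k by cosine similarity, so the near-duplicate is kept. Lowering `lam` penalizes redundancy and swaps it for the more distinct image.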

FashionKLIP Training

During preprocessing, input texts are parsed for concepts; unmatched new concepts automatically expand the knowledge graph. The model adopts a dual‑stream architecture with separate image and text encoders.

Image‑Text Contrastive (ITC) learning: A CLIP‑style objective aligns global image and text representations by minimizing a contrastive loss in both the image‑to‑text and text‑to‑image directions.
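A CLIP‑style symmetric contrastive loss can be written in a few lines. The sketch below uses plain Python lists instead of a tensor library, assumes L2‑normalized embeddings where row i of each list forms a matched pair, and uses a temperature of 0.07 as an illustrative value; none of these details are taken from the FashionKLIP implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    n = len(image_embs)
    # logits[i][j]: similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: each image should rank its own text first.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: same computation over the transposed logits.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = itc_loss(images, matched_texts)
misaligned = itc_loss(images, swapped_texts)
print(aligned, misaligned)
```

The loss is close to zero when each image is most similar to its own text and grows large when the pairs are swapped, which is the behavior the objective is meant to reward.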

Concept‑Visual Alignment (CVA) learning: Multi‑granular concept phrases from the text are mapped to their visual prototypes in FashionMMKG. For each concept, the top‑k most similar images are selected, and a cross‑entropy loss weighted by their similarity scores encourages alignment between concept embeddings and visual prototypes.
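One possible reading of the weighted cross‑entropy is sketched below for a single concept. The summary above does not specify the exact formulation, so everything here is an assumption: the candidate pool, the normalization of the weights, the temperature, and all function and variable names are hypothetical.

```python
import math


def cva_loss(concept_emb, image_embs, prototype_ids, scores, temperature=0.07):
    """Weighted cross-entropy for one concept (illustrative assumption).

    image_embs:    candidate image embeddings (prototypes plus other images)
    prototype_ids: indices of the concept's top-k visual prototypes
    scores:        similarity score of each prototype, used as its weight
    """
    logits = [
        sum(a * b for a, b in zip(concept_emb, img)) / temperature
        for img in image_embs
    ]
    # Log of the softmax normalizer over all candidates.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # One cross-entropy term per prototype, scaled by its normalized score.
    total = sum(scores)
    return -sum(
        (s / total) * (logits[i] - log_z) for i, s in zip(prototype_ids, scores)
    )


# Toy candidates: images 0 and 1 are the concept's prototypes, image 2 is not.
image_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
prototype_ids = [0, 1]
scores = [1.0, 0.8]

near = cva_loss([1.0, 0.0], image_embs, prototype_ids, scores)
far = cva_loss([0.0, 1.0], image_embs, prototype_ids, scores)
print(near, far)
```

Under this reading, a concept embedding that points toward its prototypes gets a lower loss than one that points toward an unrelated image, and prototypes with higher similarity scores contribute more to the gradient.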

Model Evaluation

FashionKLIP was evaluated on the FashionGen benchmark using both “sample” and “full” settings. In both cases, it outperformed existing state‑of‑the‑art models on e‑commerce image‑text retrieval.

Additional zero‑shot experiments on a product search platform showed superior retrieval performance compared to baseline CLIP models.

The authors plan to release the FashionKLIP code and models in the EasyNLP framework, encouraging the community to apply knowledge‑enhanced pre‑training to other multimodal tasks.

EasyNLP repository: https://github.com/alibaba/EasyNLP
