
How FashionBERT Boosts E‑Commerce Image‑Text Matching with Patch Embeddings

This article introduces FashionBERT, a multimodal BERT‑based model that replaces ROI‑based image tokens with uniform image patches to address challenges specific to e‑commerce. It covers the model's architecture, its adaptive loss balancing, its deployment in Alibaba search, and the gains reported on public and internal datasets.

Alibaba Cloud Developer

Jun 2, 2020


Background

With the rapid development of web technologies, massive amounts of multimodal data (text, images, audio, video) now appear online. Matching text and images is a fundamental research problem with applications such as cross‑modality retrieval, image captioning, visual question answering, and visual commonsense reasoning. Most academic work focuses on generic domains, whereas e‑commerce needs dedicated multimodal matching models.

Multimodal Matching Research History

Early methods centered on Canonical Correlation Analysis (CCA) and Visual Semantic Embedding (VSE). Later, deep‑learning‑based approaches like DCCA improved performance, and VSE evolved into methods such as SCAN and PFAN. Since 2019, researchers have applied large‑scale pre‑training (e.g., ViLBERT) to align image and text embeddings.

Challenges of ROI‑Based Methods in E‑Commerce

Applying ROI‑based models such as ViLBERT to e‑commerce runs into three main issues: (1) product images yield few ROIs, (2) ROIs are coarse, capturing only object‑level regions, and (3) many ROIs are noisy (e.g., the fashion model's head or hair) and irrelevant to product matching. The reported statistics show roughly 19.8 ROIs extracted per image on generic datasets, versus only about 6.4 on e‑commerce images.

FashionBERT Image‑Text Matching Model

FashionBERT replaces ROI tokens with uniformly sized image patches (8×8 grid). Each patch is processed by a ResNet to obtain a 2048‑dimensional feature, and 10% of patches are randomly masked during pre‑training. Text tokens use Whole Word Masking, and segment IDs distinguish text ("T") and image patches ("I").

Architecture components:

Text Embedding: The same as in the original BERT, with Whole Word Masking applied.

Patch Embedding: The image is split into 64 patches, each patch is encoded by a ResNet, and the features of masked patches are replaced by zeros.

Cross‑modality FashionBERT: A pre‑trained BERT serves as the backbone, enabling deep fusion of text and patch tokens.
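The patch pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the image is a toy 16×16 grid of integers, and the ResNet encoder is replaced by simply flattening each patch, so only the 8×8 split, the 10% masking with zeros, and the "T"/"I" segment IDs follow the description.

```python
import random

def split_into_patches(image, grid=8):
    """Split an H x W image (list of rows) into grid*grid equal patches, row-major."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by the grid size")
    ph, pw = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = image[gy * ph:(gy + 1) * ph]
            patches.append([row[gx * pw:(gx + 1) * pw] for row in rows])
    return patches

def choose_masked(num_patches, ratio=0.10, seed=0):
    """Pick the patch indices to mask (10% of patches, per the article)."""
    rng = random.Random(seed)
    count = max(1, round(num_patches * ratio))
    return set(rng.sample(range(num_patches), count))

def mask_features(features, masked):
    """Replace the feature vector of every masked patch with zeros."""
    return [[0.0] * len(f) if i in masked else f for i, f in enumerate(features)]

def segment_ids(num_text_tokens, num_patches):
    """Segment IDs: "T" for text tokens, "I" for image patches."""
    return ["T"] * num_text_tokens + ["I"] * num_patches

# Toy 16x16 "image", so each of the 64 patches is 2x2 pixels.
image = [[y * 16 + x for x in range(16)] for y in range(16)]
patches = split_into_patches(image)

# Assumption: flattening stands in for the ResNet feature extractor.
features = [[float(v) for row in p for v in row] for p in patches]

masked = choose_masked(len(patches))
masked_features = mask_features(features, masked)
segments = segment_ids(num_text_tokens=5, num_patches=len(patches))
```

In a real setup each patch would be resized and passed through the ResNet to get its 2048‑dimensional feature before masking; the order of operations here is otherwise the same.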

Pre‑training Tasks

FashionBERT is trained with three objectives:

Masked Language Modeling (MLM) – predict masked text tokens.

Masked Patch Modeling (MPM) – reconstruct masked image patches using KL‑divergence loss.

Text‑Image Alignment – predict whether a text‑image pair matches, analogous to Next Sentence Prediction.
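As a rough illustration of the Masked Patch Modeling objective, the sketch below computes a KL‑divergence loss over masked positions only. It rests on assumptions the article does not spell out: feature vectors are turned into distributions with a softmax, the direction is KL(target || predicted), and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def masked_patch_loss(target_features, predicted_features, masked):
    """Average KL(target || predicted) over the masked patch positions only."""
    if not masked:
        return 0.0
    total = 0.0
    for i in masked:
        p = softmax(target_features[i])      # distribution from the original patch feature
        q = softmax(predicted_features[i])   # distribution from the model's output
        total += kl_divergence(p, q)
    return total / len(masked)

# Two toy patches with 3-dimensional features (real features are 2048-dimensional).
target = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
predicted = [[3.0, 2.0, 1.0], [0.5, 0.5, 0.5]]

loss_mismatch = masked_patch_loss(target, predicted, masked=[0])  # prediction differs
loss_match = masked_patch_loss(target, predicted, masked=[1])     # prediction is exact
```

The loss is zero when the predicted distribution equals the target and grows as they diverge, which is the property the pre‑training task relies on.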

To balance these tasks, an adaptive loss algorithm treats the task weights as optimization variables in their own right; the resulting problem has an analytical solution for the weights.
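In general form, the combined objective can be written as a weighted sum of the three task losses. The notation below is an assumption for illustration: $l_i$ stands for the MLM, MPM, and alignment losses, $w_i$ for their weights, and the normalization constraint is a common choice rather than something this article states. The specific problem that is solved for the weights is not reproduced here.

```latex
\mathcal{L}(\theta) = \sum_{i=1}^{3} w_i \, l_i(\theta),
\qquad \sum_{i=1}^{3} w_i = 1, \quad w_i \ge 0
```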

Business Application

FashionBERT is deployed in Alibaba’s multimodal vector search. For this setting the model undergoes continued pre‑training with three segment types: Query ("Q"), Title ("T"), and Image ("I"). A dual‑tower architecture with shared parameters lets queries be encoded online while products are encoded offline. The inputs are supplemented with co‑occurrence query features and enriched product semantics.
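The online/offline split of a shared‑parameter dual tower can be sketched as follows. Everything here is a hypothetical stand‑in: `encode` is a hashed bag‑of‑words toy rather than FashionBERT, products are represented by their titles only, and the brute‑force scan replaces a real vector index. What the sketch shows is that the same encoder serves both towers, that product vectors are computed once ahead of time, and that only the query is encoded at request time.

```python
import math
import zlib

DIM = 64  # toy embedding size, chosen arbitrarily for this sketch

def encode(text):
    """Hypothetical shared encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(products):
    """Offline step: encode every product once and keep the vectors."""
    return {pid: encode(title) for pid, title in products.items()}

def search(query, index, top_k=2):
    """Online step: encode the query, then rank products by dot product.

    The vectors are unit length, so the dot product equals cosine similarity.
    """
    q = encode(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), pid)
              for pid, vec in index.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]

# Made-up catalogue for illustration only.
products = {
    "p1": "red cotton dress",
    "p2": "leather office chair",
    "p3": "steel water bottle",
}
index = build_index(products)                      # done ahead of time
results = search("red cotton dress", index, top_k=1)  # done per request
```

Because both sides share one encoder, query and product vectors live in the same space, and the expensive product encoding never sits on the request path.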

Experimental Results

Using the public FashionGen dataset, FashionBERT outperforms state‑of‑the‑art methods (including ViLBERT) in both image‑text matching and cross‑modality retrieval. On an internal ICBU dataset, FashionBERT variants achieve higher accuracy and AUC than baseline BERT models, and a lightweight two‑layer version with Variable Sequence Length (VSL) meets online latency requirements.

Future Work

Planned improvements include multi‑scale image features, tighter text‑image alignment, incorporation of domain knowledge, and extension to video understanding.

