How 58’s Multi‑Label Image Recognition Boosts Semantic Search and Recommendations

This article details the design, data pipeline, model architecture, loss functions, and evaluation metrics of a large‑scale multi‑label image classification system built for 58.com, showing how it improves semantic similarity detection, recommendation, and content moderation across diverse business domains.

ITPUB

Problem Overview

Traditional image‑classification models predict a single label per image, which is insufficient for real‑world pictures that often contain multiple objects. Multi‑label image classification predicts all present categories, providing richer semantic information for downstream services such as recommendation, ad placement, and illegal‑content detection.

Technical Background

Open Datasets

MS‑COCO – 80 classes, >200k annotated images.

PASCAL VOC 2012 – 20 classes, >11k annotated images.

Open Images V6 – 19,958 classes, >9.1M annotated images (mostly machine‑labeled).

NUS‑WIDE – 5,018 user tags (81 ground‑truth concepts), >269k images.

Evaluation Metrics

Mean Average Precision (mAP) : average of per‑class Average Precision (AP) over C classes. Both macro‑averaged and micro‑averaged variants are used.

Hamming Loss : fraction of image‑label pairs that are predicted incorrectly, i.e. false positives plus false negatives divided by (number of images × number of classes). A value of 0 indicates perfect prediction.
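Both metrics can be illustrated on a toy example. The sketch below is a minimal pure‑Python version, assuming non‑interpolated AP, macro averaging over classes, and a 0.5 decision threshold for Hamming loss; the article does not state which AP variant or threshold was used, and the function names are illustrative.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class from per-image scores and 0/1 labels."""
    # Rank images from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(score_rows, label_rows):
    """Macro mAP: unweighted mean of per-class AP (one column per class)."""
    num_classes = len(label_rows[0])
    aps = [
        average_precision([r[c] for r in score_rows], [r[c] for r in label_rows])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


def hamming_loss(pred_rows, label_rows):
    """Fraction of (image, class) entries where prediction and label differ."""
    wrong = sum(p != y for pr, lr in zip(pred_rows, label_rows) for p, y in zip(pr, lr))
    total = sum(len(lr) for lr in label_rows)
    return wrong / total


scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]  # 3 images x 2 classes
labels = [[1, 0], [0, 1], [1, 1]]
preds = [[int(s >= 0.5) for s in row] for row in scores]  # assumed 0.5 threshold

print("mAP:", mean_average_precision(scores, labels))
print("Hamming loss:", hamming_loss(preds, labels))
```

Note that mAP is computed from the ranked scores, while Hamming loss depends on the thresholded predictions, so the two can move in different directions when the threshold changes.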

Representative Algorithms

CNN+LSTM (Regional Latent Semantic Dependencies) : Region proposals from an RPN are encoded and fed to an LSTM to model inter‑region interactions before classification (Zhang et al., IEEE Trans. Multimedia).

Cross‑Modality Attention with Semantic Graph : A class‑adjacency graph generates semantic embeddings; CNN feature maps are fused with these embeddings via attention to produce per‑class activation maps (AAAI 2020).

Graph Convolutional Networks (GCN) : A GCN learns a binary classifier for each class and applies it to CNN features, explicitly modeling label co‑occurrence (CVPR 2019).

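Asymmetric Loss (ASL) : Addresses the extreme positive/negative label imbalance of multi‑label data by weighting positive and negative labels differently and by filtering low‑confidence negatives (negatives whose predicted probability falls below a small margin contribute nothing to the loss), improving precision without extra computation (Alibaba, 2020).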
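To make the two ASL mechanisms concrete, here is a minimal pure‑Python sketch of the per‑image loss, assuming the formulation from the ASL paper: positives use a focal‑style term with exponent `gamma_pos`, while negatives use a shifted probability `p_m = max(p - margin, 0)` with a larger exponent `gamma_neg`. The function name and the default values are illustrative assumptions, not the settings used at 58.com.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sum of per-label ASL terms for one image.

    probs   : sigmoid outputs in [0, 1], one per class
    targets : 0/1 ground-truth labels, one per class
    """
    loss = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive label: focal-style term with its own exponent.
            loss -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin, so
            # negatives with p <= margin are dropped from the loss entirely.
            p_m = max(p - margin, 0.0)
            loss -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return loss


# An easy negative (p below the margin) contributes nothing,
# while a harder negative is still penalized.
print(asymmetric_loss([0.03], [0]))
print(asymmetric_loss([0.55], [0]))
print(asymmetric_loss([0.90], [1]))
```

With `gamma_pos` smaller than `gamma_neg`, the many negative labels per image are down‑weighted more aggressively than the few positives, which is the asymmetry the loss is named for.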

58.com Multi‑Label Solution

Data Construction Pipeline

Run a pre‑trained Open‑Images model on single‑label and unlabeled images to obtain initial multi‑label predictions.

Discard rare tags, i.e. tags whose image count is below 0.5‰ of the total.

Compute a co‑occurrence matrix; remove tags with correlation below 0.01 (low semantic overlap).

Merge remaining tags semantically using WordNet.

Within three weeks, this process produced a dataset of 5 million images with ~720 classes spanning 17 high‑level categories, at far lower cost than manual annotation.
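The two filtering steps can be sketched as follows. This is a simplified illustration, not the production pipeline: the article does not define how "correlation" is computed, so the sketch assumes it is the highest conditional co‑occurrence rate a tag has with any other kept tag. The function name `filter_tags` and its parameters are hypothetical, and the WordNet merge step is left out.

```python
from collections import Counter
from itertools import combinations


def filter_tags(image_tags, min_freq=0.0005, min_corr=0.01):
    """Return the set of tags that survive the rarity and co-occurrence filters.

    image_tags : list of tag lists, one per image (model-predicted tags)
    min_freq   : minimum share of images a tag must appear in (0.5 per mille)
    min_corr   : minimum co-occurrence rate with at least one other tag
    """
    n_images = len(image_tags)
    counts = Counter(tag for tags in image_tags for tag in set(tags))

    # Step 1: drop rare tags.
    kept = {t for t, c in counts.items() if c / n_images >= min_freq}

    # Step 2: count how often each pair of kept tags appears together.
    pair_counts = Counter()
    for tags in image_tags:
        present = sorted(set(tags) & kept)
        pair_counts.update(combinations(present, 2))

    # Assumed correlation score: max over partners of P(partner | tag).
    best = {t: 0.0 for t in kept}
    for (a, b), c in pair_counts.items():
        best[a] = max(best[a], c / counts[a])
        best[b] = max(best[b], c / counts[b])

    return {t for t in kept if best[t] >= min_corr}


# Toy data with loosened thresholds so the effect is visible on 4 images.
toy = [["car", "road"], ["car", "road"], ["car"], ["cat"]]
print(filter_tags(toy, min_freq=0.3, min_corr=0.5))
```

In the toy data, "cat" is removed either for being rare or, with a lower `min_freq`, for never co‑occurring with another tag.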

Model Optimization

We adopted the ASL framework:

Replace the final SoftMax layer with a Sigmoid layer of size equal to the total number of classes.

Swap the standard Focal loss for the ASL loss, which (1) weights positive and negative samples differently and (2) filters low‑confidence negatives.
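The reason for replacing SoftMax with Sigmoid can be seen on a single set of logits. In the sketch below (class names, logits, and the 0.5 threshold are made up for illustration), SoftMax forces the class probabilities to sum to 1 and so makes the classes compete, while Sigmoid scores each class independently and lets several tags pass the threshold at once.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Standard softmax with max-subtraction for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_tags(logits, class_names, threshold=0.5):
    """Multi-label decision: keep every class whose sigmoid score passes the threshold."""
    return [name for z, name in zip(logits, class_names) if sigmoid(z) >= threshold]


class_names = ["sofa", "lamp", "car"]  # hypothetical classes
logits = [2.0, 1.5, -3.0]              # hypothetical network outputs

print([round(p, 3) for p in softmax(logits)])           # sums to 1: classes compete
print([round(sigmoid(z), 3) for z in logits])           # independent per-class scores
print(predict_tags(logits, class_names))
```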

Additional engineering tweaks:

Backbone changed from ResNet‑101 to ResNeXt‑50 for faster inference.

Global Average Pooling replaced by Global Max Pooling, yielding further mAP gains on PASCAL‑VOC.
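The difference between the two pooling choices is easiest to see on a tiny feature map. The sketch below uses plain nested lists (channels × height × width) with made‑up values. One common intuition, which the article itself does not spell out, is that a small object produces a strong but localized activation that averaging dilutes and max pooling preserves.

```python
def global_avg_pool(feature_map):
    """Average each channel's H x W activations into a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def global_max_pool(feature_map):
    """Keep only the strongest activation in each channel."""
    return [max(max(row) for row in channel) for channel in feature_map]


# Two channels of a 2 x 2 feature map (illustrative values).
feature_map = [
    [[0, 0], [0, 8]],  # one strong, localized response (e.g. a small object)
    [[1, 1], [1, 1]],  # weak, uniform response
]

print(global_avg_pool(feature_map))  # the localized peak is diluted
print(global_max_pool(feature_map))  # the localized peak is preserved
```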

Experiments on internal validation sets and public benchmarks (MS‑COCO, PASCAL‑VOC) show ASL outperforms standard Cross‑Entropy and Focal loss.

Output Schemes

Two inference outputs are provided:

Multi‑label tag list for downstream recommendation and similarity‑based ranking.

2048‑dimensional feature embedding for nearest‑neighbor search, anomaly detection, and cross‑modal retrieval.

Both outputs can be used independently or jointly depending on the application.
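As an illustration of the embedding output, the sketch below runs a brute‑force nearest‑neighbor lookup. It assumes cosine similarity as the distance measure, which the article does not specify, and uses 3‑dimensional toy vectors in place of the 2048‑dimensional embeddings; the image IDs and function names are hypothetical. A production system would typically use an approximate nearest‑neighbor index instead of a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(query, index, k=2):
    """Return the IDs of the k indexed images most similar to the query embedding."""
    scored = [(cosine_similarity(query, vec), image_id) for image_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:k]]


# Toy index: image ID -> embedding (3-d stand-ins for 2048-d vectors).
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}

print(nearest_neighbors([1.0, 0.0, 0.0], index, k=2))
```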

Results and Observations

Switching to ResNeXt‑50 reduced latency by ~30 % while keeping the mAP drop below 1 % on PASCAL‑VOC. Replacing average pooling with max pooling added ~0.5 % mAP on both PASCAL‑VOC and MS‑COCO. ASL consistently delivered higher precision and lower Hamming loss than Cross‑Entropy and Focal loss across all test sets.

References

Zhang J et al., “Multi‑Label Image Classification with Regional Latent Semantic Dependencies,” IEEE Trans. Multimedia, 2016.

You R et al., “Cross‑Modality Attention with Semantic Graph Embedding for Multi‑Label Classification,” AAAI, 2020.

Chen Z M et al., “Multi‑Label Image Recognition With Graph Convolutional Networks,” CVPR, 2019.

Ben‑Baruch E et al., “Asymmetric Loss For Multi‑Label Classification,” 2020.
