How 58’s Multi‑Label Image Recognition Boosts Semantic Search and Recommendations
This article details the design, data pipeline, model architecture, loss functions, and evaluation metrics of a large‑scale multi‑label image classification system built for 58.com, showing how it improves semantic similarity detection, recommendation, and content moderation across diverse business domains.
Problem Overview
Traditional image‑classification models predict a single label per image, which is insufficient for real‑world pictures that often contain multiple objects. Multi‑label image classification predicts all present categories, providing richer semantic information for downstream services such as recommendation, ad placement, and illegal‑content detection.
Technical Background
Open Datasets
MS‑COCO – 80 classes, >200k annotated images.
PASCAL VOC 2012 – 20 classes, >11k annotated images.
Open Images V6 – 19,958 classes, >9.1M annotated images (mostly machine‑labeled).
NUS‑WIDE – 5,018 classes, >269k images.
Evaluation Metrics
Mean Average Precision (mAP) : average of per‑class Average Precision (AP) over C classes. Both macro‑averaged and micro‑averaged variants are used.
Hamming Loss : fraction of image‑label pairs predicted incorrectly, i.e. (false positives + false negatives) divided by (number of images × number of classes). A value of 0 indicates perfect prediction.
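Both metrics can be sketched in a few lines of plain Python. This is a minimal illustration rather than the evaluation code used at 58.com: the function names, the 0/1 label matrices, and the macro‑averaged form of mAP are assumptions made for the example.

```python
from typing import List, Sequence


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of (image, class) slots where the predicted label is wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision@k taken at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(y_true: Sequence[Sequence[int]],
                           y_score: Sequence[Sequence[float]]) -> float:
    """Macro mAP: unweighted mean of per-class AP over the C classes."""
    num_classes = len(y_true[0])
    aps = [
        average_precision([row[c] for row in y_true],
                          [row[c] for row in y_score])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Toy data: 4 images, 3 classes (values are made up for illustration).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_score = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.3], [0.6, 0.4, 0.1], [0.3, 0.5, 0.9]]
y_pred = [[int(s >= 0.5) for s in row] for row in y_score]

print("Hamming loss:", hamming_loss(y_true, y_pred))
print("macro mAP:", mean_average_precision(y_true, y_score))
```

Note that mAP is computed from ranked scores, while Hamming loss depends on the decision threshold (0.5 here, chosen arbitrarily).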
Representative Algorithms
CNN+LSTM (Regional Latent Semantic Dependencies) : Region proposals from an RPN are encoded and fed to an LSTM to model inter‑region interactions before classification (IEEE Trans. Multimedia; arXiv preprint 2016).
Cross‑Modality Attention with Semantic Graph : A class‑adjacency graph generates semantic embeddings; CNN feature maps are fused with these embeddings via attention to produce per‑class activation maps (AAAI 2020).
Graph Convolutional Networks (GCN) : A GCN over the label graph maps label embeddings to a set of inter‑dependent per‑class classifiers, which are then applied to CNN image features, explicitly modeling label co‑occurrence (CVPR 2019).
Asymmetric Loss (ASL) : Addresses extreme positive/negative label imbalance by assigning different weights to positives and negatives and filtering low‑confidence negatives, improving precision without extra computation (Alibaba 2020).
58.com Multi‑Label Solution
Data Construction Pipeline
Run a pre‑trained Open‑Images model on single‑label and unlabeled images to obtain initial multi‑label predictions.
Discard tags whose image count is below 0.5‰ of the total (rare tags).
Compute a co‑occurrence matrix; remove tags whose correlation with other tags is below 0.01 (low semantic overlap).
Merge remaining tags semantically using WordNet.
Within three weeks, this process produced a dataset of 5 million images with ~720 classes spanning 17 high‑level categories, at far lower cost than manual annotation.
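The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the source does not define the correlation measure, so the sketch uses the strongest conditional co‑occurrence P(other tag | tag); `build_tag_vocabulary`, `merge_tags`, and the hand‑written `merge_map` (standing in for WordNet lookups) are hypothetical names.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Set


def build_tag_vocabulary(predictions: List[Set[str]],
                         min_freq: float = 0.0005,   # 0.5 per mille of all images
                         min_cooccur: float = 0.01   # assumed correlation floor
                         ) -> Set[str]:
    """Return the tags that survive the rare-tag and low-correlation filters."""
    n_images = len(predictions)
    counts = Counter(tag for tags in predictions for tag in tags)

    # Step 1: drop rare tags.
    kept = {t for t, c in counts.items() if c / n_images >= min_freq}

    # Step 2: count how often each pair of surviving tags appears together.
    pair_counts: Counter = Counter()
    for tags in predictions:
        for a, b in combinations(sorted(tags & kept), 2):
            pair_counts[(a, b)] += 1

    # Strongest conditional co-occurrence P(other | tag) seen for each tag.
    best: Dict[str, float] = {t: 0.0 for t in kept}
    for (a, b), c in pair_counts.items():
        best[a] = max(best[a], c / counts[a])
        best[b] = max(best[b], c / counts[b])

    # Step 3: drop tags that never co-occur meaningfully with any other tag.
    return {t for t in kept if best[t] >= min_cooccur}


def merge_tags(tags: Set[str], merge_map: Dict[str, str]) -> Set[str]:
    """Collapse synonyms; merge_map is a hypothetical stand-in for WordNet."""
    return {merge_map.get(t, t) for t in tags}


# Toy predictions from a pre-trained tagger (made-up data).
predictions = ([{"car", "road"}] * 60 + [{"sofa", "couch"}] * 30
               + [{"lamp"}] * 9 + [{"blur"}] * 1)
vocab = build_tag_vocabulary(predictions, min_freq=0.05)
print(sorted(merge_tags(vocab, {"couch": "sofa"})))
```

In the toy data, "blur" falls below the frequency floor and "lamp" never co‑occurs with another tag, so both are dropped before the synonym merge.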
Model Optimization
We adopted the ASL framework:
Replace the final SoftMax layer with a Sigmoid layer of size equal to the total number of classes.
Swap the standard Focal loss for the ASL loss, which (1) weights positive and negative samples differently and (2) filters low‑confidence negatives.
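A per‑image version of the loss can be sketched in plain Python, following the formulation in the ASL paper: separate focusing exponents for positive and negative labels, plus a probability margin that zeroes out easy negatives. The hyperparameter values below are illustrative defaults, not the settings used at 58.com, and the function names are assumptions for the example.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def asymmetric_loss(logits: Sequence[float],
                    targets: Sequence[int],
                    gamma_pos: float = 0.0,   # focusing exponent for positives
                    gamma_neg: float = 4.0,   # stronger down-weighting of negatives
                    margin: float = 0.05,     # probability shift for negatives
                    eps: float = 1e-8) -> float:
    """Sum of per-label ASL terms for one image (one sigmoid output per class)."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        if y == 1:
            # Positive label: focal-style term with its own exponent.
            loss -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative label: shift the probability down by the margin, so
            # negatives scored below the margin contribute exactly zero.
            p_m = max(p - margin, 0.0)
            loss -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return loss


# One image, four classes: one positive, two easy negatives, one hard negative.
print(asymmetric_loss([2.0, -5.0, -4.0, 3.0], [1, 0, 0, 0]))
```

With `gamma_pos = 0` the positive term reduces to ordinary binary cross‑entropy, while the large `gamma_neg` and the margin keep the many easy negatives from dominating the gradient.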
Additional engineering tweaks:
Backbone changed from ResNet‑101 to ResNeXt‑50 for faster inference.
Global Average Pooling replaced by Global Max Pooling, yielding further mAP gains on PASCAL‑VOC.
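The difference between the two pooling layers can be illustrated on a toy feature map. This sketch uses nested lists in place of tensors, and the numbers are made up; it shows one plausible intuition for the gain, not a verified explanation: a small object that fires in a single location is diluted by averaging but preserved by taking the maximum.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # [channel][row][column]


def global_avg_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over all spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def global_max_pool(fmap: FeatureMap) -> List[float]:
    """Keep only the strongest activation in each channel."""
    return [max(max(row) for row in ch) for ch in fmap]


fmap = [
    # Channel 0: one strong, localized activation (e.g. a small object).
    [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]],
    # Channel 1: weak activation spread over the whole image.
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]

print(global_avg_pool(fmap))  # both channels look identical after averaging
print(global_max_pool(fmap))  # the localized activation stands out
```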
Experiments on internal validation sets and public benchmarks (MS‑COCO, PASCAL‑VOC) show ASL outperforms standard Cross‑Entropy and Focal loss.
Output Schemes
Two inference outputs are provided:
Multi‑label tag list for downstream recommendation and similarity‑based ranking.
2048‑dimensional feature embedding for nearest‑neighbor search, anomaly detection, and cross‑modal retrieval.
Both outputs can be used independently or jointly depending on the application.
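One way the two outputs might be combined for similarity ranking is sketched below. The scoring rule (a weighted sum of embedding cosine similarity and tag‑set Jaccard overlap), the weight `alpha`, and the in‑memory `index` are assumptions made for illustration; the source does not describe how the outputs are fused, and a production system would typically use an approximate nearest‑neighbor index rather than a linear scan.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(query_vec: Sequence[float],
                 query_tags: Set[str],
                 index: Dict[str, Tuple[Sequence[float], Set[str]]],
                 alpha: float = 0.5,   # hypothetical embedding-vs-tag weight
                 top_k: int = 3) -> List[Tuple[str, float]]:
    """Linear scan over the index, scored by embedding and tag similarity."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * jaccard(query_tags, tags),
         image_id)
        for image_id, (vec, tags) in index.items()
    ]
    scored.sort(reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Toy index with 2-dimensional embeddings (the real ones are 2048-dimensional).
index = {
    "a": ([1.0, 0.0], {"sofa"}),
    "b": ([0.0, 1.0], {"car"}),
    "c": ([1.0, 1.0], {"sofa", "lamp"}),
}
print(rank_similar([1.0, 0.0], {"sofa"}, index))
```

Setting `alpha` to 1 or 0 recovers embedding‑only or tag‑only ranking, which corresponds to using either output independently.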
Results and Observations
Switching to ResNeXt‑50 reduced latency by ~30 % while keeping the mAP drop below 1 % on PASCAL‑VOC. Replacing average pooling with max pooling added ~0.5 % mAP on both PASCAL‑VOC and MS‑COCO. ASL consistently delivered higher precision and lower Hamming loss than Cross‑Entropy and Focal loss across all test sets.
References
Zhang J et al., “Multi‑Label Image Classification with Regional Latent Semantic Dependencies,” IEEE Trans. Multimedia, 2016.
You R et al., “Cross‑Modality Attention with Semantic Graph Embedding for Multi‑Label Classification,” AAAI, 2020.
Chen Z M et al., “Multi‑Label Image Recognition With Graph Convolutional Networks,” CVPR, 2019.
Ben‑Baruch E et al., “Asymmetric Loss For Multi‑Label Classification,” 2020.