Boosting Bag Item Identification with Metric Learning: A Zhuanzhuan Case Study

Zhuanzhuan’s in‑house “photo‑to‑SKU” system tackles large‑scale bag identification by combining dual‑stage object detection, metric‑learning‑based embedding training, and a hybrid vector‑plus‑scalar retrieval pipeline, achieving superior top‑K accuracy over third‑party solutions while addressing fine‑grained visual nuances and long‑tail SKU coverage.

Zhuanzhuan Tech

Background

On Zhuanzhuan’s second‑hand luxury‑goods platform, uploaded bag photos must be matched against a massive SKU catalog, both to standardize inventory and to power a user‑facing “search by image” experience. Manual matching cannot scale, so an AI‑driven “photo‑to‑SKU” capability was built.

Problem Definition

The task is an image‑retrieval problem: given one or more photos of a bag, retrieve the most similar SKU images from a large catalog and infer the corresponding SKU. Fine‑grained discrimination is needed for logo, shape, material, hardware, and color.

Technical Options

Two families of approaches exist: classification‑based models and retrieval‑based models. Classification does not scale to the million‑scale, long‑tail SKU space, so the solution adopts a retrieval architecture that embeds images into a high‑dimensional vector space and performs nearest‑neighbor search.

Metric Learning vs. Contrastive Learning

Both are representation‑learning paradigms. Metric learning uses explicit SKU labels to enforce tight intra‑class clusters and large inter‑class margins, which is essential for fine‑grained bag identification. Contrastive learning relies on self‑ or weak‑supervision and optimizes generic visual similarity. The system therefore chooses a metric‑learning‑centric pipeline.

System Architecture

The end‑to‑end flow consists of three core modules:

Subject detection & segmentation to isolate the bag from complex backgrounds.

Feature extraction via a dedicated embedding network that converts the cropped bag into a high‑dimensional vector (the “digital fingerprint”).

Hybrid retrieval that combines vector similarity (Faiss‑GPU) with scalar SKU attributes (category, brand, series) to return the top‑K most similar SKUs.
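The three modules above can be sketched as a minimal pipeline. Everything here is illustrative: `detect_and_crop` and `embed` are stubs standing in for the real detection and embedding models, and the metadata fields (`sku`, `brand`) are assumed names, not the actual schema.

```python
import numpy as np

def detect_and_crop(image):
    """Stage 1: isolate the bag from the background (stub: identity)."""
    return image

def embed(crop, dim=128):
    """Stage 2: map the crop to an L2-normalized embedding (stub: random)."""
    rng = np.random.default_rng(crop.size)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(query_vec, index_vecs, sku_meta, brand=None, k=4):
    """Stage 3: hybrid retrieval -- cosine similarity plus a scalar brand filter."""
    scores = index_vecs @ query_vec          # cosine: all vectors are unit-norm
    if brand is not None:                    # scalar attribute pre-filter
        mask = np.array([m["brand"] == brand for m in sku_meta])
        scores = np.where(mask, scores, -np.inf)
    top = np.argsort(-scores)[:k]
    return [(sku_meta[i]["sku"], float(scores[i])) for i in top
            if np.isfinite(scores[i])]
```

In production the brute‑force dot product in `retrieve` would be replaced by the Faiss index, but the scalar‑filter‑plus‑similarity logic is the same.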

Bag key parts

Metric‑Learning Training

Data Construction

Training data includes millions of SKU‑level images with multiple views per SKU. A “bad‑case‑driven” augmentation pipeline collects frequent mis‑classifications, analyzes failure causes (occlusion, lighting, angle), and generates targeted augmentations. Online hard‑sample mining is applied per batch to focus learning on confusing pairs. Anchor‑Positive‑Negative (APN) triplets are built, with negatives selected by similarity mining to strengthen fine‑grained discrimination.
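A minimal NumPy sketch of per‑batch hard mining (the “batch‑hard” variant), assuming L2‑normalized embeddings and that each SKU appears at least twice in the batch (PK‑style sampling); the production pipeline’s exact mining strategy is not specified in the article.

```python
import numpy as np

def batch_hard_triplets(embeddings, labels):
    """For each anchor, pick the hardest positive (least similar same-SKU
    sample) and the hardest negative (most similar different-SKU sample).
    Embeddings are assumed unit-norm, so dot product == cosine similarity."""
    sim = embeddings @ embeddings.T
    same = labels[:, None] == labels[None, :]
    eye = np.eye(len(labels), dtype=bool)

    # Hardest positive: same label, not self, minimum similarity.
    pos_sim = np.where(same & ~eye, sim, np.inf)
    hardest_pos = pos_sim.argmin(axis=1)

    # Hardest negative: different label, maximum similarity.
    neg_sim = np.where(~same, sim, -np.inf)
    hardest_neg = neg_sim.argmax(axis=1)
    return hardest_pos, hardest_neg
```

The returned indices define the Anchor‑Positive‑Negative triplets fed to the loss; negatives that are nearly as close as positives are exactly the confusing pairs the article describes.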

Model Tricks

Cosine similarity is used as the distance metric because it is scale‑invariant and works well with normalized embeddings.
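A two‑line check makes the scale invariance concrete: rescaling either vector leaves the cosine unchanged.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product of the two directions, magnitude ignored."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
assert abs(cosine(a, b) - 1.0) < 1e-9          # scale does not matter
assert abs(cosine(a, 3.7 * a) - cosine(a, a)) < 1e-9
```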

Learning‑rate decay: a high initial rate accelerates convergence, followed by a lower rate for fine‑tuning the embedding space.
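The article does not specify the exact schedule; a simple step‑decay sketch captures the idea of a high initial rate followed by progressively lower rates.

```python
def step_decay_lr(epoch, base_lr=1e-3, drop=0.1, every=10):
    """Start high for fast convergence, then multiply the rate by `drop`
    every `every` epochs for fine-tuning. All values are illustrative."""
    return base_lr * (drop ** (epoch // every))
```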

Loss functions such as Triplet Loss or Circle Loss directly optimize intra‑class compactness and inter‑class separation, which is more suitable than classification loss for a million‑class scenario.
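As a sketch of what such a loss optimizes, here is the triplet objective written out with cosine distance (d = 1 − cos) on unit vectors, matching the system’s similarity metric. PyTorch ships this as `nn.TripletMarginLoss` (Euclidean distance by default), so the cosine variant below is illustrative, not the article’s exact formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on cosine distance: push the anchor-negative distance to be
    at least `margin` larger than the anchor-positive distance.
    All three inputs are assumed L2-normalized."""
    d_ap = 1.0 - anchor @ positive
    d_an = 1.0 - anchor @ negative
    return max(0.0, d_ap - d_an + margin)
```

A zero loss means the triplet is already well separated; only violating triplets (positives farther than negatives, within the margin) produce gradients, which is why the hard‑mining step above matters.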

Engineering Optimizations

Inference is served with TorchServe, enabling multi‑instance parallelism and batch processing for high‑throughput online requests.

A Faiss‑GPU index is built offline for the million‑scale SKU vectors, delivering a 30‑50× speedup over CPU search.

Incremental index updates are performed on an hourly basis to incorporate newly added SKUs without rebuilding the entire index.

Cosine scores are normalized to the range [0,1]; a dynamic threshold can be tuned per business requirement.
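The retrieval and scoring steps can be illustrated without Faiss itself: on L2‑normalized vectors, an inner‑product index (Faiss’s `IndexFlatIP`) computes exactly cosine similarity, so a brute‑force NumPy version shows the same scoring path. Note the (cos + 1) / 2 mapping to [0, 1] is an assumption; the article does not state its normalization formula.

```python
import numpy as np

def search(query, index_vecs, k=4, threshold=0.5):
    """Brute-force equivalent of inner-product search over L2-normalized
    vectors. Cosine scores in [-1, 1] are mapped to [0, 1] via
    (cos + 1) / 2 -- an assumed normalization -- then thresholded."""
    cos = index_vecs @ query
    scores = (cos + 1.0) / 2.0
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top if scores[i] >= threshold]
```

Incremental hourly updates then correspond to appending new SKU vectors to the index (`index.add(...)` in Faiss) rather than rebuilding it, and the `threshold` parameter is the dynamic, per‑business knob mentioned above.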

Results and Deployment

The system is deployed in two scenarios: (1) internal SKU attachment in the inventory pipeline, and (2) the consumer‑facing “photo‑to‑SKU” feature in the mobile app. Internal evaluation shows a Top‑4 accuracy improvement of over 8% compared with third‑party baselines, especially on images with complex backgrounds.

SKU attachment UI
App photo‑to‑SKU UI

Future Improvements

Current limitations include the inability to infer bag scale without a reference object and insufficient modeling of extremely fine‑grained details such as hardware and texture. Planned work involves adding explicit scale references, enhancing local feature extraction (e.g., attention on hardware regions), and exploring multimodal retrieval that fuses textual and structured attribute information with visual embeddings.

Conclusion

The “photo‑to‑SKU” capability is realized by a detection‑segmentation front‑end, a metric‑learning‑driven embedding backbone, and a hybrid vector‑plus‑scalar retrieval engine. Metric learning provides the supervised margins needed for fine‑grained discrimination across a million‑scale, long‑tail SKU catalog, while engineering optimizations (TorchServe, Faiss‑GPU, incremental indexing) ensure production‑grade latency and scalability.

Tags: deep learning, embedding, metric learning, image retrieval, bag identification, visual similarity
Written by Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.