How Contrastive Learning Revolutionizes Product Term Prediction in E‑commerce
By leveraging contrastive learning and large-scale click-through data, this article details a dual-tower model that encodes product titles and queries into a shared vector space. It walks through the loss function, batch-negative sampling, and distributed-training tricks, and shows how the approach outperforms traditional NER for product-term and category prediction.
Problem Background
Traditional NER methods struggle to extract product terms from e‑commerce titles because many titles are short or omit the product word entirely, leading to noisy or missing predictions.
Why Contrastive Learning?
Inspired by OpenAI’s CLIP and Microsoft’s Turing Bletchley, contrastive learning learns joint representations of related objects (e.g., query and product) by pulling positive pairs together and pushing negatives apart. This self‑supervised approach works with weak signals such as click‑through logs.
Contrastive Learning Principle
The core idea is to map objects into a vector space where similarity reflects relevance. The loss used is InfoNCE, which computes a softmax over one positive and many negatives. Temperature τ controls the sharpness of the distribution.
Loss Function Details
For a batch, each sample is a tuple (Q, P) where Q is the query and P is the product info. The similarity sim(x, y) is usually cosine similarity. With InfoNCE, the denominator includes the positive pair and many negatives, often the rest of the batch (e.g., CLIP trains with a batch size of 32,768, so each positive is contrasted against 32,767 in-batch negatives).
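Written out, the standard InfoNCE objective for one query q with its clicked product p⁺ and N negatives p₁…p_N takes the following form (a reconstruction of the textbook formula, not one copied from the article):

```latex
\mathcal{L}_{q} \;=\; -\log
  \frac{\exp\!\left(\operatorname{sim}(q,\,p^{+}) / \tau\right)}
       {\exp\!\left(\operatorname{sim}(q,\,p^{+}) / \tau\right)
        \;+\; \sum_{i=1}^{N} \exp\!\left(\operatorname{sim}(q,\,p_{i}) / \tau\right)}
```

With batch-negative sampling (next section), the p_i are simply the products belonging to the other samples in the same batch.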
Batch‑Negative Sampling
Instead of pre‑computing many negatives per sample, the batch itself provides negatives: each other query’s positive becomes a negative for the current query. This dramatically increases negative count without extra forward passes.
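A minimal PyTorch sketch of this in-batch InfoNCE loss (my own illustration, not the article's code; the symmetric query→product and product→query form follows CLIP and may differ from what was actually deployed):

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb: torch.Tensor,
                      product_emb: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb, product_emb: [B, D] embeddings of aligned (Q, P) pairs,
    i.e. row i of each tensor comes from the same click-through pair.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(product_emb, dim=-1)

    # [B, B] similarity matrix; entry (i, j) = sim(query_i, product_j).
    logits = q @ p.t() / temperature

    # The diagonal holds the positives; every off-diagonal entry is a negative.
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: classify the right product for each query and vice versa.
    loss_q2p = F.cross_entropy(logits, labels)
    loss_p2q = F.cross_entropy(logits.t(), labels)
    return (loss_q2p + loss_p2q) / 2
```

With a per-device batch of B pairs, each query gets B − 1 negatives for free; the distributed-training section below shows how B is grown further across machines.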
Model Design
The final system uses a dual‑tower (siamese) architecture where the same Transformer encoder (6 layers, 512‑dim embeddings, sequence length 100) processes both query and product info. Parameters are fully shared, reducing memory. The encoder outputs are stored in a vector database for fast retrieval.
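A rough sketch of such a shared-parameter tower, assuming a plain nn.TransformerEncoder with the quoted sizes (6 layers, 512-dim, length 100); the tokenizer, attention-head count, and pooling strategy are not described in the article, so mean pooling over non-padding tokens is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTowerEncoder(nn.Module):
    """Single Transformer encoder shared by both towers (query and product)."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 num_layers: int = 6, max_len: int = 100):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # nhead=8 is an assumption; the article only quotes layers/dim/length.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [B, L] padded with 0; returns [B, d_model] unit vectors.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        pad_mask = token_ids.eq(0)                       # True where padding
        h = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over real tokens, then L2-normalize for cosine similarity.
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return F.normalize(pooled, dim=-1)

# Because parameters are fully shared, the same module encodes both sides:
# q_emb = encoder(query_ids); p_emb = encoder(product_ids)
```

Full weight sharing keeps query and product embeddings in the same space, which is what allows product vectors to be pre-computed offline and stored in the vector database for nearest-neighbour lookup at serving time.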
Training Tricks
Temperature Hyper‑parameter
A constant τ = 0.1 is used after experiments showed that learning τ leads to overly small values that hurt representation quality.
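For contrast, a tiny illustration of the two options (variable names are mine): CLIP learns a log-scale temperature with an upper clamp, while the setup described here simply hard-codes τ.

```python
import torch
import torch.nn as nn

# Fixed temperature, as used here.
TAU = 0.1
# logits = sim_matrix / TAU

# CLIP-style learnable temperature: parameterized in log space and clamped
# so the effective temperature cannot shrink uncontrollably during training.
logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))
# logits = sim_matrix * logit_scale.exp().clamp(max=100.0)
```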
Distributed Training
To increase batch size, training runs on multiple machines (e.g., 5 nodes). Each node computes its own query and product embeddings; an all_gather operation concatenates embeddings across nodes, forming a large similarity matrix (e.g., 3500 × 3500) for loss computation.
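A hedged sketch of this cross-node step with torch.distributed (the node count and the ~3500 × 3500 matrix come from the article; re-inserting the local tensor so gradients survive all_gather is a common pattern I am assuming, not something the article spells out):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_local_grad(local_emb: torch.Tensor) -> torch.Tensor:
    """all_gather embeddings from every rank, keeping autograd for the
    local shard (all_gather itself does not propagate gradients)."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    # Replace this rank's slot with the autograd-connected local tensor.
    gathered[dist.get_rank()] = local_emb
    return torch.cat(gathered, dim=0)

def distributed_info_nce(q_local, p_local, temperature=0.1):
    # q_local, p_local: [b, D] per-node embeddings, already L2-normalized.
    q_all = gather_with_local_grad(q_local)   # [world_size * b, D]
    p_all = gather_with_local_grad(p_local)
    logits = q_all @ p_all.t() / temperature  # e.g. roughly 3500 x 3500
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Each rank redundantly evaluates the full matrix, but gradients only flow back through its own embeddings; the usual DDP all-reduce then combines the per-rank parameter gradients.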
Memory‑Saving Techniques
ZeRO sharding: optimizer states, gradients, and parameters are partitioned across nodes, reducing per‑GPU memory.
Activation Checkpointing: only a subset of layer activations is stored; the rest are recomputed during back-propagation, cutting memory by ~60% at the cost of ~25% extra compute.
Mixed‑Precision Training: weights and activations use FP16 while updates stay in FP32, saving ~30% memory and speeding up training.
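A rough PyTorch sketch of how these three pieces can fit into the training loop (a simplified illustration under my own assumptions, not the article's code: `model` is the dual-tower encoder and `info_nce_in_batch` the loss from the earlier sketches, and a full ZeRO setup with gradient and parameter sharding would typically go through DeepSpeed or FairScale rather than this optimizer wrapper alone):

```python
import torch
from torch.utils.checkpoint import checkpoint
from torch.cuda.amp import autocast, GradScaler
from torch.distributed.optim import ZeroRedundancyOptimizer

# model, dataloader, info_nce_in_batch: see the earlier sketches.
# ZeRO stage-1 style: shard optimizer states across ranks
# (requires torch.distributed to be initialized beforehand).
optimizer = ZeroRedundancyOptimizer(model.parameters(),
                                    optimizer_class=torch.optim.AdamW,
                                    lr=1e-4)

scaler = GradScaler()  # loss/grad scaling so FP16 gradients do not underflow

for query_ids, product_ids in dataloader:   # padded token-id batches
    optimizer.zero_grad()
    with autocast():                        # forward mostly in FP16
        # Activation checkpointing: discard intermediate activations and
        # recompute them in the backward pass. Non-reentrant mode is used
        # because the integer token-id inputs carry no gradient themselves.
        q_emb = checkpoint(model, query_ids, use_reentrant=False)
        p_emb = checkpoint(model, product_ids, use_reentrant=False)
        loss = info_nce_in_batch(q_emb, p_emb)
    scaler.scale(loss).backward()           # scaled FP16 gradients
    scaler.step(optimizer)                  # FP32 master update
    scaler.update()
```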
Results
Using a curated product‑term vocabulary, the model accurately retrieves the correct term even when the title lacks it. It also handles multi‑term titles and shows strong zero‑shot performance on hierarchical category prediction, surpassing traditional supervised pipelines.
Deployment
The approach is now used at Youzan for product‑term prediction, category prediction, similar‑item recommendation, search recall, ranking, smart copy generation, and risk control, consistently delivering higher stability than pure supervised models.
References
OpenAI CLIP, Microsoft Turing Bletchley, FaceNet, DSSM, SimCLR, InfoNCE analysis, ZeRO, PyTorch Lightning sharded training, gradient checkpointing, mixed‑precision training, and related arXiv papers.
Youzan Coder
Official Youzan tech channel, sharing technical insights and updates from the Youzan tech team.