How Contrastive Learning Revolutionizes Product Term Prediction in E‑commerce
By leveraging contrastive learning and large-scale click-through data, this article details a dual-tower model that encodes product titles and queries into a shared vector space. It walks through the loss function, batch-negative sampling, and distributed-training tricks, and shows how the approach outperforms traditional NER for product-term and category prediction.
Problem Background
Traditional NER methods struggle to extract product terms from e‑commerce titles because many titles are short or omit the product word entirely, leading to noisy or missing predictions.
Why Contrastive Learning?
Inspired by OpenAI’s CLIP and Microsoft’s Turing Bletchley, contrastive learning learns joint representations of related objects (e.g., query and product) by pulling positive pairs together and pushing negatives apart. This self‑supervised approach works with weak signals such as click‑through logs.
Contrastive Learning Principle
The core idea is to map objects into a vector space where similarity reflects relevance. The loss used is InfoNCE, which computes a softmax over one positive and many negatives. Temperature τ controls the sharpness of the distribution.
Loss Function Details
For a batch, each sample is a tuple (Q, P) where Q is the query and P is the product info. The similarity sim(x, y) is usually cosine similarity. With InfoNCE, the denominator includes the positive pair and many negatives, often the rest of the batch (e.g., CLIP trains with a batch size of 32,768, so each positive is contrasted against 32,767 in-batch negatives).
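Written out, the standard InfoNCE objective for one query q with its clicked product p⁺ and N negatives p₁…p_N takes the following form (a reconstruction of the textbook formula, not one copied from the article):

```latex
\mathcal{L}_{q} \;=\; -\log
  \frac{\exp\!\left(\operatorname{sim}(q,\,p^{+}) / \tau\right)}
       {\exp\!\left(\operatorname{sim}(q,\,p^{+}) / \tau\right)
        \;+\; \sum_{i=1}^{N} \exp\!\left(\operatorname{sim}(q,\,p_{i}) / \tau\right)}
```

With batch-negative sampling (next section), the p_i are simply the products belonging to the other samples in the same batch.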
Batch‑Negative Sampling
Instead of pre‑computing many negatives per sample, the batch itself provides negatives: each other query’s positive becomes a negative for the current query. This dramatically increases negative count without extra forward passes.
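A minimal PyTorch sketch of this in-batch InfoNCE loss (my own illustration, not the article's code; the symmetric query→product and product→query form follows CLIP and may differ from what was actually deployed):

```python
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb: torch.Tensor,
                      product_emb: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE with in-batch negatives.

    query_emb, product_emb: [B, D] embeddings of aligned (Q, P) pairs,
    i.e. row i of each tensor comes from the same click-through pair.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(product_emb, dim=-1)

    # [B, B] similarity matrix; entry (i, j) = sim(query_i, product_j).
    logits = q @ p.t() / temperature

    # The diagonal holds the positives; every off-diagonal entry is a negative.
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: classify the right product for each query and vice versa.
    loss_q2p = F.cross_entropy(logits, labels)
    loss_p2q = F.cross_entropy(logits.t(), labels)
    return (loss_q2p + loss_p2q) / 2
```

With a per-device batch of B pairs, each query gets B − 1 negatives for free; the distributed-training section below shows how B is grown further across machines.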
Model Design
The final system uses a dual‑tower (siamese) architecture where the same Transformer encoder (6 layers, 512‑dim embeddings, sequence length 100) processes both query and product info. Parameters are fully shared, reducing memory. The encoder outputs are stored in a vector database for fast retrieval.
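A rough sketch of such a shared-parameter tower, assuming a plain nn.TransformerEncoder with the quoted sizes (6 layers, 512-dim, length 100); the tokenizer, attention-head count, and pooling strategy are not described in the article, so mean pooling over non-padding tokens is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTowerEncoder(nn.Module):
    """Single Transformer encoder shared by both towers (query and product)."""

    def __init__(self, vocab_size: int, d_model: int = 512,
                 num_layers: int = 6, max_len: int = 100):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # nhead=8 is an assumption; the article only quotes layers/dim/length.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [B, L] padded with 0; returns [B, d_model] unit vectors.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        pad_mask = token_ids.eq(0)                       # True where padding
        h = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over real tokens, then L2-normalize for cosine similarity.
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return F.normalize(pooled, dim=-1)

# Because parameters are fully shared, the same module encodes both sides:
# q_emb = encoder(query_ids); p_emb = encoder(product_ids)
```

Full weight sharing keeps query and product embeddings in the same space, which is what allows product vectors to be pre-computed offline and stored in the vector database for nearest-neighbour lookup at serving time.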
Training Tricks
Temperature Hyper‑parameter
A constant τ = 0.1 is used after experiments showed that learning τ leads to overly small values that hurt representation quality.
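For contrast, a tiny illustration of the two options (variable names are mine): CLIP learns a log-scale temperature with an upper clamp, while the setup described here simply hard-codes τ.

```python
import torch
import torch.nn as nn

# Fixed temperature, as used here.
TAU = 0.1
# logits = sim_matrix / TAU

# CLIP-style learnable temperature: parameterized in log space and clamped
# so the effective temperature cannot shrink uncontrollably during training.
logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))
# logits = sim_matrix * logit_scale.exp().clamp(max=100.0)
```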
Distributed Training
To increase batch size, training runs on multiple machines (e.g., 5 nodes). Each node computes its own query and product embeddings; an all_gather operation concatenates embeddings across nodes, forming a large similarity matrix (e.g., 3500 × 3500) for loss computation.
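A hedged sketch of this cross-node step with torch.distributed (the node count and the ~3500 × 3500 matrix come from the article; re-inserting the local tensor so gradients survive all_gather is a common pattern I am assuming, not something the article spells out):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_local_grad(local_emb: torch.Tensor) -> torch.Tensor:
    """all_gather embeddings from every rank, keeping autograd for the
    local shard (all_gather itself does not propagate gradients)."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_emb) for _ in range(world_size)]
    dist.all_gather(gathered, local_emb)
    # Replace this rank's slot with the autograd-connected local tensor.
    gathered[dist.get_rank()] = local_emb
    return torch.cat(gathered, dim=0)

def distributed_info_nce(q_local, p_local, temperature=0.1):
    # q_local, p_local: [b, D] per-node embeddings, already L2-normalized.
    q_all = gather_with_local_grad(q_local)   # [world_size * b, D]
    p_all = gather_with_local_grad(p_local)
    logits = q_all @ p_all.t() / temperature  # e.g. roughly 3500 x 3500
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Each rank redundantly evaluates the full matrix, but gradients only flow back through its own embeddings; the usual DDP all-reduce then combines the per-rank parameter gradients.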
Memory‑Saving Techniques
ZeRO sharding: optimizer states, gradients, and parameters are partitioned across nodes, reducing per‑GPU memory.
Activation Checkpointing: only a subset of layer activations is stored; the rest are recomputed during back-propagation, cutting memory by ~60% at the cost of ~25% extra compute.
Mixed‑Precision Training: weights and activations use FP16 while updates stay in FP32, saving ~30% memory and speeding up training.
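A rough PyTorch sketch of how these three pieces can fit into the training loop (a simplified illustration under my own assumptions, not the article's code: `model` is the dual-tower encoder and `info_nce_in_batch` the loss from the earlier sketches, and a full ZeRO setup with gradient and parameter sharding would typically go through DeepSpeed or FairScale rather than this optimizer wrapper alone):

```python
import torch
from torch.utils.checkpoint import checkpoint
from torch.cuda.amp import autocast, GradScaler
from torch.distributed.optim import ZeroRedundancyOptimizer

# model, dataloader, info_nce_in_batch: see the earlier sketches.
# ZeRO stage-1 style: shard optimizer states across ranks
# (requires torch.distributed to be initialized beforehand).
optimizer = ZeroRedundancyOptimizer(model.parameters(),
                                    optimizer_class=torch.optim.AdamW,
                                    lr=1e-4)

scaler = GradScaler()  # loss/grad scaling so FP16 gradients do not underflow

for query_ids, product_ids in dataloader:   # padded token-id batches
    optimizer.zero_grad()
    with autocast():                        # forward mostly in FP16
        # Activation checkpointing: discard intermediate activations and
        # recompute them in the backward pass. Non-reentrant mode is used
        # because the integer token-id inputs carry no gradient themselves.
        q_emb = checkpoint(model, query_ids, use_reentrant=False)
        p_emb = checkpoint(model, product_ids, use_reentrant=False)
        loss = info_nce_in_batch(q_emb, p_emb)
    scaler.scale(loss).backward()           # scaled FP16 gradients
    scaler.step(optimizer)                  # FP32 master update
    scaler.update()
```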
Results
Using a curated product‑term vocabulary, the model accurately retrieves the correct term even when the title lacks it. It also handles multi‑term titles and shows strong zero‑shot performance on hierarchical category prediction, surpassing traditional supervised pipelines.
Deployment
The approach is now used at Youzan for product‑term prediction, category prediction, similar‑item recommendation, search recall, ranking, smart copy generation, and risk control, consistently delivering higher stability than pure supervised models.
References
OpenAI CLIP, Microsoft Turing Bletchley, FaceNet, DSSM, SimCLR, InfoNCE analysis, ZeRO, PyTorch Lightning sharded training, gradient checkpointing, mixed‑precision training, and related arXiv papers.
Youzan Coder
Official Youzan tech channel, sharing technical insights and updates from the Youzan tech team.