Contrastive Learning Perspective on Retrieval and Reranking Models in Recommendation Systems
This article explains how contrastive learning, originally popularized in computer vision, can be interpreted and applied to recommendation-system recall and coarse-ranking models. It covers the theoretical roots, representative architectures such as SimCLR, MoCo, and SwAV, and practical techniques including in-batch negatives, embedding normalization, temperature scaling, and graph-based extensions.
The talk introduces contrastive learning from a metric‑learning viewpoint, describing its origins in self‑supervised image tasks and its recent expansion to NLP and recommendation domains. It explains that contrastive learning seeks to pull positive pairs together while pushing negatives apart, typically using the InfoNCE loss.
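To make the pull-together/push-apart mechanics concrete, here is a minimal NumPy sketch of the InfoNCE loss. The function name, batch size, and temperature value are illustrative choices, not from the talk; each anchor's positive is the matching row, and all other rows act as negatives.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE: anchors[i] and positives[i] form a positive pair;
    every other row in the batch serves as a negative."""
    # L2-normalize so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the correct "class" for row i is column i (its own positive)
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
loss_mismatched = info_nce_loss(anchors, rng.normal(size=(8, 16)))
loss_matched = info_nce_loss(anchors, anchors + 0.01 * rng.normal(size=(8, 16)))
```

With near-identical positive pairs the diagonal similarities dominate and the loss approaches zero; with random "positives" it hovers near log(batch size), which is why the loss drives positives together and negatives apart.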
Key components of a contrastive system are (1) how positives are constructed (e.g., data augmentations or instance discrimination), (2) the encoder architecture (often a ResNet or transformer followed by a projector), and (3) the loss function (InfoNCE with a temperature parameter). Alignment (bringing positives close) and uniformity (dispersing embeddings uniformly) are identified as essential properties.
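Alignment and uniformity can be measured directly on normalized embeddings. The sketch below follows the standard definitions (mean positive-pair distance for alignment; log of the mean Gaussian potential over all pairs for uniformity); the constants `alpha=2` and `t=2` and the toy data are illustrative assumptions, not from the talk.

```python
import numpy as np

def alignment(x, y, alpha=2):
    # mean distance between positive-pair embeddings (lower = better aligned)
    return np.mean(np.linalg.norm(x - y, axis=1) ** alpha)

def uniformity(x, t=2):
    # log of the mean Gaussian potential over distinct pairs
    # (more negative = embeddings more uniformly dispersed)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))

# embeddings spread evenly on the unit circle vs. fully collapsed
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
spread = np.stack([np.cos(angles), np.sin(angles)], axis=1)
collapsed = np.tile([1.0, 0.0], (8, 1))
```

Collapsed embeddings score a perfect alignment but the worst possible uniformity (0), while spread embeddings trade a little alignment for much better uniformity, which is exactly the tension the contrastive loss balances.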
Typical image‑domain models are presented: SimCLR (in‑batch negatives, dual‑tower ResNet + projector), MoCo (momentum encoder and a large negative queue), and SwAV (cluster‑based contrastive learning). Their design choices illustrate how to avoid representation collapse and improve performance.
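MoCo's two distinctive design choices, the momentum-updated key encoder and the FIFO queue of past keys, can be sketched in a few lines. This is a structural illustration only; the class name, tiny linear "encoders", and hyperparameter values are assumptions for the sketch, not MoCo's actual implementation.

```python
import numpy as np
from collections import deque

class MoCoSketch:
    """Sketch of MoCo's momentum key encoder and negative queue."""
    def __init__(self, dim, queue_size=4096, momentum=0.999):
        self.m = momentum
        # stand-ins for the query/key encoder parameters, initialized equal
        self.query_weights = np.random.randn(dim, dim) * 0.01
        self.key_weights = self.query_weights.copy()
        # FIFO queue of past key embeddings used as extra negatives
        self.queue = deque(maxlen=queue_size)

    def momentum_update(self):
        # the key encoder trails the query encoder instead of
        # receiving gradients, keeping queued keys consistent
        self.key_weights = self.m * self.key_weights + (1 - self.m) * self.query_weights

    def enqueue(self, keys):
        for k in keys:           # oldest keys fall off automatically
            self.queue.append(k)

moco = MoCoSketch(dim=4, queue_size=8, momentum=0.9)
moco.query_weights += 1.0        # stand-in for a gradient step
moco.momentum_update()           # key weights drift 10% of the way
moco.enqueue(np.random.randn(10, 4))
```

The queue decouples the number of negatives from the batch size, which is how MoCo avoids SimCLR's need for very large batches.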
The article then maps these ideas to recommendation‑system recall and coarse‑ranking models, which commonly use a dual‑tower architecture. It argues that such models are essentially contrastive learners when (a) in‑batch negative sampling is used, (b) user and item embeddings are L2‑normalized (cosine similarity), and (c) a temperature scaling factor is added to the loss, mirroring InfoNCE.
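The three conditions (a)–(c) can be seen together in one sketch of a dual-tower training step. The tower function, dimensions, and temperature below are illustrative assumptions; real towers are deeper MLPs over rich user/item features.

```python
import numpy as np

def tower(x, w):
    # stand-in for a deeper MLP tower over user or item features
    return np.tanh(x @ w)

def two_tower_inbatch_loss(user_feats, item_feats, w_user, w_item, temperature=0.05):
    u = tower(user_feats, w_user)                      # user tower
    v = tower(item_feats, w_item)                      # item tower
    u = u / np.linalg.norm(u, axis=1, keepdims=True)   # (b) L2-normalize -> cosine
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (u @ v.T) / temperature                   # (c) temperature scaling
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # (a) diagonal = clicked item; off-diagonal = in-batch negatives
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(1)
users = rng.normal(size=(16, 32))       # one clicked item per user in the batch
items = rng.normal(size=(16, 24))
w_u = rng.normal(size=(32, 8)) * 0.1
w_i = rng.normal(size=(24, 8)) * 0.1
loss = two_tower_inbatch_loss(users, items, w_u, w_i)
```

Written this way, the recall model's sampled-softmax objective is term-for-term the InfoNCE loss, with the clicked item as the positive view of the user.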
Practical insights include: a larger batch size supplies more in-batch negatives; a lower temperature concentrates the loss on hard negatives; and L2-normalizing embeddings improves training stability and linear separability. Extensions such as adding a contrastive auxiliary loss on the item side (constructing positive views via dropout or feature masking) help long-tail items, and applying the same idea to the user side can further boost performance.
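For the item-side auxiliary loss, feature masking is the simplest way to manufacture two positive views of the same item without any extra labels. The function name and drop rate below are illustrative assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def feature_mask(features, drop_rate, rng):
    # zero out a random subset of feature dimensions per item,
    # producing a perturbed "view" of the same item
    mask = rng.random(features.shape) >= drop_rate
    return features * mask

item_feats = rng.normal(size=(4, 8))
view_a = feature_mask(item_feats, 0.3, rng)
view_b = feature_mask(item_feats, 0.3, rng)
# (view_a[i], view_b[i]) is a positive pair for the auxiliary contrastive
# loss; views of the other items in the batch serve as negatives
```

Because the augmentation needs no interaction data, long-tail items with few clicks still receive a training signal, which is exactly where the auxiliary loss helps most.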
Beyond dual‑tower models, the article discusses graph‑based recall using GNNs, where sub‑graph augmentations (node dropping, edge perturbation, feature masking, random walks) generate positive views for contrastive training. This approach can produce more robust user/item embeddings for sparse data scenarios.
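Of the sub-graph augmentations listed, edge perturbation is the easiest to sketch on a user-item interaction graph. The edge list and drop rate below are toy assumptions; a real pipeline would operate on the sampled sub-graph fed to the GNN.

```python
import numpy as np

def drop_edges(edges, drop_rate, rng):
    # keep each user-item interaction edge independently
    # with probability 1 - drop_rate
    keep = rng.random(len(edges)) >= drop_rate
    return [e for e, kept in zip(edges, keep) if kept]

edges = [("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3"), ("u3", "i3")]
rng = np.random.default_rng(7)
view_a = drop_edges(edges, 0.2, rng)   # first perturbed sub-graph
view_b = drop_edges(edges, 0.2, rng)   # second perturbed sub-graph
# a GNN encodes both views; the same node's embeddings across the two
# views form a positive pair, and other nodes' embeddings are negatives
```

Node dropping, feature masking, and random-walk sampling follow the same pattern: perturb the graph twice, then contrast the two resulting embeddings of each node.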
Finally, the talk suggests future directions, such as integrating contrastive objectives into ranking models, exploring new positive‑pair constructions, and leveraging the alignment‑uniformity theory to guide model design across the recommendation pipeline.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.