ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT is a contrastive self-supervised framework that fine-tunes BERT on augmented views of each sentence with the NT-Xent loss to counter embedding collapse. It improves STS performance by roughly 8% relative to prior methods, stays robust in few-shot and supervised settings, and is deployed in Meituan's NLP pipelines.

Meituan Technology Team

Although BERT-based models achieve great success on many downstream NLP tasks, the sentence embeddings directly extracted from BERT are often confined to a very small region of the representation space, resulting in abnormally high similarity scores (the so‑called “collapse” problem). This makes the raw BERT sentence vectors unsuitable for semantic matching tasks.

To address this collapse, the Meituan NLP team proposes ConSERT, a contrastive-learning-based sentence representation transfer method. By fine-tuning BERT on unlabeled target-domain corpora with a contrastive objective, ConSERT produces sentence embeddings that better match the data distribution of downstream tasks. Experiments on Semantic Textual Similarity (STS) benchmarks show that ConSERT achieves a relative improvement of about 8% over the previous state of the art under the same settings, and remains effective in few-shot scenarios.

1. Background

Sentence representation learning is crucial for many NLP applications such as semantic textual similarity and dense retrieval. Directly averaging BERT token embeddings yields low-quality sentence vectors with abnormally high inter-sentence similarity, in large part because high-frequency tokens dominate the average.
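To make the collapse symptom concrete, the following toy sketch mean-pools token vectors and measures the average pairwise cosine similarity between sentences. It does not use real BERT outputs: the "shared offset" added to every token vector is an assumption standing in for the common direction that dominates collapsed embeddings, and all names and sizes are hypothetical.

```python
import math
import random

def mean_pool(token_vectors, mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_pairwise_cosine(vectors):
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

rng = random.Random(0)
dim, n_sentences, n_tokens = 16, 8, 6
mask = [1, 1, 1, 1, 0, 0]  # last two positions are padding

def toy_sentence(offset):
    # Toy assumption: token vector = unit Gaussian noise + a fixed offset.
    return [[rng.gauss(0, 1) + o for o in offset] for _ in range(n_tokens)]

# A large shared offset mimics embeddings squeezed into a narrow region.
collapsed = [mean_pool(toy_sentence([5.0] * dim), mask) for _ in range(n_sentences)]
spread = [mean_pool(toy_sentence([0.0] * dim), mask) for _ in range(n_sentences)]

collapsed_score = average_pairwise_cosine(collapsed)
spread_score = average_pairwise_cosine(spread)
print(f"with shared offset: {collapsed_score:.3f}, without: {spread_score:.3f}")
```

When almost every pair of sentences scores close to 1, cosine similarity can no longer separate related sentences from unrelated ones, which is why such vectors are a poor fit for semantic matching.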

2. Related Work

Previous approaches include supervised sentence encoders (e.g., SBERT, USE), self‑supervised pre‑training objectives (NSP, MLM, Cross‑Thought), and post‑hoc transformations (BERT‑flow, whitening). Recent contrastive methods such as SimCSE have demonstrated strong performance.

3. Model Overview

ConSERT consists of three components:

A data‑augmentation module that creates two different “views” of the same sentence at the embedding level (adversarial attack, token shuffling, cutoff, dropout).

A shared BERT encoder that maps each view to a sentence vector.

A contrastive loss (NT‑Xent) that pulls together vectors of the same sentence while pushing apart vectors of different sentences within a batch.
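For reference, NT-Xent is commonly written as below. The notation is an assumption rather than taken from this article: N is the batch size, r_i and r_j are the two views of the same sentence among the 2N encoded views, sim is cosine similarity, and tau is the temperature.

```latex
\mathcal{L}_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(r_i, r_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(r_i, r_k)/\tau\right)}
```

The batch loss is the average of this term over all 2N views, each paired with its positive.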

The overall training pipeline samples a batch of sentences, generates two augmented versions per sentence, encodes them, and applies the contrastive loss with a temperature hyper‑parameter.
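The loss computation in this pipeline can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: real training would use a tensor library and a BERT encoder, whereas here the "encoded views" are hand-written 2-D vectors, and the function names and the default temperature of 0.1 are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent_loss(views_a, views_b, temperature=0.1):
    """Mean NT-Xent loss; views_a[i] and views_b[i] encode the same sentence."""
    reps = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = i + n if i < n else i - n
        # Temperature-scaled similarity to every other view in the batch.
        logits = {k: cosine(reps[i], reps[k]) / temperature
                  for k in range(2 * n) if k != i}
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits.values())
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += log_denominator - logits[positive]
    return total / (2 * n)

# Stand-ins for encoder outputs of three sentences (hypothetical values).
views_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]       # each view close to its partner
misaligned_b = [[0.1, 0.9], [-0.9, -0.1], [0.9, 0.1]]    # partners rotated by one

loss_aligned = nt_xent_loss(views_a, aligned_b)
loss_misaligned = nt_xent_loss(views_a, misaligned_b)
print(f"aligned: {loss_aligned:.4f}, misaligned: {loss_misaligned:.4f}")
```

The loss is small when each view is closest to its own partner and large when another sentence in the batch is closer, which is the signal that drives fine-tuning.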

4. Experiments

Experiments are conducted on seven STS datasets (STS‑12 to STS‑16, STS‑b, SICK‑R). Evaluation uses Spearman correlation between model‑predicted cosine similarity and human scores.
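The evaluation metric can be sketched as follows: Spearman correlation is the Pearson correlation of the rank-transformed scores, with tied values sharing their average rank. The function names and the example scores are hypothetical, and in practice a library routine would normally be used instead.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical model cosine similarities and human scores for five pairs.
predicted = [0.91, 0.35, 0.62, 0.10, 0.80]
human = [4.8, 1.5, 3.0, 0.4, 4.0]
rho = spearman(predicted, human)
print(f"Spearman correlation: {rho:.4f}")
```

Because only the ordering matters, the metric is unaffected by the different scales of cosine similarity and human ratings.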

4.1 Unsupervised setting: Fine-tuning BERT on unlabeled STS data yields an 8% relative gain over BERT-flow.

4.2 Supervised setting: Adding NLI data (SNLI, MNLI) and using three supervision-fusion strategies (joint, sup-unsup, joint-unsup) further improves performance; joint-unsup achieves the best results.

4.3 Data-augmentation analysis: The ablation shows that combining Token Shuffle with Feature Cutoff gives the highest average score (72.74). Used alone, the methods rank as Token Shuffle > Token Cutoff > Feature Cutoff ≈ Dropout > None.
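The embedding-level augmentations compared here can be sketched on a plain token-by-feature matrix. This is an illustration under assumptions, not the released code: the cutoff and dropout rates are hypothetical, and token shuffling is shown as a row permutation, whereas an actual implementation may instead shuffle the position ids fed to the encoder.

```python
import random

def token_shuffle(embeddings, rng):
    """Permute the token order (rows) of the embedding matrix."""
    shuffled = [row[:] for row in embeddings]
    rng.shuffle(shuffled)
    return shuffled

def token_cutoff(embeddings, rng, rate=0.15):
    """Zero out whole rows, i.e. erase randomly chosen tokens."""
    n_drop = max(1, int(len(embeddings) * rate))
    dropped = set(rng.sample(range(len(embeddings)), n_drop))
    return [[0.0] * len(row) if i in dropped else row[:]
            for i, row in enumerate(embeddings)]

def feature_cutoff(embeddings, rng, rate=0.2):
    """Zero out whole columns, i.e. erase randomly chosen feature dimensions."""
    dim = len(embeddings[0])
    n_drop = max(1, int(dim * rate))
    dropped = set(rng.sample(range(dim), n_drop))
    return [[0.0 if d in dropped else x for d, x in enumerate(row)]
            for row in embeddings]

def dropout(embeddings, rng, rate=0.1):
    """Zero out individual elements and rescale the survivors."""
    scale = 1.0 / (1.0 - rate)
    return [[0.0 if rng.random() < rate else x * scale for x in row]
            for row in embeddings]

# Toy 5-token, 10-dimensional embedding matrix with distinct non-zero entries.
embeddings = [[float(i * 10 + d + 1) for d in range(10)] for i in range(5)]
rng = random.Random(0)

# Two independently augmented views of the same sentence.
view_a = feature_cutoff(token_shuffle(embeddings, rng), rng)
view_b = feature_cutoff(token_shuffle(embeddings, rng), rng)
```

Both views keep the shape of the original matrix, so they can be passed through the same shared encoder.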

4.4 Few-shot analysis: Even with as few as 100 unlabeled sentences, ConSERT still outperforms baselines, demonstrating strong robustness.

4.5 Temperature study: The optimal temperature lies between 0.08 and 0.12. A temperature that is too high over-smooths the similarity distribution, while one that is too low makes the contrastive task too easy.
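One way to see the effect of temperature is to look at the softmax over temperature-scaled similarities that sits inside the contrastive loss. In the sketch below the similarity values and the three temperatures are hypothetical; the first entry plays the role of the positive pair.

```python
import math

def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities divided by the temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Cosine similarities of one anchor to its positive (first) and three negatives.
similarities = [0.80, 0.70, 0.20, -0.10]

positive_prob = {}
for temperature in (0.01, 0.1, 1.0):
    probs = softmax_with_temperature(similarities, temperature)
    positive_prob[temperature] = probs[0]
    print(temperature, [round(p, 3) for p in probs])
```

At a very low temperature the positive already takes almost all the probability mass, so the loss is near zero and gives little training signal. At a high temperature the distribution is nearly flat and hard negatives are barely distinguished from easy ones.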

4.6 Batch-size study: Larger batch sizes improve performance modestly, similar to findings in vision contrastive learning.

5. Conclusion

The work identifies the cause of BERT sentence-vector collapse and introduces ConSERT, a contrastive self-supervised framework that consistently improves sentence representations in both unsupervised and supervised fine-tuning, especially under limited data. The method has been deployed in Meituan's knowledge-graph construction, KBQA, and search recall pipelines. The code is open-sourced at https://github.com/yym6472/ConSERT.
