
ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT introduces a contrastive self‑supervised framework that enhances BERT‑derived sentence embeddings by applying efficient embedding‑level data augmentations, achieving significant improvements on semantic textual similarity tasks, especially in low‑resource settings, and outperforming previous state‑of‑the‑art methods.

DataFunTalk

1. Background

Sentence representation learning is crucial for many NLP tasks such as semantic textual similarity (STS) and dense text retrieval. BERT‑based models, while powerful, produce collapsed sentence embeddings that are overly similar regardless of semantic content, largely due to the dominance of high‑frequency words.

The authors identify two causes of this collapse: (1) BERT maps most sentences into a small region of the embedding space, leading to high similarity scores even for unrelated sentences, and (2) high‑frequency words dominate the average‑pooling of token embeddings, further reducing semantic discrimination.
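The second cause can be seen in a toy example: when a high‑norm, high‑frequency token dominates average pooling, two semantically unrelated sentences end up with nearly identical pooled vectors. A minimal NumPy sketch (all token vectors below are made up purely for illustration):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy token embeddings: "the" is a high-norm frequent word,
# while the content words are smaller and point in different directions.
the   = np.array([5.0, 5.0, 0.0])
cat   = np.array([1.0, 0.0, 0.5])
stock = np.array([0.0, 0.2, -1.0])

# Two unrelated one-content-word sentences, each average-pooled.
sent_a = np.mean([the, cat], axis=0)
sent_b = np.mean([the, stock], axis=0)

print(cos(cat, stock))      # content words alone: low similarity
print(cos(sent_a, sent_b))  # pooled vectors: dominated by "the", high similarity
```

Even though the content words are dissimilar, the pooled sentence vectors come out highly similar, which is exactly the discrimination loss described above.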

2. Related Work

2.1 Sentence Representation Learning

Early supervised methods leveraged Natural Language Inference (NLI) datasets (e.g., SNLI, MNLI) with BiLSTM or Transformer encoders (e.g., Universal Sentence Encoder, SBERT). Self‑supervised approaches such as BERT’s Next Sentence Prediction (NSP) and later variants (Cross‑Thought, CMLM, SLM) aim to learn sentence embeddings without labeled data. Unsupervised transfer methods like BERT‑flow, BERT‑whitening, and SimCSE improve the quality of BERT embeddings by post‑processing or contrastive learning.

2.2 Contrastive Learning

Contrastive learning, popularized in computer vision (MoCo, SimCLR), has been adapted to NLP for sentence encoding. It treats different augmented views of the same sentence as positives and other sentences in the batch as negatives, encouraging the model to bring positives closer while pushing negatives apart.

3. Model

3.1 Problem Definition

Given a pretrained language model (e.g., BERT) and an unlabeled corpus from the target domain, the goal is to fine‑tune the model via a self‑supervised task so that the resulting sentence embeddings perform well on downstream semantic matching tasks.

3.2 Contrastive Framework (ConSERT)

ConSERT consists of three components:

A data‑augmentation module that generates two different views of the same sentence at the embedding level.

A shared BERT encoder that produces sentence vectors for each view.

A contrastive loss layer (NT‑Xent) that maximizes similarity between the two views while separating different sentences.

During training, each sentence in a batch is augmented twice, yielding 2N samples that are encoded, pooled, and fed to the NT‑Xent loss with a temperature hyper‑parameter (set to 0.1 in experiments).
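The NT‑Xent objective over those 2N pooled vectors can be sketched in NumPy as follows; the convention that rows 2i and 2i+1 are the two views of sentence i, and all variable names, are mine rather than the paper's:

```python
import numpy as np

def nt_xent_loss(embeddings, temperature=0.1):
    """NT-Xent over a batch of 2N pooled sentence vectors, where rows
    2i and 2i+1 are the two augmented views of sentence i."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature      # temperature-scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)   # a view is never its own negative
    idx = np.arange(len(z))
    pos = idx ^ 1                    # partner view: 0<->1, 2<->3, ...
    # -log softmax of each positive against all other samples in the batch
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denom - sim[idx, pos]))
```

With the paper's temperature of 0.1, well‑aligned positive pairs drive the loss toward zero, while batches whose positives are far apart are heavily penalized.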

3.3 Embedding‑Level Data Augmentation

Four efficient augmentation strategies are explored, all operating on the embedding matrix:

Adversarial Attack: adds gradient‑based perturbations to the embeddings (requires a supervised loss to compute gradients).

Token Shuffling: shuffles position IDs, disrupting token order.

Cutoff: zeroes either entire token rows (Token Cutoff) or entire feature columns (Feature Cutoff).

Dropout: randomly zeroes individual embedding elements.

These methods avoid costly text‑level generation and directly modify the representations.
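A rough NumPy sketch of these strategies, applied to a token‑embedding matrix of shape (seq_len, hidden). In the paper Token Shuffling permutes the position IDs fed into BERT; permuting rows is an approximate stand‑in here, and the cutoff/dropout rates are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def token_shuffle(emb):
    """Permute token order (a stand-in for shuffling position IDs)."""
    return emb[rng.permutation(len(emb))]

def token_cutoff(emb, rate=0.15):
    """Zero out entire token rows."""
    out = emb.copy()
    out[rng.random(len(emb)) < rate] = 0.0
    return out

def feature_cutoff(emb, rate=0.15):
    """Zero out entire feature columns."""
    out = emb.copy()
    out[:, rng.random(emb.shape[1]) < rate] = 0.0
    return out

def dropout_aug(emb, rate=0.10):
    """Zero out individual embedding elements."""
    out = emb.copy()
    out[rng.random(emb.shape) < rate] = 0.0
    return out
```

Each call returns a perturbed view of the same shape, so two independent calls per sentence produce the two views the contrastive loss needs.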

3.4 Incorporating Supervised Signals

Three strategies combine supervised NLI loss with the unsupervised contrastive loss:

Joint training (weighted sum of both losses).

Supervised‑then‑unsupervised (train on NLI first, then fine‑tune with contrastive loss).

Joint‑then‑unsupervised (joint training followed by a pure contrastive stage).

Of the three, the joint‑then‑unsupervised variant yields the best results.
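The three schedules can be expressed as a tiny dispatcher stating which loss terms are active at a given training step; the function, strategy names, and `switch_step` value are mine, purely illustrative:

```python
def losses_for_step(strategy, step, switch_step=10000):
    """Return the loss terms (and weights) active at a training step
    under each of the three supervision strategies."""
    if strategy == "joint":
        # Weighted sum of both losses throughout training.
        return {"contrastive": 1.0, "nli": 1.0}
    if strategy == "sup-then-unsup":
        # Supervised NLI first, then a pure contrastive stage.
        return {"nli": 1.0} if step < switch_step else {"contrastive": 1.0}
    if strategy == "joint-then-unsup":
        # Joint stage first, then drop the supervised term.
        if step < switch_step:
            return {"contrastive": 1.0, "nli": 1.0}
        return {"contrastive": 1.0}
    raise ValueError(f"unknown strategy: {strategy}")
```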

4. Experiments

4.1 Unsupervised Setting

Fine‑tuning BERT on unlabeled STS data with ConSERT outperforms the previous state of the art, BERT‑flow, by roughly 8% relative improvement.

4.2 Supervised Setting

Adding labeled NLI data (SNLI, MNLI) further boosts performance; the joint‑then‑unsupervised approach achieves the highest scores across both supervised and semi‑supervised configurations.

4.3 Augmentation Ablation

Combining Token Shuffle with Feature Cutoff yields the best performance (72.74). Individually, Token Shuffle > Token Cutoff ≈ Feature Cutoff > Dropout > None.

4.4 Low‑Resource Regime

ConSERT maintains strong performance even with as few as 100 unlabeled sentences, demonstrating robustness in few‑shot scenarios.

4.5 Temperature Sensitivity

The temperature parameter critically affects results; values between 0.08 and 0.12 provide optimal performance.
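The sensitivity comes from how temperature sharpens the softmax inside the contrastive loss: a small temperature concentrates nearly all probability mass on the highest‑similarity sample. A quick illustration (the similarity values are made up):

```python
import numpy as np

def contrastive_softmax(sims, temperature):
    """Softmax over similarity scores scaled by 1/temperature."""
    logits = np.asarray(sims, dtype=float) / temperature
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

sims = [0.9, 0.5, 0.4]  # positive pair vs. two negatives (illustrative)
soft = contrastive_softmax(sims, 1.0)    # mild scaling: diffuse weighting
sharp = contrastive_softmax(sims, 0.1)   # paper-range scaling: peaked on the positive
```

Too large a temperature makes the loss nearly indifferent between positives and negatives; too small a one over‑focuses on the hardest negatives, which is consistent with the narrow optimum reported above.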

4.6 Batch Size Influence

Larger batch sizes generally improve results, though gains diminish beyond a certain point.

5. Conclusion

The paper analyzes the collapse problem of BERT sentence embeddings and proposes ConSERT, a contrastive self‑supervised framework that substantially improves sentence representations in both unsupervised and supervised fine‑tuning settings, especially when training data are scarce. The code is open‑sourced on GitHub and has been deployed in Meituan’s knowledge graph, KBQA, and search retrieval systems.

Tags: contrastive learning, self-supervised, sentence embeddings, BERT, semantic similarity
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
