LRC-BERT: Contrastive Learning based Knowledge Distillation with COS‑NCE Loss for Efficient NLP Models

The Amap team introduced LRC‑BERT, a contrastive‑learning‑based knowledge‑distillation framework that combines a novel COS‑NCE loss, gradient perturbation, and a two‑stage training schedule. The resulting 4‑layer student retains about 97% of BERT‑Base accuracy while being 7.5× smaller and 9.6× faster at inference, and the method has already improved real‑world traffic‑event extraction performance.

Amap Tech

The Amap Intelligent Technology Center R&D team designed a contrastive learning framework for knowledge distillation and proposed a novel COS‑NCE loss; the work was accepted at AAAI 2021.

Natural Language Processing (NLP) is crucial for many Amap services such as real‑time event naming, semantic understanding in search, and automatic responsibility attribution in shared‑ride calls. Recent breakthroughs in NLP are driven by large pre‑trained models; BERT, for example, dominates benchmarks and achieves state‑of‑the‑art performance on 11 NLP tasks.

However, BERT’s hundreds of millions of parameters lead to inference latencies of hundreds of milliseconds per sample, making deployment costly. Model compression, especially knowledge distillation, is a key solution: Hinton’s seminal work introduced the teacher‑student paradigm, and subsequent methods (BERT‑PKD, DistilBERT, TinyBERT) mainly focus on loss design while reducing transformer depth or hidden size.

The core challenge of knowledge distillation is capturing the latent semantic information in the teacher’s representations. Existing loss functions fit each sample’s outputs in isolation and therefore fail to capture this deeper semantic structure.

Our contributions are:

We introduce a contrastive learning framework for knowledge distillation and propose the COS‑NCE loss, which pulls positive samples together and pushes negative samples apart in the angular (cosine) space, enabling effective learning of latent semantics.

We incorporate gradient‑perturbation into the distillation process, improving model robustness.

We design a two‑stage training procedure: stage 1 optimizes only the contrastive loss (α:β:γ = 1:0:0) to focus on intermediate representations; stage 2 adds soft‑loss and hard‑loss (α:β:γ = 1:1:3) for downstream task performance.

Problem Definition

Teacher network: f_T(x,θ) → Z_T; Student network: f_S(x,θ′) → Z_S. The goal is to make Z_S close to Z_T while minimizing prediction‑layer loss.

Distillation Structure

We apply COS‑NCE to intermediate transformer layers. For a given teacher f_T and student f_S, a positive sample and K negative samples N={n₁⁻,…,n_K⁻} are selected. The loss minimizes the angular distance between Z_S and Z_T and maximizes the distance between Z_S and each negative sample.

The cosine‑based angular distance g(x, y) = 1 − cos(x, y) ∈ [0, 2] is used; for each negative n_i⁻ the per‑sample term is [2 − g(n_i⁻, Z_S)] + g(Z_T, Z_S), and COS‑NCE averages this term over the K negatives, so minimizing it simultaneously pulls Z_S toward Z_T and pushes it away from every negative.
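A minimal PyTorch sketch of this loss, assuming g(x, y) = 1 − cos(x, y) on the layer representations and that the K negative representations are supplied per sample; the batching and sampling details here are illustrative, not the paper’s exact implementation:

```python
import torch
import torch.nn.functional as F

def angular_distance(x, y):
    # g(x, y) = 1 - cos(x, y), ranges over [0, 2]
    return 1.0 - F.cosine_similarity(x, y, dim=-1)

def cos_nce_loss(z_s, z_t, negatives):
    """COS-NCE: pull z_s toward z_t, push it away from each negative.

    z_s, z_t:   (batch, dim) student / teacher representations
    negatives:  (batch, K, dim) K negative representations per sample
    """
    pos_term = angular_distance(z_t, z_s)                                       # (batch,)
    neg_term = angular_distance(negatives, z_s.unsqueeze(1).expand_as(negatives))  # (batch, K)
    # per-negative term: [2 - g(n_i, z_s)] + g(z_t, z_s), averaged over K negatives
    loss = (2.0 - neg_term).mean(dim=1) + pos_term
    return loss.mean()
```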

Distillation for Transformer Layers

We map the teacher’s N transformer layers onto the student’s M layers uniformly. For each mapped pair i, the student output h_i^S ∈ ℝ^{l×d} is aligned with the teacher output h_i^T ∈ ℝ^{l×d′} through a linear projection W ∈ ℝ^{d×d′} that matches the hidden sizes, and COS‑NCE is computed between h_i^S W and h_i^T.
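A sketch of the uniform layer mapping and the hidden‑size projection; the specific mapping rule (every N/M‑th teacher layer) and the module names are assumptions for illustration:

```python
import torch.nn as nn

class LayerAlign(nn.Module):
    """Project student hidden states (l x d) into the teacher's space (l x d')."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher, bias=False)  # W in R^{d x d'}

    def forward(self, h_student):
        return self.proj(h_student)

def uniform_layer_map(n_teacher: int, m_student: int):
    """Map each of the M student layers to one of the N teacher layers uniformly."""
    step = n_teacher // m_student
    # 0-indexed teacher layers; e.g. N=12, M=4 -> [2, 5, 8, 11]
    return [step * (i + 1) - 1 for i in range(m_student)]
```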

Distillation for Prediction Layer

The student’s prediction head learns from the teacher’s soft logits (via KL divergence) and from the ground‑truth hard labels (via cross‑entropy). The overall loss is L = α·L_{COS‑NCE} + β·L_{soft} + γ·L_{hard}.
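A hedged sketch of the combined objective, assuming the soft term is KL divergence between temperature‑scaled logits (with the conventional t² scaling) and the hard term is standard cross‑entropy; the function and argument names are illustrative:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      cos_nce, alpha, beta, gamma, t=1.1):
    # soft loss: KL divergence between temperature-softened distributions
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)
    # hard loss: cross-entropy against ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    # L = alpha * L_COS-NCE + beta * L_soft + gamma * L_hard
    return alpha * cos_nce + beta * soft + gamma * hard
```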

Training with Gradient Perturbation

Instead of back‑propagating the total loss directly, we compute the gradient of the total loss with respect to the student’s embedding output, ∇_{emb_S} L_total, use it to perturb the embedding, and then update the model parameters on the perturbed forward pass. This enhances robustness.
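A rough sketch of one such training step; the article only says the embedding is perturbed using the gradient ∇_{emb_S} L_total, so the normalized, epsilon‑scaled perturbation and the student.embed / student.encode helpers below are assumptions:

```python
import torch

def perturbed_training_step(student, teacher, batch, labels,
                            compute_total_loss, optimizer, epsilon=1e-3):
    # teacher outputs need no gradient
    with torch.no_grad():
        out_t = teacher(batch)

    # 1) clean forward pass; keep the student embedding so its gradient is retained
    emb_s = student.embed(batch)          # hypothetical: returns embedding-layer output
    emb_s.retain_grad()
    loss = compute_total_loss(student.encode(emb_s), out_t, labels)

    # 2) gradient of the total loss w.r.t. the student embedding
    loss.backward()
    grad = emb_s.grad.detach()

    # 3) perturb the embedding along the (normalized) gradient direction
    emb_adv = emb_s.detach() + epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)

    # 4) recompute the loss on the perturbed embedding and update the parameters
    optimizer.zero_grad()
    loss_adv = compute_total_loss(student.encode(emb_adv), out_t, labels)
    loss_adv.backward()
    optimizer.step()
    return loss_adv.item()
```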

Experiments

Datasets: GLUE benchmark (9 tasks). Teacher: BERT‑Base (12 layers, 768 hidden size). Student: 4‑layer transformer (312 hidden size). Two model variants: LRC‑BERT (pre‑training + task‑specific distillation) and LRC‑BERT1 (direct task‑specific distillation).

Training settings: learning rates {5e‑5, 1e‑4, 3e‑4}, batch size 16, 90 epochs for small datasets, 18 epochs for larger ones. Two‑stage schedule: first 80% steps α:β:γ = 1:0:0, last 20% steps α:β:γ = 1:1:3, temperature t = 1.1.
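A small helper capturing the two‑stage schedule (the 80/20 split and the 1:0:0 / 1:1:3 weights come from the settings above; the function itself is just a sketch):

```python
def loss_weights(step: int, total_steps: int):
    """Return (alpha, beta, gamma) for the two-stage schedule.

    Stage 1 (first 80% of steps): only the COS-NCE contrastive loss, 1:0:0.
    Stage 2 (last 20% of steps):  add soft and hard losses, 1:1:3.
    """
    if step < 0.8 * total_steps:
        return 1.0, 0.0, 0.0
    return 1.0, 1.0, 3.0
```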

Results: LRC‑BERT outperforms DistilBERT, BERT‑PKD, TinyBERT, retaining 97.4% of BERT‑Base performance on average. On large datasets, LRC‑BERT1 improves MNLI‑m, MNLI‑mm, QQP, QNLI by 0.3‑0.8%. On MRPC, RTE, CoLA, LRC‑BERT gains 4‑14.9% over LRC‑BERT1. Inference speed is 9.6× faster and model size is 7.5× smaller.

Ablation studies show that removing COS‑NCE degrades performance most severely, especially on CoLA (score drops from 50 to 37). Gradient perturbation and the two‑stage training both contribute positively to robustness and final accuracy.

Real‑World Deployment

The method has been deployed in Amap’s traffic‑event extraction pipeline, achieving a 4% accuracy and 3% recall improvement on weekdays, and 5% accuracy and 7% recall improvement on holidays, while preserving 97% of BERT‑Base performance and reducing inference cost.

Tags: contrastive learning, model compression, NLP, BERT, COS-NCE loss, gradient perturbation