How BERT‑to‑TextCNN Knowledge Distillation Boosts Spam Opinion Detection

This article examines how large pretrained BERT models can be compressed via knowledge distillation into a lightweight TextCNN classifier for efficient spam opinion detection. It covers traditional distillation methods, several practical schemes, experimental results, and the advantages of the approach.

Alibaba Cloud Developer

Jul 16, 2020

Introduction

Large‑scale pretrained models such as BERT achieve excellent results on NLP tasks, but their large parameter counts make inference slow, which hinders production deployment. Public‑opinion monitoring has to filter out a large volume of spam opinions efficiently. This article explores distilling BERT's knowledge into a TextCNN classifier, which is small and fast enough to deploy.

Traditional Distillation Scheme

Model compression and acceleration techniques are generally categorized into four types:

Parameter pruning and sharing

Low‑rank factorization

Transferred/compact convolutional filters

Knowledge distillation

Knowledge distillation transfers knowledge from a teacher network to a student network so that the student can achieve performance comparable to the teacher. This article focuses on knowledge distillation and how it can be applied.

1 Soft label

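The idea traces back to model compression work by Caruana et al., and Hinton, Vinyals, and Dean formulated it as knowledge distillation with temperature‑softened soft labels in “Distilling the Knowledge in a Neural Network”. By including the teacher network’s soft labels as part of the overall loss, a simpler student network can learn the teacher’s representations. The teacher’s class predictions carry information about how similar the classes are to one another, so the student can converge with few new samples. Raising the temperature in the softmax makes the output distribution smoother, which exposes more of this inter‑class information.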
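
The effect of temperature can be seen in a minimal sketch. The function name, the two‑class setup (spam, normal), and the logit values below are illustrative assumptions, not taken from the original experiments.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (spam, normal).
teacher_logits = [4.0, 1.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
smooth = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", sharp)
print("T=4:", smooth)
```

At the higher temperature the probability of the top class drops and the other class gains mass, so the student sees more of the teacher's relative preferences.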

Loss formula:
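
A commonly used form of this loss, following Hinton et al., combines a hard‑label cross‑entropy term with a temperature‑softened term. The notation here is assumed for illustration and may differ from the original article's: $z^s$ and $z^t$ are the student and teacher logits, $y$ is the hard label, $\sigma$ is the softmax, $T$ is the temperature, and $\alpha$ weights the two terms.

```latex
\mathcal{L} = \alpha \,\mathrm{CE}\big(y,\ \sigma(z^s)\big)
            + (1 - \alpha)\, T^2 \,\mathrm{KL}\big(\sigma(z^t / T)\ \big\|\ \sigma(z^s / T)\big)
```

The factor $T^2$ keeps the gradient magnitude of the soft term roughly comparable across temperatures.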

Advantages of distillation:

Learns the feature representation of large models and captures inter‑class information absent in one‑hot labels.

Provides noise robustness; teacher gradients can correct student gradients under noisy data.

Improves model generalization to some extent.

2 Using hints

FitNets (Romero et al., ICLR 2015) use not only the teacher’s final logits but also its intermediate hidden‑layer activations as hints, which makes it possible to train student networks that are deeper and thinner than the teacher.

Intermediate‑layer loss:
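
The hint loss from the FitNets paper can be written as follows. Here $u_h$ is the teacher up to its hint layer with parameters $W_{Hint}$, $v_g$ is the student up to its guided layer with parameters $W_{Guided}$, and $r$ is a regressor with parameters $W_r$ that maps the student's activations to the teacher's dimensionality. Treat this as a sketch of the standard formulation rather than the exact expression used in the original article.

```latex
\mathcal{L}_{HT}(W_{Guided}, W_r) =
  \frac{1}{2} \left\| u_h(x; W_{Hint}) - r\big(v_g(x; W_{Guided}); W_r\big) \right\|^2
```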

Adding this loss constrains the student’s solution space to stay close to the teacher’s intermediate representations, which narrows the search during training.

3 Co‑training

Route Constrained Optimization (RCO) (arXiv 2019) draws inspiration from curriculum learning. When the gap between teacher and student is large, distillation can fail. RCO addresses this with a route‑constrained hint learning process: the teacher is trained first, and checkpoints along its training path then guide the student in an easy‑to‑hard order.

Training path:

BERT‑to‑TextCNN Distillation Schemes

To improve accuracy while meeting latency constraints and limited GPU resources, several schemes were implemented to distill a BERT model into a TextCNN model.

Scheme 1: Offline logit TextCNN distillation

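Uses the traditional soft‑label method described above: the BERT teacher’s logits are computed offline, and TextCNN is trained against them.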
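
A minimal sketch of an offline logit distillation loss is shown below. It assumes the teacher's logits are stored ahead of time and looked up per sample. The names `distillation_loss` and `offline_teacher_logits`, the sample id, and the values of `alpha` and `temperature` are hypothetical, not values reported in the article.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    """alpha * hard-label cross-entropy + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term, computed at temperature 1.
    hard = -math.log(softmax(student_logits)[hard_label])
    # Soft-label term, computed on temperature-softened distributions.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    soft = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

# Teacher logits precomputed offline, keyed by a hypothetical sample id.
offline_teacher_logits = {"sample-1": [3.0, -1.0]}
teacher = offline_teacher_logits["sample-1"]

loss_close = distillation_loss([2.5, -0.5], teacher, hard_label=0)
loss_far = distillation_loss([-1.0, 3.0], teacher, hard_label=0)
print(loss_close, loss_far)
```

Because the teacher is never run during student training, this scheme only needs the GPU memory of TextCNN, at the cost of fixing the teacher's outputs in advance.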

Scheme 2: Joint training with isolated parameters

At each step the teacher is trained and its logits are passed to the student. Gradients are kept isolated: the teacher’s parameters are updated only by the hard‑label loss, while the student’s parameters are updated by a combination of the soft‑label loss from the teacher and the hard‑label loss.

Scheme 3: Joint training without parameter isolation

Similar to Scheme 2, but the student’s soft‑label gradients also update the teacher, allowing mutual improvement.
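
The difference between Schemes 2 and 3 comes down to whether the soft‑label loss sends a gradient back to the teacher's logits. The sketch below works this out by hand for a KL soft loss at temperature 1. The function name and the `isolate_teacher` flag are hypothetical, and the article does not state which soft loss was used. In an autograd framework, isolation would typically be done by stopping the gradient on the teacher's logits, for example with `detach()` in PyTorch.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_loss_gradients(teacher_logits, student_logits, isolate_teacher):
    """Gradients of KL(teacher || student) w.r.t. student and teacher logits."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    log_ratio = [math.log(pi / qi) for pi, qi in zip(p, q)]
    kl = sum(pi * r for pi, r in zip(p, log_ratio))
    # The student always receives the soft-label gradient: q - p.
    grad_student = [qi - pi for pi, qi in zip(p, q)]
    if isolate_teacher:
        # Scheme 2: teacher logits are constants for the soft loss.
        grad_teacher = [0.0] * len(p)
    else:
        # Scheme 3: the soft loss also flows back into the teacher.
        grad_teacher = [pi * (r - kl) for pi, r in zip(p, log_ratio)]
    return grad_student, grad_teacher

teacher_logits = [2.0, 0.0]
student_logits = [0.0, 1.0]
gs_isolated, gt_isolated = soft_loss_gradients(teacher_logits, student_logits, True)
gs_shared, gt_shared = soft_loss_gradients(teacher_logits, student_logits, False)
print("Scheme 2 teacher gradient:", gt_isolated)
print("Scheme 3 teacher gradient:", gt_shared)
```

In both cases the student's gradient is identical; only the teacher's update changes.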

Scheme 4: Joint training with loss addition

Both teacher and student are trained simultaneously in a multi‑task fashion, adding their losses together.

Scheme 5: Multi‑teacher

Historical online models serve as teachers, enabling the new model to retain knowledge of previous models and improve overall coverage.
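
The article does not say how the outputs of several teachers are combined. One simple option, shown here purely as an assumption, is a weighted average of the teachers' temperature‑softened probabilities. The function name, the logit values, and the temperature are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def multi_teacher_soft_labels(teacher_logits_list, temperature=2.0, weights=None):
    """Weighted average of each teacher's softened class probabilities."""
    n = len(teacher_logits_list)
    if weights is None:
        weights = [1.0 / n] * n  # equal weight per teacher by default
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(num_classes)]

# Hypothetical logits from three historical online models for one sample.
historical_logits = [[2.0, 0.0], [0.5, 1.5], [3.0, -1.0]]
soft_target = multi_teacher_soft_labels(historical_logits)
print(soft_target)
```

The averaged distribution can then serve as the soft label in the student's loss in place of a single teacher's output.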

Experimental Results

Results show that Scheme 3 outperforms Scheme 2 because feedback from the student’s soft loss improves the teacher. Scheme 4 leads to rapid degradation of TextCNN performance despite stable BERT performance. Scheme 5, using historical logits, slightly reduces recall but increases overall coverage by 5%.

References

Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network.

Romero A, Ballas N, Kahou S E, et al. FitNets: Hints for Thin Deep Nets.

Jin X, Peng B, Wu Y, et al. Knowledge Distillation via Route Constrained Optimization.
