Boosting BERT Text Classification with Label Embedding: How It Works

The paper proposes a simple yet effective method that fuses label embeddings into BERT, improving text‑classification performance without increasing computational cost. The approach is validated on six benchmark datasets, and the authors additionally explore tf‑idf‑based label augmentation and the effect of including or omitting the [SEP] token in the input.


Paper Overview

The authors introduce a concise technique that incorporates label embeddings directly into BERT to improve text‑classification results while keeping the computational overhead virtually unchanged. Experiments on six benchmark datasets demonstrate the method’s effectiveness.

Key Contributions

Text and label embeddings are learned jointly in the same latent space, eliminating extra intermediate mapping steps.

The approach leverages BERT's built-in self-attention to let label and text embeddings interact, without adding any new mechanism.

Because the original BERT architecture is retained, the method incurs negligible extra computation.

Extensive results on six datasets show that the technique unlocks additional BERT potential for text classification and related downstream tasks.

Model Architecture

Fusing Label Semantics with BERT

Inspired by BERT's sentence‑pair input format, the authors concatenate the label texts and the input document with a [SEP] token between them; the two parts are distinguished by separate segment embeddings.

Each document is tokenized into a sequence of sub‑word tokens x_1, …, x_n, where n is the document length. The number of classes is denoted C, and each class c has an associated label text (e.g., "world", "sports", "business", "science technology" for AGNEWS). When a label text splits into multiple sub‑words, their token embeddings are averaged to form a single label embedding.
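To make the sub‑word averaging concrete, here is a minimal sketch assuming Hugging Face's transformers library; the helper name label_embedding is illustrative, not from the paper.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
word_embeddings = bert.get_input_embeddings()  # BERT's token-embedding table

def label_embedding(label_text: str) -> torch.Tensor:
    """Average the sub-word token embeddings of one label text."""
    ids = tokenizer(label_text, add_special_tokens=False)["input_ids"]
    vecs = word_embeddings(torch.tensor(ids))  # (num_subwords, hidden_size)
    return vecs.mean(dim=0)                    # one vector per label

emb = label_embedding("science technology")   # e.g., an AGNEWS label
```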

The combined sequence (labels + [SEP] + document) is fed into BERT's self‑attention layers; this configuration is referred to as w/ [SEP]. A tanh‑activated linear layer on top of the [CLS] embedding then performs classification, trained with cross‑entropy loss.
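A minimal end‑to‑end sketch of the w/ [SEP] setup, again assuming Hugging Face's transformers; the class name LabelFusedBert and the bert-base-uncased checkpoint are our assumptions, and the label sub‑word averaging above is simplified away by letting BERT embed the label tokens directly.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class LabelFusedBert(nn.Module):  # hypothetical name for this sketch
    def __init__(self, num_classes: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        # pooler_output is BERT's tanh-activated linear layer over [CLS]
        return self.classifier(out.pooler_output)

# Sentence-pair encoding: label texts first, document second, so the
# tokenizer produces [CLS] labels [SEP] document [SEP] with two segment ids.
enc = tokenizer("world sports business science technology",
                "Stocks rallied after the quarterly earnings report ...",
                return_tensors="pt", truncation=True)
model = LabelFusedBert(num_classes=4)
logits = model(**enc)  # train with nn.CrossEntropyLoss on these logits
```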

The w/o [SEP] Variant: Single‑Sentence Input

When the label texts and document are concatenated without a [SEP] token, the variant is referred to as w/o [SEP]. It treats the concatenated string as a single sentence rather than an artificial sentence pair, which affects how well fine‑tuning aligns with BERT's pre‑training; the tokenizer‑level difference is sketched below.
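A hedged sketch of the two input formats at the tokenizer level; the label and document strings are illustrative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

labels = "world sports business science technology"
doc = "Stocks rallied after the quarterly earnings report ..."

# w/ [SEP]: sentence-pair encoding -> [CLS] labels [SEP] doc [SEP],
# with distinct token_type (segment) ids for the two parts.
pair = tokenizer(labels, doc)

# w/o [SEP]: a single concatenated string -> [CLS] labels doc [SEP],
# one segment throughout, no artificial sentence boundary.
single = tokenizer(labels + " " + doc)
```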

tf‑idf‑Based Label Augmentation

Beyond embedding the raw label text, the authors experiment with enriching each label by appending high‑scoring sub‑words selected by tf‑idf. Using BERT's WordPiece tokenizer, they compute average tf‑idf scores for sub‑words and select the top 5, 10, 15, or 20 to supplement the label representation, further boosting performance.
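A rough sketch of this augmentation, assuming scores are computed per class over that class's training documents with scikit‑learn's TfidfVectorizer driven by BERT's WordPiece tokenizer; the function top_subwords and the sample documents are illustrative, not the paper's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize with BERT's WordPiece so tf-idf scores are computed per sub-word.
vectorizer = TfidfVectorizer(tokenizer=bert_tok.tokenize, lowercase=False)

def top_subwords(class_docs, top_k=10):
    """Return the top_k sub-words by average tf-idf over one class's docs."""
    tfidf = vectorizer.fit_transform(class_docs)   # shape: (n_docs, vocab)
    avg = tfidf.mean(axis=0).A1                    # average score per sub-word
    vocab = vectorizer.get_feature_names_out()
    return [vocab[i] for i in avg.argsort()[::-1][:top_k]]

# Illustrative usage: augment the "sports" label with its top sub-words.
sports_docs = ["The team clinched the championship in overtime.",
               "A record transfer fee shook the football league."]
augmented_label = "sports " + " ".join(top_subwords(sports_docs, top_k=5))
```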

Experimental Setup

Datasets

The evaluation uses several public text‑classification corpora, including AGNEWS (4 classes) and DBpedia (14 classes). Because the label texts are prepended to every input, sequence length, and thus inference overhead, grows with the number of classes; the number of labels is therefore kept moderate to avoid performance degradation.

Results and Analysis

Results show that the w/o[SEP] configuration consistently outperforms the w/[SEP] variant. The authors attribute this to the mismatch between BERT’s next‑sentence‑prediction pre‑training objective and the artificial concatenation of label and document with [SEP], which can cause a bias that harms fine‑tuning.

t‑SNE visualizations of the learned representations reveal that label‑augmented embeddings form more distinct clusters than the standard [CLS] embeddings, explaining the improved classification capability.
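For readers who want to reproduce this kind of plot, a minimal sketch with scikit‑learn's TSNE; cls_embeddings and labels are assumed to be the (N, hidden_size) [CLS] vectors and gold class ids collected from the fine‑tuned model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_cls_tsne(cls_embeddings: np.ndarray, labels: np.ndarray) -> None:
    """Project (N, hidden_size) [CLS] vectors to 2-D and color by class."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(cls_embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("t-SNE of [CLS] representations")
    plt.show()
```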

Conclusion

The proposed method seamlessly integrates label embeddings into BERT, delivering notable gains on small‑to‑medium datasets with few classes.

Supplementing label texts with high‑tf‑idf sub‑words provides an additional performance boost.

Tags: deep learning, NLP, BERT, TF-IDF, text classification, label embedding