Semi‑Supervised Training Methods for Transformers

This article explains an end‑to‑end semi‑supervised training pipeline for Transformer‑based NLP models, detailing the unsupervised language‑model pre‑training, supervised fine‑tuning, and the internal architecture of embeddings, encoder layers, and downstream tasks such as text classification and NER.


Two‑Step Semi‑Supervised Training

First, an unsupervised stage trains a language model (for example, BERT or GPT) on a large corpus of unlabeled text. This adjusts the model weights to capture contextual information in domains such as fashion, news, or sports.

Second, a supervised stage fine-tunes the model on the downstream task with a smaller labeled dataset. When the unsupervised corpus is large, high accuracy can be achieved even with limited labeled data. A minimal sketch of the two stages follows.
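The sketch below outlines both stages with the Hugging Face Transformers library. The dataset objects (`unlabeled_dataset`, `labeled_dataset`), the output paths, and the number of labels are placeholders, not values from the article.

```python
# A minimal sketch of the two-stage pipeline, assuming Hugging Face Transformers.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification,
                          Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Stage 1: unsupervised -- adapt the language model to the unlabeled domain corpus.
lm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
# ... build `unlabeled_dataset` from raw domain text (placeholder), then:
# Trainer(model=lm_model, args=TrainingArguments("lm-out"),
#         train_dataset=unlabeled_dataset, data_collator=collator).train()
lm_model.save_pretrained("fashion-bert")  # hypothetical checkpoint name

# Stage 2: supervised -- reuse the adapted encoder weights for the downstream task.
clf_model = AutoModelForSequenceClassification.from_pretrained("fashion-bert", num_labels=5)
# Trainer(model=clf_model, args=TrainingArguments("clf-out"),
#         train_dataset=labeled_dataset).train()
```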

Embedding Layer

The embedding layer converts raw text into a model‑readable matrix and consists of three components:

Word embeddings: BERT's vocabulary contains 30,522 tokens, and each token is represented by a 768-dimensional vector. For a four-word input the embedding output is a 4 × 768 matrix (a quick shape check follows this list).

Position embeddings: Encode token order, which is necessary because self-attention on its own is permutation-invariant and would otherwise ignore word positions.

Linear layer: A linear projection that produces the final matrix with 768 features per token.
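A small sketch to verify the shapes described above, assuming the `bert-base-uncased` checkpoint. Note that the tokenizer adds [CLS] and [SEP] and may split words into subwords, so a "four-word" sentence usually yields more than four rows.

```python
# Shape check for the embedding layer of BERT-base.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

print(model.config.vocab_size)   # 30522 tokens in the vocabulary
print(model.config.hidden_size)  # 768 features per token

inputs = tokenizer("women floral print top", return_tensors="pt")
# Word + position (+ token-type) embeddings, followed by layer norm:
embeddings = model.embeddings(inputs["input_ids"])
print(embeddings.shape)          # (1, sequence_length, 768)
```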

Encoder Layer

The encoder stacks 12 identical BERT layers (in BERT-base). Each layer contains two main sub-components:

Self-attention layer: Computes queries (Q), keys (K) and values (V) through learned linear projections, then combines them with scaled dot-products and a softmax, yielding a 768-dimensional output per token. Multi-head attention splits the 768 dimensions across 12 heads of 64 dimensions each and concatenates the head outputs back to 768.

Feed-forward layer: Two dense layers with a GELU activation, plus layer normalization and dropout, preserving the 768-dimensional feature size. A minimal attention sketch follows the diagram below.

Multi-head attention diagram
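The following sketch shows single-head scaled dot-product self-attention on 768-dimensional vectors; it is an illustration of the mechanism, not BERT's actual multi-head implementation.

```python
# Single-head scaled dot-product self-attention over 768-dim hidden states.
import math
import torch
import torch.nn as nn

hidden_size = 768
x = torch.randn(1, 4, hidden_size)                  # (batch, seq_len, hidden)

# Q, K and V come from learned linear projections of the same hidden states.
w_q, w_k, w_v = (nn.Linear(hidden_size, hidden_size) for _ in range(3))
q, k, v = w_q(x), w_k(x), w_v(x)

scores = q @ k.transpose(-2, -1) / math.sqrt(hidden_size)  # scaled dot products
weights = scores.softmax(dim=-1)                            # attention distribution
context = weights @ v                                       # still (1, 4, 768)
print(context.shape)
```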

Context Vector

The outputs of the embedding and encoder layers form a 768‑dimensional context vector that represents the sentence meaning. This vector is the primary output of the unsupervised stage and serves as input to downstream classifiers or generators.
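A sketch of extracting such a vector with a stock BERT checkpoint; here the [CLS] token's last hidden state is used as the sentence representation (mean pooling over tokens is a common alternative).

```python
# Extract a 768-dim context vector for a sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("women floral print jersey top", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

context_vector = outputs.last_hidden_state[:, 0, :]  # [CLS] position -> (1, 768)
print(context_vector.shape)
```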

Language‑Modeling Objectives

Masked Language Model (MLM)

Between 10% and 30% of the tokens are masked (the original BERT recipe masks 15%) and the model predicts the missing words. Example input: "women Floral print [MASK] top". The MLM head places a dense projection on top of the encoder that maps each masked position to a 30,522-dimensional output vector (the vocabulary size). Training on 20 M fashion sentences shows a noticeable improvement over the original BERT.
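A sketch of the MLM objective using the article's example sentence with stock BERT; a domain fine-tuned checkpoint would be queried the same way.

```python
# Mask one token and let the MLM head score the full 30,522-word vocabulary.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("women floral print [MASK] top", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, 30522)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top_ids = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))  # top candidates for the mask
```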

MLM input example

Causal Language Model (CLM)

CLM predicts the next token given the previous ones and is typically used for generative models such as GPT. Comparing a fashion fine-tuned GPT-2 with the original GPT-2 shows clearer, more domain-appropriate generation from the fine-tuned model.
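A sketch of causal next-token generation with stock GPT-2; a domain fine-tuned checkpoint (the "fashion GPT-2" mentioned above, whatever its actual name) would be loaded the same way.

```python
# Autoregressive generation: each token is predicted from the preceding ones.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("women floral print", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```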

GPT-2 fine-tuning comparison

Downstream Tasks

Text Classification

A dense layer reduces the 768‑dimensional context vector to the number of classes. Better context vectors lead to more robust and accurate classifiers. Word‑level confidence scores indicate each token’s contribution to the predicted class.

For the example sentence "pantaloons green shirt with jet, wear it with black jeans", the model correctly predicts the "shirt" class, with the token "shirt" receiving the highest confidence. Changing the ground-truth label to "jeans" shifts the confidence scores toward "jeans" accordingly.
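A sketch of the classification head: a single dense layer mapping the 768-dimensional context vector to class logits. The label set is an illustrative assumption, and the head is untrained here, so the probabilities are not meaningful until fine-tuning.

```python
# Dense layer from the 768-dim context vector to the class logits.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

labels = ["shirt", "jeans", "dress"]             # hypothetical fashion classes
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, len(labels))  # 768 -> num_classes

inputs = tokenizer("pantaloons green shirt with jet, wear it with black jeans",
                   return_tensors="pt")
with torch.no_grad():
    context = encoder(**inputs).last_hidden_state[:, 0, :]       # (1, 768)

logits = classifier(context)                     # (1, 3); untrained in this sketch
print(logits.softmax(dim=-1))
```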

Classification with context vector

Named Entity Recognition (NER)

The same context vector is fed to a NER head that predicts a label for each token. Because the data is fashion‑specific, the entity labels are fine‑grained, supporting phrase discovery and trend analysis.
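A sketch of the NER setup: the same encoder states, with a per-token classification head. The fine-grained fashion label names and the number of labels are illustrative assumptions, and the head is randomly initialized here.

```python
# Token classification: one label prediction per input token.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

entity_labels = ["O", "B-COLOR", "I-COLOR", "B-PRODUCT", "I-PRODUCT"]  # hypothetical
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(entity_labels))

inputs = tokenizer("women floral print jersey top", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, num_labels)

predictions = logits.argmax(dim=-1)[0]            # label id per token (untrained head)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, [entity_labels[i] for i in predictions.tolist()])))
```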

NER with context vector

Resources

Training notebooks and example code are hosted on Hugging Face and Google Colab:

Transformer notebooks: https://huggingface.co/docs/transformers/notebooks

Fine-tune language model: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb

Text classification: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb

NER model: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb