Semi‑Supervised Training Methods for Transformers
This article explains an end‑to‑end semi‑supervised training pipeline for Transformer‑based NLP models, detailing the unsupervised language‑model pre‑training, supervised fine‑tuning, and the internal architecture of embeddings, encoder layers, and downstream tasks such as text classification and NER.
Two‑Step Semi‑Supervised Training
First, an unsupervised stage trains a language model (e.g., by pre‑training or fine‑tuning BERT or GPT) on a large corpus of unlabeled text. This adjusts the model weights to capture contextual information across domains such as fashion, news, or sports.
Second, a supervised stage trains the downstream task with a smaller labeled dataset. When the unsupervised corpus is large, high accuracy is achieved even with limited labeled data.
Embedding Layer
The embedding layer converts raw text into a model‑readable matrix and consists of three components:
Word embeddings : BERT’s vocabulary contains 30,522 tokens; each token is represented by a 768‑dimensional vector. For a four‑token input the embedding output is 4 × 768.
Position embeddings : Encode token order because Transformer attention depends on position.
Segment (token‑type) embeddings : Distinguish sentence A from sentence B in paired inputs. The three embeddings are summed and layer‑normalized, producing a final matrix with 768 features per token.
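The embedding arithmetic can be sketched in numpy. This is a toy reimplementation with randomly initialized tables (only the word and position tables are shown for brevity, and the token ids are illustrative, not real BERT ids):

```python
import numpy as np

# Toy dimensions mirroring BERT-base: 30,522-token vocabulary, 768 features.
VOCAB_SIZE, MAX_POS, HIDDEN = 30_522, 512, 768

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(VOCAB_SIZE, HIDDEN)).astype(np.float32)  # one row per vocab token
pos_emb = rng.normal(size=(MAX_POS, HIDDEN)).astype(np.float32)      # one row per position

def embed(token_ids):
    """Look up word embeddings and add position embeddings element-wise."""
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + pos_emb[positions]

x = embed([101, 2450, 3509, 102])  # four token ids -> a 4 x 768 matrix
print(x.shape)
```

With real weights the lookup works exactly the same way; only the table contents are learned.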
Encoder Layer
The encoder comprises 12 identical layers (in BERT‑base). Each layer contains two main sub‑components:
Self‑attention layer : Computes queries (Q), keys (K) and values (V) via learned linear projections, then combines them with scaled dot‑products and a softmax, yielding a 768‑dimensional output per token. Multi‑head attention splits the 768 dimensions across heads and concatenates the head outputs back to 768.
Feed‑forward layer : A position‑wise feed‑forward network with dense layers, layer normalization, dropout and GELU activation, preserving the 768‑dimensional feature size.
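A single‑head version of the attention computation can be sketched in numpy. The projection matrices are random here (a real model learns them, and uses multiple heads), but the shapes and the dot‑product/softmax flow match the description above:

```python
import numpy as np

HIDDEN = 768
rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Learned linear projections for Q, K, V (randomly initialized in this sketch).
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(HIDDEN, HIDDEN)) for _ in range(3))

def self_attention(x):
    """Scaled dot-product self-attention over a (seq_len, 768) input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(HIDDEN)  # (seq, seq) token-to-token affinities
    return softmax(scores) @ v          # (seq, 768): feature size is preserved

x = rng.normal(size=(4, HIDDEN))
out = self_attention(x)
print(out.shape)  # (4, 768)
```

Note the output keeps the same 4 × 768 shape as the input, which is what lets 12 such layers be stacked.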
Context Vector
The outputs of the embedding and encoder layers form a 768‑dimensional context vector that represents the sentence meaning. This vector is the primary output of the unsupervised stage and serves as input to downstream classifiers or generators.
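Since the encoder emits one 768‑dimensional vector per token, the sentence‑level context vector has to be pooled from them. Two common choices, sketched with numpy on random data (which scheme a given model uses depends on its head):

```python
import numpy as np

rng = np.random.default_rng(2)
encoder_out = rng.normal(size=(4, 768))  # one 768-d vector per token

# Option 1: take the first ([CLS]-style) token's vector as the sentence summary.
cls_vector = encoder_out[0]

# Option 2: mean-pool across all token vectors.
mean_vector = encoder_out.mean(axis=0)

print(cls_vector.shape, mean_vector.shape)  # (768,) (768,)
```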
Language‑Modeling Objectives
Masked Language Model (MLM)
A fraction of the input tokens (15% in the original BERT recipe) is masked and the model predicts the missing words. Example input: "women Floral print [MASK] top". The MLM head adds a linear layer on top of the encoder that projects each masked position to a 30,522‑dimensional output vector (the vocabulary size). Training on 20 M fashion sentences shows a noticeable improvement over the original BERT.
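The MLM head's shape bookkeeping can be sketched in numpy. The weights are random, so the predicted token is meaningless; the point is the 30,522‑way output over the vocabulary (103 is BERT's real `[MASK]` id, the other token ids are illustrative):

```python
import numpy as np

VOCAB_SIZE, HIDDEN, MASK_ID = 30_522, 768, 103
rng = np.random.default_rng(3)

token_ids = np.array([2450, 3509, 1000, 2327])  # stand-in for "women floral print top"
mask_pos = 2
masked = token_ids.copy()
masked[mask_pos] = MASK_ID  # hide one token; the model must recover it

# Stand-in encoder output for the masked position, plus the linear MLM head
# projecting 768 features onto the full vocabulary.
hidden = rng.normal(size=(HIDDEN,))
W_head = rng.normal(scale=0.02, size=(HIDDEN, VOCAB_SIZE))

logits = hidden @ W_head        # (30522,): one score per vocabulary entry
predicted_id = logits.argmax()  # training pushes this toward the original token
print(logits.shape)
```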
Causal Language Model (CLM)
CLM predicts the next token given the previous ones and is typically used for generative models such as GPT. A GIF compares a fashion‑fine‑tuned GPT‑2 with the original GPT‑2, showing clearer generation for the domain‑specific model.
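What makes a model "causal" is its attention mask: each token may only attend to itself and earlier tokens, so the model can be trained to predict every next token from its prefix. A minimal numpy sketch:

```python
import numpy as np

seq_len = 4
# Lower-triangular mask: position i may attend to positions <= i only.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

scores = np.random.default_rng(4).normal(size=(seq_len, seq_len))
scores = np.where(mask, scores, -np.inf)  # block attention to future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # the first token can only attend to itself: [1, 0, 0, 0]
```

BERT‑style MLM omits this mask (every token sees the whole sentence), which is the structural difference between the two objectives.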
Downstream Tasks
Text Classification
A dense layer reduces the 768‑dimensional context vector to the number of classes. Better context vectors lead to more robust and accurate classifiers. Word‑level confidence scores indicate each token’s contribution to the predicted class.
For the example sentence "pantaloons green shirt with jet, wear it with black jeans", the model correctly predicts the "shirt" class, with the token "shirt" receiving the highest confidence. Changing the ground‑truth label to "jeans" flips the confidence scores accordingly.
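The classification head itself is small; a numpy sketch with random weights and three illustrative class names (real heads are trained, and the class set depends on the dataset):

```python
import numpy as np

HIDDEN, NUM_CLASSES = 768, 3  # illustrative classes: shirt / jeans / dress
rng = np.random.default_rng(5)

context = rng.normal(size=(HIDDEN,))  # sentence-level context vector from the encoder

# Dense layer reducing 768 features to one logit per class.
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

logits = context @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax: a probability distribution over the classes
print(probs.shape, probs.sum())
```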
Named Entity Recognition (NER)
The same context vector is fed to a NER head that predicts a label for each token. Because the data is fashion‑specific, the entity labels are fine‑grained, supporting phrase discovery and trend analysis.
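The NER head applies the same kind of projection per token rather than per sentence. A numpy sketch with random weights and an illustrative fashion label set:

```python
import numpy as np

HIDDEN, NUM_LABELS = 768, 5  # e.g. O, B-BRAND, I-BRAND, B-COLOR, I-COLOR (illustrative)
rng = np.random.default_rng(6)

encoder_out = rng.normal(size=(4, HIDDEN))  # one 768-d vector per input token
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_LABELS))

logits = encoder_out @ W           # (4, 5): a score per label, per token
labels = logits.argmax(axis=-1)    # one predicted entity label id per token
print(logits.shape, labels.shape)
```

The per‑token output is what distinguishes NER from classification: the head is applied to every position instead of to one pooled context vector.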
Resources
Training notebooks and example code are hosted on Hugging Face and Google Colab:
Transformer notebooks: https://huggingface.co/docs/transformers/notebooks
Fine‑tune language model: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb
Text classification: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb
NER model: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/token_classification.ipynb
Code DAO
We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!