Understanding BERT: Architecture, Pre‑training, Fine‑tuning and Applications in Modern NLP
This article provides a comprehensive overview of BERT and related NLP advances, covering its historical context, model architecture, input‑output mechanisms, comparisons with CNNs, word‑embedding evolution, pre‑training strategies like MLM and next‑sentence prediction, and practical guidance for fine‑tuning and feature extraction.
1. Introduction
2018 marked a turning point for machine‑learning models that process text, with rapid progress in representing words and sentences to capture semantics and relationships. The NLP community released powerful, freely available components that can be plugged into pipelines, akin to an "ImageNet moment" for NLP.
ULM‑FiT and other breakthroughs paved the way, and the release of BERT was a major milestone: it set new records on several benchmarks and made large‑scale pre‑trained models publicly available.
2. Example: Sentence Classification
The most direct way to use BERT is to classify a single sentence. The typical architecture adds a classifier on top of the pre‑trained BERT encoder. During fine‑tuning, training mainly updates the classifier, while the BERT weights change only slightly.
This approach falls under supervised learning and requires a labeled dataset, such as a spam‑vs‑ham email collection.
Other examples of this use case include:
Sentiment analysis: input a product or movie review, output positive/negative sentiment (e.g., SST dataset).
Fact‑checking: input a sentence, output whether it makes a factual claim (see related video).
3. Model Architecture
BERT is built on the Transformer encoder stack. Two model sizes are described:
BERT BASE – comparable in size to the original OpenAI Transformer.
BERT LARGE – a much larger model that achieves state‑of‑the‑art results.
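Both variants stack many encoder layers (Transformer blocks): 12 for BERT BASE and 24 for BERT LARGE. They also use larger hidden sizes (768 and 1024) and more attention heads (12 and 16) than the default configuration in the original Transformer paper (6 encoder layers, 512 hidden units, 8 attention heads).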
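The size differences can be made concrete with a small configuration sketch. The class name EncoderConfig and its field names are hypothetical, chosen only for this illustration; the layer, hidden-size, and head counts are the values published for each model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the main encoder hyperparameters."""
    num_layers: int
    hidden_size: int
    num_heads: int

    @property
    def head_size(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        return self.hidden_size // self.num_heads


# Default configuration from the original Transformer paper.
TRANSFORMER_PAPER = EncoderConfig(num_layers=6, hidden_size=512, num_heads=8)
BERT_BASE = EncoderConfig(num_layers=12, hidden_size=768, num_heads=12)
BERT_LARGE = EncoderConfig(num_layers=24, hidden_size=1024, num_heads=16)

for name, cfg in [("Transformer", TRANSFORMER_PAPER),
                  ("BERT BASE", BERT_BASE),
                  ("BERT LARGE", BERT_LARGE)]:
    print(name, cfg.num_layers, cfg.hidden_size, cfg.num_heads, cfg.head_size)
```

All three configurations end up with the same per-head width, so the larger models grow mainly by adding layers and heads rather than by widening each head.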
4. Model Input
The first token is the special [CLS] token, which aggregates information for classification. Tokens are fed through successive self‑attention and feed‑forward layers, exactly as in a standard Transformer encoder.
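A minimal sketch of how a single sentence might be wrapped before it enters the encoder. The function name build_single_sentence_input is hypothetical; the [SEP] and [PAD] tokens and the attention mask are standard parts of BERT's input format that the text above does not spell out.

```python
def build_single_sentence_input(tokens, max_len):
    """Wrap tokens as [CLS] ... [SEP] and pad the result to max_len."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    tokens = tokens[: max_len - 2]
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]

    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(wrapped)
    padding = max_len - len(wrapped)
    wrapped += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return wrapped, attention_mask


tokens, mask = build_single_sentence_input(["help", "prince", "mayuko"], max_len=8)
print(tokens)
print(mask)
```

In a real pipeline the tokens would then be mapped to integer ids through the vocabulary file shipped with the checkpoint.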
5. Model Output
Each position outputs a hidden vector of size hidden_size (768 for BERT BASE, 1024 for BERT LARGE). For classification tasks, only the output vector at the [CLS] position is passed to a downstream classifier, typically a single‑layer neural network.
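The single-layer classifier can be sketched in plain Python as one linear layer followed by a softmax. This is an illustration under assumptions, not the reference implementation: the [CLS] vector below is random stand-in data rather than real BERT output, and the names classify_cls, weights, and bias are hypothetical.

```python
import math
import random


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_cls(cls_vector, weights, bias):
    """Apply one linear layer plus softmax to the [CLS] output vector."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


HIDDEN_SIZE = 768   # BERT BASE
NUM_CLASSES = 2     # e.g. spam vs. ham

rng = random.Random(0)
# Stand-in for the encoder's [CLS] output; real values would come from BERT.
cls_vector = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
# Untrained classifier parameters; fine-tuning would learn these.
weights = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify_cls(cls_vector, weights, bias)
print(probs)
```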
6. Comparison with Convolutional Neural Networks
For readers with a computer‑vision background, handing the [CLS] output vector to a classifier resembles the hand‑off in a CNN between the convolutional feature extractor and the fully‑connected classification layers at the end of the network.
7. The New Era of Word Embeddings
7.1 Review of Word Embeddings
Traditional methods like Word2Vec and GloVe provide static vectors that capture word semantics but ignore context.
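The limitation is easy to see in a toy lookup table. The vectors below are made up for illustration and are not real Word2Vec or GloVe values; the point is only that a static table returns the same vector for "bank" whatever the neighboring words are.

```python
# Toy static embedding table with invented 3-dimensional vectors.
STATIC_VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "bank": [0.4, 0.5, 0.3],
    "loan": [0.0, 0.2, 0.9],
}


def embed(sentence):
    """Look up one fixed vector per word, ignoring the surrounding words."""
    return [STATIC_VECTORS[word] for word in sentence.split()]


riverside = embed("river bank")
finance = embed("bank loan")

# "bank" gets the identical vector in both sentences.
print(riverside[1] == finance[0])
```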
7.2 ELMo: Context Matters
ELMo introduces contextualized embeddings by training a bidirectional LSTM on a language‑modeling objective, allowing the same word to have different vectors depending on its surrounding sentence.
7.3 ULM‑FiT: Transfer Learning in NLP
ULM‑FiT extends the idea of transfer learning to NLP, providing a recipe for pre‑training a language model and fine‑tuning it on downstream tasks.
7.4 Transformer: Beyond LSTM
Transformers replace LSTMs, handling long‑range dependencies more effectively via self‑attention.
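A minimal sketch of scaled dot-product self-attention, the core operation involved. It is deliberately simplified: a single head, no learned query/key/value projections, and plain Python lists instead of tensors. Every position attends directly to every other position, which is why distance within the sequence does not matter.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Three toy token vectors; in self-attention Q, K, and V come from the same input.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(x, x, x)
for row in attended:
    print(row)
```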
7.5 OpenAI Transformer: Pre‑training a Transformer Decoder for Language Modeling
The OpenAI Transformer stacks decoder layers only, training on massive unlabeled text (e.g., 7,000 books) to predict the next token.
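The next-token objective can be illustrated by listing the (context, target) pairs that a single sentence yields. The helper name next_token_examples is hypothetical, and a real model processes all positions in parallel with a causal mask rather than building explicit pairs.

```python
def next_token_examples(tokens):
    """Return (left context, next token) pairs for a token sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for context, target in next_token_examples(["the", "cat", "sat", "down"]):
    print(context, "->", target)
```

Each target is predicted from the tokens to its left only, which is the constraint BERT's masked language model removes.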
7.6 Transfer Learning for Downstream Tasks
After pre‑training, the model can be adapted to tasks such as email spam classification by adding a task‑specific head.
8. BERT: From Decoder to Encoder
8.1 Masked Language Model (MLM)
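BERT selects 15% of the input tokens for prediction and trains the encoder to recover the originals. In the BERT paper, 80% of the selected tokens are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. Because the model sees the context on both sides of each position, it learns bidirectional representations.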
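A sketch of the masking step, assuming the 80/10/10 replacement scheme described in the BERT paper (80% [MASK], 10% random token, 10% unchanged). The function name mask_tokens and its parameters are hypothetical, and the toy vocabulary is made up.

```python
import random


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked tokens, {position: original token}) for MLM training."""
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    if not candidates:
        return list(tokens), {}

    # Select roughly mask_prob of the non-special positions, at least one.
    num_to_predict = max(1, round(len(candidates) * mask_prob))
    chosen = sorted(rng.sample(candidates, num_to_predict))

    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # Remaining 10%: keep the original token in place.
    return masked, labels


rng = random.Random(42)
sentence = ["[CLS]"] + ["w%d" % i for i in range(20)] + ["[SEP]"]
masked, labels = mask_tokens(sentence, vocab=["apple", "run", "blue"], rng=rng)
print(masked)
print(labels)
```

The loss is computed only at the positions recorded in labels; all other positions are ignored during MLM training.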
8.2 Next Sentence Prediction Task
During pre‑training, BERT also learns to predict whether two sentences appear consecutively in the original text.
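A sketch of how training pairs for this task might be built. It assumes the 50/50 split between true and random second sentences used in the BERT paper, and it simplifies by drawing the random sentence from the same list rather than from a different document. The name make_nsp_examples is hypothetical.

```python
import random


def make_nsp_examples(sentences, rng):
    """Build (first, second, is_next) triples from an ordered sentence list."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample negatives")

    examples = []
    for i in range(len(sentences) - 1):
        first = sentences[i]
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            second, is_next = sentences[i + 1], True
        else:
            # Negative example: any sentence other than first or its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            second, is_next = rng.choice(candidates), False
        examples.append((first, second, is_next))
    return examples


rng = random.Random(0)
document = [
    "the man went to the store",
    "he bought a gallon of milk",
    "penguins are flightless birds",
    "they live in the southern hemisphere",
]
for first, second, is_next in make_nsp_examples(document, rng):
    print(is_next, "|", first, "|", second)
```

In the actual input, the pair is packed as [CLS] first [SEP] second [SEP], and the [CLS] output vector feeds the binary prediction.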
8.3 BERT Applications on Various Tasks
The original paper demonstrates BERT’s strong performance on a wide range of NLP benchmarks.
8.4 Using BERT for Feature Extraction
Beyond fine‑tuning, BERT can be used to generate contextual embeddings for downstream models, achieving results comparable to full fine‑tuning on tasks like named‑entity recognition.
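One way to turn the encoder's per-layer outputs into token features is to concatenate the last few layers, a strategy the BERT paper evaluates for named‑entity recognition. The sketch below assumes the per-layer hidden states are already available as nested lists (layer, then token, then vector); the function name concat_last_layers and the tiny hidden size are illustrative only.

```python
def concat_last_layers(layer_outputs, num_layers=4):
    """Concatenate each token's vectors from the last num_layers layers."""
    selected = layer_outputs[-num_layers:]
    num_tokens = len(selected[0])
    features = []
    for t in range(num_tokens):
        vec = []
        for layer in selected:
            vec.extend(layer[t])
        features.append(vec)
    return features


# Stand-in data: 12 layers, 2 tokens, hidden size 3. Layer k is filled with k.
fake_layers = [[[float(k)] * 3 for _ in range(2)] for k in range(12)]
features = concat_last_layers(fake_layers)
print(len(features), len(features[0]))
```

With BERT BASE, the same operation would give each token a 4 × 768 = 3072-dimensional feature vector for the downstream model.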
8.5 How to Use BERT
The recommended way to experiment with BERT is via the BERT Fine‑Tuning with Cloud TPUs notebook on Google Colab, which runs on TPU, CPU, or GPU.
Key code files include:
modeling.py – defines the class BertModel (essentially identical to a standard Transformer encoder).
run_classifier.py – provides a fine‑tuning example and contains the create_model() function for building a custom classifier.
Pre‑trained checkpoints for BERT Base, BERT Large, and multilingual models are publicly available.
Tokenization is handled by tokenization.py, which converts words into WordPiece tokens.
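WordPiece splitting can be sketched as a greedy longest-match-first search over a vocabulary, with continuation pieces marked by a "##" prefix. This is a simplified illustration rather than a copy of tokenization.py: the vocabulary is a made-up toy set, and steps such as lowercasing and punctuation splitting are omitted.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece tokens using greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```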
Sohu Tech Products
