
Understanding BERT: Architecture, Pre‑training, Fine‑tuning and Applications in Modern NLP

This article provides a comprehensive overview of BERT and related NLP advances, covering its historical context, model architecture, input‑output mechanisms, comparisons with CNNs, word‑embedding evolution, pre‑training strategies like MLM and next‑sentence prediction, and practical guidance for fine‑tuning and feature extraction.

Sohu Tech Products

Nov 4, 2020


1. Introduction

2018 marked a turning point for machine‑learning models that process text, with rapid progress in representing words and sentences to capture semantics and relationships. The NLP community released powerful, freely available components that can be plugged into pipelines, akin to an "ImageNet moment" for NLP.

ULM‑FiT and other breakthroughs paved the way, and the release of BERT was a major milestone: it set new records on several language benchmarks, and its code and large‑scale pre‑trained models were made publicly available.

2. Example: Sentence Classification

The most direct way to use BERT is to classify a single piece of text. The typical architecture places a classifier on top of the pre‑trained BERT encoder. Training mostly updates the classifier while making only small adjustments to BERT's pre‑trained weights, a process called fine‑tuning.

This approach is supervised learning, so it requires a labeled dataset, such as a collection of emails marked as spam or not spam. Other examples of the same setup include:

Sentiment analysis: input a product or movie review, output positive or negative sentiment (e.g., the SST dataset).

Fact‑checking: input a sentence, output whether or not it makes a factual claim.

3. Model Architecture

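BERT is built on the Transformer encoder stack. The paper describes two model sizes:

BERT BASE – 12 encoder layers, hidden size 768, and 12 attention heads; comparable in size to the original OpenAI Transformer.

BERT LARGE – 24 encoder layers, hidden size 1024, and 16 attention heads; a much larger model that achieves state‑of‑the‑art results.

Both variants stack many encoder layers (Transformer blocks) and use larger hidden dimensions and more attention heads than the default configuration in the original Transformer paper (6 encoder layers, 512 hidden units, 8 attention heads).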
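To make the size difference concrete, the following sketch estimates the number of trainable parameters from a configuration. The layer, hidden-size, and head counts are those reported in the BERT paper; the vocabulary size (30522), maximum sequence length (512), and the 4x feed-forward width are assumptions taken from the commonly used uncased English checkpoints, and BertConfig and count_parameters are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    num_layers: int
    hidden_size: int
    num_heads: int               # does not change the parameter count
    vocab_size: int = 30522      # assumption: uncased English WordPiece vocabulary
    max_positions: int = 512     # assumption: maximum sequence length
    type_vocab_size: int = 2     # segment A / segment B

    @property
    def intermediate_size(self) -> int:
        # Assumption: the feed-forward inner layer is 4x the hidden size.
        return 4 * self.hidden_size


def count_parameters(cfg: BertConfig) -> int:
    """Estimate the trainable parameters of an encoder-only BERT model."""
    h = cfg.hidden_size
    layer_norm = 2 * h  # scale and bias
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h)  # query, key, value, and output projections
    feed_forward = (h * cfg.intermediate_size + cfg.intermediate_size) + (
        cfg.intermediate_size * h + h
    )
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = h * h + h  # dense layer applied to the [CLS] output
    return embeddings + cfg.num_layers * per_layer + pooler


BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)

for name, cfg in [("BERT BASE", BERT_BASE), ("BERT LARGE", BERT_LARGE)]:
    print(f"{name}: about {count_parameters(cfg) / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land close to the roughly 110M and 340M parameters that the paper reports for the two sizes.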

4. Model Input

The first input token is the special [CLS] token, whose final hidden state serves as the aggregate representation for classification. As in a standard Transformer encoder, each layer applies self‑attention followed by a feed‑forward network and passes its result to the next layer.
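As a rough illustration of the input layout, the sketch below assembles a token sequence that starts with [CLS]. The [SEP] separator and the segment ids are not described in this article; they follow the convention of the BERT paper, and build_inputs is a hypothetical helper name.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) for one sentence or a sentence pair."""
    tokens_a = list(tokens_a)
    # [CLS] always comes first; [SEP] closes the first segment.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens_b = list(tokens_b)
        # The second segment gets id 1 and its own closing [SEP].
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_inputs(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
print(tokens)
print(segment_ids)
```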

5. Model Output

Each position outputs a hidden vector of size hidden_size (768 for BERT BASE). For classification tasks, only the [CLS] output vector is used as input to a downstream classifier, typically a single‑layer neural network.
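The single-layer classifier mentioned above can be sketched in plain Python as a linear layer followed by a softmax. The 4-dimensional vector, the weights, and the function names are toy assumptions chosen for readability; a real [CLS] vector for BERT BASE would have 768 entries and the weights would be learned during fine-tuning.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_cls(cls_vector, weights, bias):
    """Apply one linear layer plus softmax to the [CLS] output vector.

    weights has one row per class; bias has one entry per class.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy example: two classes ("spam", "not spam") and a 4-dimensional vector.
cls_vector = [1.0, 0.0, -1.0, 0.5]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
probabilities = classify_cls(cls_vector, weights, bias)
print(probabilities)
```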

6. Comparison with Convolutional Neural Networks

Handing the [CLS] output vector to a classifier resembles the hand‑off in a CNN, where the convolutional layers extract features and a fully‑connected classifier at the end of the network consumes them.

7. The New Era of Word Embeddings

7.1 Review of Word Embeddings

Traditional methods like Word2Vec and GloVe provide static vectors that capture word semantics but ignore context.
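The limitation is easy to see in a toy lookup table. The vectors below are invented for illustration and do not come from Word2Vec or GloVe; the point is only that a static embedding returns the same vector for a word no matter which sentence it appears in.

```python
# Toy 3-dimensional vectors; the values are invented for illustration only.
STATIC_VECTORS = {
    "river": [0.9, 0.1, 0.0],
    "bank": [0.5, 0.5, 0.1],
    "account": [0.0, 0.9, 0.2],
}


def embed_static(sentence):
    """Look up one fixed vector per word, ignoring the surrounding words."""
    return [STATIC_VECTORS[word] for word in sentence.lower().split()]


river_bank = embed_static("river bank")
bank_account = embed_static("bank account")

# "bank" receives the same vector in both sentences, even though the
# meaning differs (riverside vs. financial institution).
same_vector = river_bank[1] == bank_account[0]
print(same_vector)
```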

7.2 ELMo: Context Matters

ELMo introduces contextualized embeddings by training a bidirectional LSTM on a language‑modeling objective, allowing the same word to have different vectors depending on its surrounding sentence.

7.3 ULM‑FiT: Transfer Learning in NLP

ULM‑FiT extends the idea of transfer learning to NLP, providing a recipe for pre‑training a language model and fine‑tuning it on downstream tasks.

7.4 Transformer: Beyond LSTM

Transformers replace LSTMs, handling long‑range dependencies more effectively via self‑attention.
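A minimal sketch of the scaled dot-product attention at the heart of self-attention is shown below. It is a simplification: a real Transformer layer first derives queries, keys, and values from the same token vectors through learned projections and runs several attention heads in parallel, both of which are omitted here. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query as a weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Every position contributes directly, however far away it is.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# In self-attention, queries, keys, and values all come from the same sequence.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = scaled_dot_product_attention(sequence, sequence, sequence)
print(contextual)
```

Because each position attends to every other position in a single step, distant tokens interact without passing information through a long chain of recurrent states.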

7.5 OpenAI Transformer: Pre‑training a Transformer Decoder for Language Modeling

The OpenAI Transformer stacks decoder layers only, training on massive unlabeled text (e.g., 7,000 books) to predict the next token.
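The next-token objective can be sketched as follows: every prefix of a sentence becomes a training example whose target is the token that follows it, and a causal mask keeps each position from looking at later tokens. The helper names are hypothetical and the example works on a tiny hand-written token list.

```python
def next_token_examples(tokens):
    """Turn a token sequence into (context, target) pairs for language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def causal_mask(length):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


tokens = ["the", "cat", "sat", "down"]
for context, target in next_token_examples(tokens):
    print(context, "->", target)

for row in causal_mask(len(tokens)):
    print(row)
```

The lower-triangular mask is what makes a decoder-only model unidirectional, which is the limitation that BERT's masked language model is designed to remove.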

7.6 Transfer Learning for Downstream Tasks

After pre‑training, the model can be adapted to tasks such as email spam classification by adding a task‑specific head.

8. BERT: From Decoder to Encoder

8.1 Masked Language Model (MLM)

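BERT masks 15% of the input tokens and trains the encoder to predict the original tokens at those positions. Because each prediction can draw on words to both the left and the right of the masked position, the model learns bidirectional context.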
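The masking step can be sketched as below. The 80/10/10 split (replace with [MASK], replace with a random token, keep unchanged) is not stated in this article; it follows the procedure described in the BERT paper. The function name, the toy vocabulary, and the fixed seed are assumptions, and the reference implementation differs in details such as capping the number of predictions per sequence.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    if not candidates:
        return list(tokens), {}
    num_to_predict = min(len(candidates), max(1, round(len(candidates) * mask_prob)))

    masked = list(tokens)
    labels = {}
    for i in sorted(rng.sample(candidates, num_to_predict)):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", "to", "buy", "milk", "[SEP]"]
toy_vocab = ["the", "man", "went", "to", "store", "buy", "milk", "dog", "ran"]
masked, labels = mask_tokens(tokens, toy_vocab)
print(masked)
print(labels)
```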

8.2 Next Sentence Prediction Task

During pre‑training, BERT also learns to predict whether two sentences appear consecutively in the original text.
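Training pairs for this task can be generated along the following lines. The 50/50 split between true next sentences and random sentences follows the BERT paper rather than this article, and make_nsp_pairs is a hypothetical helper that assumes at least two non-empty documents.

```python
import random


def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples from a list of documents.

    Each document is a list of sentences. Assumes at least two non-empty documents.
    """
    rng = random.Random(seed)
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that actually follows.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from a different document.
                others = [k for k in range(len(documents)) if k != doc_index]
                other_doc = documents[rng.choice(others)]
                pairs.append((doc[i], rng.choice(other_doc), False))
    return pairs


documents = [
    ["a1 the man went to the store", "a2 he bought a gallon of milk", "a3 then he went home"],
    ["b1 penguins are flightless birds", "b2 they live in the southern hemisphere"],
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(documents):
    print(is_next, "|", sentence_a, "|", sentence_b)
```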

8.3 BERT Applications on Various Tasks

The original paper demonstrates BERT’s strong performance on a wide range of NLP benchmarks.

8.4 Using BERT for Feature Extraction

Beyond fine‑tuning, BERT can be used to generate contextual embeddings for downstream models, achieving results comparable to full fine‑tuning on tasks like named‑entity recognition.
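One way to build such features is to combine the hidden states of several layers for each token. The sketch below concatenates the last four layers, a choice borrowed from the feature-based experiments in the BERT paper rather than from this article. The nested-list layout hidden_states[layer][token] and the function name are assumptions made for illustration; a real model would supply the hidden states.

```python
def concat_last_layers(hidden_states, token_index, num_layers=4):
    """Concatenate one token's vectors from the last num_layers layers.

    hidden_states[layer][token] is that token's vector at that layer.
    """
    features = []
    for layer in hidden_states[-num_layers:]:
        features.extend(layer[token_index])
    return features


# Toy stand-in for model output: 5 layers, 2 tokens, 2-dimensional vectors.
# Each vector is [layer_number, token_number] so the result is easy to read.
hidden_states = [
    [[float(layer), float(token)] for token in range(2)]
    for layer in range(5)
]
features = concat_last_layers(hidden_states, token_index=1)
print(features)
```

The resulting vector is then fed to a separate downstream model, for example a sequence tagger for named-entity recognition, while the BERT weights stay frozen.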

8.5 How to Use BERT

A convenient way to experiment with BERT is the "BERT Fine‑Tuning with Cloud TPUs" notebook on Google Colab. The BERT code itself runs on TPUs, CPUs, and GPUs.

Key code files include:

modeling.py – defines the BertModel class, which is essentially a standard Transformer encoder.

run_classifier.py – provides a fine‑tuning example and contains the create_model() function, the place to look when building a custom classifier.

Pre‑trained checkpoints for BERT Base, BERT Large, and multilingual models are publicly available.

Tokenization is handled by tokenization.py, which converts words into WordPiece tokens.
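The idea behind WordPiece splitting can be sketched with a greedy longest-match-first search over a vocabulary, where pieces that continue a word carry a ## prefix. This is a simplified illustration with a toy vocabulary, not the code from tokenization.py; the function name and the [UNK] fallback are assumptions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "play", "##ing", "the"}
for word in ["unaffable", "playing", "the", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Splitting rare words into known pieces keeps the vocabulary small while still letting the model represent words it never saw as whole units.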
