
A Survey of Transfer Learning and Model Pre‑training Techniques for Natural Language Processing

This article reviews the taxonomy of transfer learning in NLP, summarizes representative pre‑training models such as ELMo, ULMFiT, BERT, GPT, MASS and UNILM, discusses their strengths and limitations, and provides practical recommendations for applying these techniques in real‑world projects.


The rapid progress of NLP in 2018 was driven by large‑scale unsupervised pre‑training, which turned the field from a data‑starved discipline into one where massive text corpora can be leveraged to learn universal language representations. The author first outlines a taxonomy of transfer learning along two axes: whether the target task T has labeled data, and whether the source task S is supervised or unsupervised. The combinations the article examines are: purely unsupervised transfer (rarely used in industry); transfer into a supervised target task via self‑supervised learning, multi‑task learning, or sequential unsupervised pre‑training; and transfer from a supervised source task.

Among these, the sequential unsupervised pre‑training paradigm (e.g., ELMo, ULMFiT, BERT) is identified as the most promising because it can exploit a virtually unlimited amount of raw text. Self‑supervised learning (as in CVT) also shows benefits, especially in multi‑task settings, but its impact is limited compared with full‑scale pre‑training.

The article then surveys several representative works:

CoVe: a supervised, translation‑based approach that provides contextual word vectors via a BiLSTM encoder trained on parallel data.

CVT (Cross‑View Training): combines supervised training with auxiliary self‑supervised modules that consume partial views of the input, improving stability and multi‑task performance.

ELMo: trains a two‑layer bidirectional LSTM language model (biLM) on unlabeled text; for downstream tasks the frozen biLM serves as a feature extractor, with its contextual representations fed into task‑specific models.

ULMFiT & SiATL: adopt a three‑step pipeline (unsupervised LM pre‑training, task‑specific LM fine‑tuning, classifier fine‑tuning) with strategies such as discriminative learning rates, gradual unfreezing and slanted triangular learning rates. SiATL merges the second and third steps and adds an auxiliary LM loss, proving especially effective when only 1%–10% of the labeled data are available.
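
The slanted triangular learning rate that ULMFiT uses for fine‑tuning can be written down directly from its published formula: a short linear warm‑up over the first `cut_frac` of training, then a long linear decay back toward `lr_max / ratio`. A minimal sketch (the function name and default values are illustrative, not from the article):

```python
def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule from ULMFiT: linear warm-up for the
    first cut_frac of steps, then linear decay. `ratio` controls how much
    smaller the lowest learning rate is compared to lr_max."""
    cut = int(total_steps * cut_frac)          # step at which the peak is reached
    if t < cut:
        p = t / cut                            # warm-up phase: p rises 0 -> 1
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase: p falls 1 -> 0
    return lr_max * (1 + p * (ratio - 1)) / ratio
```

The peak `lr_max` is hit exactly at the end of the warm‑up, which is what lets fine‑tuning adapt quickly early on and then settle gently.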

GPT / GPT‑2: introduce Transformer‑based unidirectional language models; GPT‑2 scales the model to 48 layers (≈1.5 B parameters) and demonstrates strong zero‑shot performance on many tasks.

BERT: employs a 12‑ or 24‑layer Transformer encoder trained with Masked LM (masking 15% of tokens) and Next Sentence Prediction (NSP), establishing new state‑of‑the‑art results across 11 NLP benchmarks.
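
The Masked LM corruption step is simple to sketch: select roughly 15% of positions, then replace 80% of the selected tokens with [MASK], 10% with a random token, and leave 10% unchanged. The toy vocabulary and function name below are illustrative stand‑ins, not code from the paper:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style Masked LM corruption. Each position is selected with
    probability mask_prob; a selected token becomes [MASK] 80% of the
    time, a random vocabulary token 10% of the time, and stays unchanged
    10% of the time. Returns (corrupted, labels), where labels hold the
    original token at selected positions and None elsewhere."""
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the token as-is; the model must still predict it
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged forces the model to build a representation for every input token, since it cannot tell which positions carry a prediction loss.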

MASS: extends BERT to sequence‑to‑sequence tasks by masking a contiguous span of k tokens and training an encoder‑decoder Transformer; when k=1 it reduces to BERT, and when k equals the sentence length it becomes GPT.
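
The span‑masking corruption that makes this continuum work can be sketched in a few lines; `mass_corrupt` and its arguments are illustrative names, not the paper's code:

```python
def mass_corrupt(tokens, k, start=0):
    """MASS-style corruption: mask a contiguous span of k tokens
    beginning at `start`. The encoder sees the sentence with the span
    replaced by [MASK] symbols; the decoder is trained to generate
    the masked span itself."""
    enc_input = tokens[:start] + ["[MASK]"] * k + tokens[start + k:]
    dec_target = tokens[start:start + k]
    return enc_input, dec_target
```

Setting k=1 leaves a single masked position (BERT‑like), while k equal to the sentence length gives the decoder the whole sentence to generate with no encoder context (GPT‑like).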

UNILM: jointly trains bidirectional, unidirectional and sequence‑to‑sequence language‑modeling objectives within a single Transformer using a unified mask mechanism.
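
The unified mask mechanism amounts to choosing, per objective, which positions each token may attend to within one shared Transformer. A sketch of the three patterns (the function name and the 0/1 matrix representation are illustrative):

```python
def unilm_mask(src_len, tgt_len, mode):
    """Attention masks for UNILM's three objectives over a concatenated
    [source; target] sequence of length src_len + tgt_len.
    mode='bidir':      every token attends to every token (BERT-style).
    mode='left2right': token i attends only to positions j <= i (GPT-style).
    mode='seq2seq':    source tokens attend bidirectionally within the
                       source; target tokens attend to the full source
                       plus earlier target positions.
    Returns an n x n matrix; mask[i][j] == 1 means i may attend to j."""
    n = src_len + tgt_len

    def allowed(i, j):
        if mode == "bidir":
            return True
        if mode == "left2right":
            return j <= i
        if mode == "seq2seq":
            return j < src_len or (i >= src_len and j <= i)
        raise ValueError(f"unknown mode: {mode}")

    return [[int(allowed(i, j)) for j in range(n)] for i in range(n)]
```

Because only the mask differs, the same parameters serve understanding tasks (bidirectional) and generation tasks (unidirectional and seq2seq).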

MT‑DNN: combines BERT with multi‑task learning, sharing the BERT encoder across several GLUE tasks and achieving further gains.

Practical advice emphasizes that for Chinese NLU tasks BERT is the most convenient choice due to publicly available pretrained models, while ELMo and ULMFiT require custom pre‑training. For classification with a few thousand labeled examples, fine‑tuning BERT directly is recommended; for larger datasets or latency‑sensitive applications, extracting BERT embeddings and training a lightweight classifier (or using a Siamese architecture) can reduce inference time.
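
The embedding‑extraction route in that last recommendation can be sketched independently of any particular model: pool token‑level vectors into one fixed‑size sentence vector per input, then compare or classify those cheap vectors. Below, plain Python lists stand in for real BERT hidden states, and a Siamese‑style setup would score a sentence pair with cosine similarity; all names are illustrative:

```python
import math

def mean_pool(token_vecs):
    """Average token-level vectors (e.g., a model's last hidden states)
    into one fixed-size sentence vector."""
    n = len(token_vecs)
    dim = len(token_vecs[0])
    return [sum(v[d] for v in token_vecs) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two sentence vectors, as a Siamese
    architecture would use to score a pair cheaply at inference time."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The latency win comes from encoding each sentence once: pair scoring then costs only a dot product instead of a full forward pass over the concatenated pair.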

The author concludes that future NLP research will likely focus on optimizing the pre‑training pipeline (larger models, better objectives, more diverse unsupervised data) and on integrating structured knowledge into pretrained models, while noting that current successes suggest much of linguistic knowledge can be captured in non‑explicit, black‑box representations.

Tags: NLP, transfer learning, BERT, multi‑task learning, ELMo, model pre‑training
Written by High Availability Architecture

Official account for High Availability Architecture.