From Word Embedding to BERT: A Comprehensive Overview of Pre‑training Model Development in NLP
This article surveys the evolution of pre‑training models for natural language processing, detailing model architectures such as Encoder‑AE, Decoder‑AR, Encoder‑Decoder, Prefix LM, and PLM, analyzing why models like RoBERTa, T5, and GPT‑3 excel, and offering practical guidance for building strong pre‑training systems.
The article opens with the rapid development of BERT and its successors since October 2018, noting the proliferation of pre‑training models (PTMs) and their dominance on NLP benchmarks, where they sometimes exceed reported human baselines.
It then outlines the basic recipe of pre‑training: a Transformer serves as the feature extractor, a self‑supervised task forces the model to learn linguistic knowledge from massive unlabeled text, and that knowledge is stored in the model parameters for reuse in downstream tasks.
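As a rough illustration of what "self‑supervised" means here, the sketch below builds training pairs for a simplified masked‑language‑model task from raw tokens, with no human labels involved. The function name `mask_tokens` and the whitespace tokenization are assumptions made for this example; BERT's actual procedure also sometimes keeps the original token or substitutes a random one instead of always inserting `[MASK]`.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified masked-language-model objective.

    labels[i] holds the original token wherever inputs[i] was masked,
    and None everywhere else (those positions contribute no loss).
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict it back
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Hypothetical usage on a whitespace-tokenized sentence.
tokens = "the model learns language knowledge from unlabeled text".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
```

The supervision signal comes entirely from the text itself: the labels are just the tokens that were hidden.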
The discussion then compares common model structures. Five architectures are described: (1) Encoder‑AE (e.g., BERT), a bidirectional encoder trained with a masked language model objective; (2) Decoder‑AR (e.g., the GPT series), a left‑to‑right autoregressive model; (3) Encoder‑Decoder (e.g., T5, BART), which pairs a bidirectional encoder with an autoregressive decoder so that one model handles both understanding and generation; (4) Prefix LM, a variant in which a single Transformer plays both the encoder and decoder roles, separated only by attention masks; and (5) Permuted Language Model (PLM), a hybrid approach used in XLNet.
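Three of these structures differ mainly in which positions each token is allowed to attend to, which can be sketched as attention masks. The function `attention_mask` and its `kind` strings are names chosen for this illustration, not taken from the article; the Encoder‑Decoder and PLM cases are left out because they need more than a single square mask.

```python
def attention_mask(kind, seq_len, prefix_len=0):
    """Build a seq_len x seq_len mask; mask[i][j] == 1 means position i may attend to j."""
    if kind == "bidirectional":
        # Encoder-AE: every position sees every other position.
        return [[1] * seq_len for _ in range(seq_len)]
    if kind == "causal":
        # Decoder-AR: position i sees only positions 0..i.
        return [[1 if j <= i else 0 for j in range(seq_len)]
                for i in range(seq_len)]
    if kind == "prefix":
        # Prefix LM: the first prefix_len positions are visible to everyone
        # (bidirectional among themselves); the remainder is causal.
        return [[1 if (j < prefix_len or j <= i) else 0 for j in range(seq_len)]
                for i in range(seq_len)]
    raise ValueError(f"unknown mask kind: {kind}")


causal = attention_mask("causal", 3)
prefix = attention_mask("prefix", 4, prefix_len=2)
```

With `prefix_len=0` the prefix mask reduces to the causal one, and with `prefix_len=seq_len` it becomes fully bidirectional, which is one way to see Prefix LM as sitting between the other two.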
Experimental evidence shows that Encoder‑Decoder structures achieve the best performance on both understanding and generation tasks, though their advantage may stem from larger parameter counts. For lighter models, Encoder‑AE excels on understanding tasks, while Decoder‑AR and Prefix LM are preferable for generation.
The article then examines why certain models outperform others, identifying four key factors: more high‑quality pre‑training data, greater model capacity, more thorough training (larger batch sizes and more training steps), and more challenging pre‑training objectives such as span masking and sentence‑order prediction.
It highlights the importance of incorporating external knowledge, both structured knowledge graphs (as in ERNIE) and multimodal data (e.g., text‑image pairs), and describes typical multimodal pre‑training architectures (dual‑stream and single‑stream Transformers) and their training objectives.
Finally, the author proposes a four‑stage training pipeline: (1) generic large‑scale pre‑training; (2) domain‑specific pre‑training, where care is needed to mitigate catastrophic forgetting; (3) task‑level pre‑training on unlabeled task data; and (4) fine‑tuning on labeled data. The article also recommends suitable model structures and training strategies for both understanding and generation tasks.
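To make the span‑masking objective concrete, here is a minimal sketch loosely modeled on T5‑style span corruption: each chosen span is replaced by a single sentinel in the input, and the target lists each sentinel followed by the tokens it hides. The function name `corrupt_spans`, the hand‑picked spans, and the `<extra_id_k>` sentinel strings are assumptions for illustration; real implementations sample spans randomly and may append a closing sentinel to the target.

```python
def corrupt_spans(tokens, spans):
    """Replace each span with one sentinel token and collect the hidden tokens.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"          # illustrative sentinel naming
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # whole span collapses to one symbol
        targets.append(sentinel)
        targets.extend(tokens[start:end])     # model must regenerate the span
        cursor = end
    inputs.extend(tokens[cursor:])            # keep text after the last span
    return inputs, targets


tokens = "thank you for inviting me to your party last week".split()
inputs, targets = corrupt_spans(tokens, [(2, 4), (8, 9)])
# inputs  -> thank you <extra_id_0> me to your party <extra_id_1> week
# targets -> <extra_id_0> for inviting <extra_id_1> last
```

Because the model sees only one sentinel per span, it has to infer both the content and the length of what is missing, which is part of why this is harder than recovering single masked tokens.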
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
