How BERT, GPT, and ELMo Revolutionize Language Feature Representation
Natural language processing, a cornerstone of AI, relies on language models to capture linguistic features; this article reviews classic pre‑training models—ELMo, GPT, and BERT—explaining their architectures, training objectives, and how they boost downstream NLP tasks despite data‑scarcity challenges.
Introduction
Natural language processing (NLP) holds a pivotal role in artificial intelligence; the ability to understand and generate human language was highlighted as a core component of the Turing Test proposed in 1950. Both academia and industry have produced significant advances, from Google Search to Apple Siri and Microsoft XiaoIce, all benefiting from NLP research.
NLP tasks are divided into core tasks—such as language modeling, morphology, syntactic parsing, and semantic analysis—and application tasks like machine translation, information retrieval, question answering, and dialogue systems. Recent breakthroughs in deep learning have propelled NLP forward, yet deep models demand massive, expensive labeled data. Transfer learning approaches, exemplified by the 2018 BERT model, mitigate data scarcity by pre‑training on large corpora.
Problem
What is the task form of language models?
How do language models help improve the performance of other NLP tasks?
Analysis and Answers
Language models are central to NLP because they estimate the probability of generating a sequence of words, enabling the production of human‑like text. Since training does not require external supervision, language models can learn universal semantic representations. The prevailing approach pre‑trains a neural network on massive unlabeled text and fine‑tunes it for specific downstream tasks. Representative models include ELMo, GPT, and BERT.
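To make the task form concrete, here is a minimal sketch of what "estimating the probability of a word sequence" means, using a toy count‑based bigram model and the chain rule. This is an illustration of the objective only, not a neural language model; the tiny `corpus` string is an invented example.

```python
from collections import Counter

# A toy corpus; a real language model trains on billions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Maximum-likelihood counts for unigrams and adjacent word pairs.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """P(word | prev), estimated from counts in the toy corpus."""
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(words):
    """Chain rule: P(w1..wn) approximated as the product of P(w_i | w_{i-1})."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# "the" is followed by "cat" in 1 of its 4 occurrences, so P(cat|the) = 0.25,
# and "cat" is always followed by "sat", so the sequence score is 0.25.
print(sequence_prob("the cat sat".split()))  # → 0.25
```

Neural language models such as those in ELMo, GPT, and BERT replace the counts with a learned network, but the underlying objective, assigning a probability to the next (or masked) word given its context, is the same.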
ELMo, GPT, and BERT share a similar high‑level idea: a large pre‑trained model provides contextual embeddings that can be adapted to downstream tasks. ELMo, the earliest of the three, pre‑trains forward and backward LSTM language models and combines their hidden states into contextual word representations that downstream task models consume as features. Figure 1 illustrates the three architectures, where the yellow input layer represents word or sentence embeddings, the blue middle layers denote the core network, and the green output layer corresponds to predicted words or classification labels.
GPT is based on a unidirectional Transformer decoder, in which causal self‑attention lets each position attend only to earlier tokens. It is pre‑trained with a forward (left‑to‑right) language‑modeling objective, and the learned parameters serve as initialization for downstream tasks such as classification, sequence labeling, and sentence‑pair judgment. By keeping the language model as an auxiliary objective during fine‑tuning, GPT improves generalization and accelerates convergence.
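The unidirectionality comes from the causal attention mask: when scoring position i, all positions after i are excluded before the softmax. A minimal sketch of that mechanism (the score values here are arbitrary illustrative numbers, not outputs of a real model):

```python
import math

def causal_mask(n):
    # Lower-triangular mask: position i may attend only to positions 0..i,
    # so the model can never "peek" at future tokens.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    # Disallowed positions contribute zero weight (equivalent to setting
    # their scores to -infinity before the softmax).
    exps = [math.exp(s) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Row 1: the second token attends to the first two positions only;
# the scores for positions 2 and 3 are ignored entirely.
weights = masked_softmax([0.5, 1.0, 2.0, 0.3], mask[1])
```

BERT's masked language model drops this triangular constraint and instead hides the target tokens themselves, which is what allows it to use context from both directions.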
BERT, introduced by Google in late 2018, achieved state‑of‑the‑art results on 11 NLP benchmarks. Its innovations are twofold:
1. True bidirectional modeling via a masked language model (MLM), which randomly masks 15% of the input tokens and trains the model to predict them, so that both left and right context can be used without leaking the target word.
2. Explicit sentence‑level relationship modeling: a special [CLS] token is prepended for classification and a [SEP] token separates sentence pairs, enabling objectives such as next‑sentence prediction.
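The MLM objective described above can be sketched in a few lines. This simplified version replaces every selected token with [MASK]; the actual BERT recipe also sometimes substitutes a random token or keeps the original, a detail omitted here. The sample sentence and fixed seed are illustrative choices.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly hide ~15% of tokens behind [MASK] and return both the
    corrupted sequence and the (position -> original token) targets the
    model is trained to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = "[MASK]"
    return masked, targets

# 14 tokens, so round(14 * 0.15) = 2 positions get masked.
tokens = "the quick brown fox jumps over the lazy dog near the old oak tree".split()
masked, targets = mask_tokens(tokens)
```

Because the target word is removed from the input, the model can freely condition on both the left and right context of each masked position without the information leakage a bidirectional ordinary language model would suffer.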
During fine‑tuning, BERT retains its pre‑trained architecture and adapts the [CLS] embedding for downstream classification, often requiring fewer architectural changes than GPT and delivering stronger generalization.
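The fine‑tuning head over the [CLS] embedding is typically just a linear layer followed by a softmax or sigmoid. A minimal sketch for binary classification, with hand‑picked weights standing in for parameters that would normally be learned during fine‑tuning:

```python
import math

def classify_from_cls(cls_embedding, weights, bias):
    # Linear layer + sigmoid over the [CLS] vector: the standard
    # fine-tuning head for sentence-level binary classification.
    logit = sum(w * x for w, x in zip(weights, cls_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# With a zero [CLS] vector the logit is 0, so the head is maximally
# uncertain and outputs probability 0.5.
p = classify_from_cls([0.0, 0.0, 0.0], [0.5, -0.2, 0.1], bias=0.0)
```

Because only this small head is new, fine‑tuning mostly adjusts the pre‑trained weights rather than learning a task architecture from scratch, which is why BERT needs fewer task‑specific modifications than earlier feature‑based approaches.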
In summary, while ELMo, GPT, and BERT share the pre‑training‑then‑fine‑tuning paradigm, they differ in directionality, training objectives, and sentence‑level modeling, leading to distinct strengths for various NLP applications.
References
DEERWESTER S, DUMAIS S T, FURNAS G W, et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6): 391–407.
PENNINGTON J, SOCHER R, MANNING C. GloVe: Global vectors for word representation. EMNLP, 2014: 1532–1543.
MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013: 3111–3119.
JOULIN A, GRAVE E, BOJANOWSKI P, et al. Bag of Tricks for Efficient Text Classification. EMNLP, 2017: 427–431.
BENGIO Y, DUCHARME R, VINCENT P, et al. A neural probabilistic language model. JMLR, 2003, 3(Feb): 1137–1155.
PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre‑training. 2018.
DEVLIN J, CHANG M‑W, LEE K, et al. BERT: Pre‑training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Hulu Beijing