Tracing the Evolution of Language Models: From N‑grams to GPT‑2

This article reviews the historical development of language models in natural language processing. It covers expert rule‑based systems, statistical n‑grams and smoothing techniques, neural models such as NNLM, RNN, word2vec, GloVe and ELMo, and the transformer‑based breakthroughs of GPT, BERT and GPT‑2, and it summarizes their impact on modern NLP tasks.

Alibaba Cloud Developer

Language Model Overview

A language model essentially answers one question: how plausible is a given sentence? Formally, it assigns a probability to a sequence of words. Historically, language models have progressed from expert‑written grammar rules (up to the 1980s), to statistical language models (around 2000), and finally to neural network language models (the current approach).

Statistical Language Models

Statistical language models estimate the probability of a sentence from counts gathered over a large corpus. Conditioning each word on its entire history makes the parameter space enormous and the counts extremely sparse, which motivates the n‑gram approximation.
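As a brief illustration of where the parameter explosion comes from, the sentence probability can be written with the chain rule and then truncated with a Markov assumption. The symbols w_t (the t-th word), T (sentence length) and n (the n‑gram order) are notation introduced here, not taken from the original article.

```latex
P(w_1, \dots, w_T)
  = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
  \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```

The exact factorization on the left needs a separate distribution for every possible history; the approximation on the right keeps only the previous n-1 words.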

n‑gram

n‑gram models estimate the probability of a word given the previous n‑1 words. Common variants include unigram, bigram, and trigram. While simple and interpretable, they struggle with long‑range dependencies and sparsity as n grows.
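A minimal sketch of a bigram model with maximum-likelihood estimates may make this concrete. The toy corpus, the boundary markers `<s>` and `</s>`, and all function names are assumptions chosen for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent (previous, current) pairs
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(tokens, unigrams, bigrams):
    """Multiply the conditional probabilities of each word given its predecessor."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob

# Hypothetical toy corpus.
corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
unigrams, bigrams = train_bigram(corpus)
seen = sentence_prob(["i", "like", "tea"], unigrams, bigrams)
unseen = sentence_prob(["tea", "like", "i"], unigrams, bigrams)
```

In this sketch any sentence containing a bigram that never occurred in the corpus receives probability zero, which is the sparsity problem in miniature.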

Smoothing

Smoothing addresses zero‑probability issues caused by data sparsity. Common methods include Laplace (add‑one) smoothing, additive smoothing (generalized Laplace), Good‑Turing smoothing, and others.
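Additive smoothing can be sketched in a few lines. The counts, the three-word vocabulary and the function name below are hypothetical; setting `k=1.0` gives Laplace (add-one) smoothing, and smaller `k` gives the generalized form.

```python
from collections import Counter

def additive_bigram_prob(prev, word, unigrams, bigrams, vocab_size, k=1.0):
    """P(word | prev) = (count(prev, word) + k) / (count(prev) + k * V)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

# Hypothetical toy counts: "like" was seen 3 times, never followed by "milk".
bigrams = Counter({("like", "tea"): 2, ("like", "coffee"): 1})
unigrams = Counter({"like": 3})
vocab = ["tea", "coffee", "milk"]

probs = {
    w: additive_bigram_prob("like", w, unigrams, bigrams, len(vocab))
    for w in vocab
}
```

The unseen pair ("like", "milk") now gets a small non-zero probability, and the distribution over the vocabulary still sums to one.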

Neural Network Language Models (2003)

Neural network language models (NNLM) replace discrete word representations with continuous word embeddings, reducing dimensionality and capturing similarity between words. The 2003 NNLM computes the conditional probability of the next word with a feed‑forward network over a fixed‑size context window; later variants use recurrent networks instead.
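The forward pass of a feed-forward language model can be sketched as follows. This is a simplified illustration under stated assumptions: the weights are random and untrained, biases and any direct input-to-output connections are omitted, and the class name `TinyNNLM` and all dimensions are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyNNLM:
    """Embedding lookup -> concatenation -> tanh hidden layer -> softmax."""

    def __init__(self, vocab_size, embed_dim, context_size, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.embeddings = random_matrix(vocab_size, embed_dim, rng)
        self.hidden_weights = random_matrix(hidden_dim, context_size * embed_dim, rng)
        self.output_weights = random_matrix(vocab_size, hidden_dim, rng)

    def next_word_distribution(self, context_ids):
        # Concatenate the embeddings of the previous n-1 words.
        x = [v for i in context_ids for v in self.embeddings[i]]
        h = [math.tanh(a) for a in matvec(self.hidden_weights, x)]
        # One score per vocabulary word, normalized into probabilities.
        return softmax(matvec(self.output_weights, h))

model = TinyNNLM(vocab_size=5, embed_dim=3, context_size=2, hidden_dim=4)
dist = model.next_word_distribution([1, 3])
```

Because words enter the network only through their embedding rows, words with similar embeddings produce similar predictions, which is how the model shares statistical strength across contexts.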

RNN Language Models (2010)

RNNLMs capture longer context by maintaining a hidden state that summarizes all previous words. This improves over n‑gram models, but training still suffers from the vanishing‑gradient problem, which limits how far back the model can effectively look.
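The recurrence behind that hidden state can be sketched with a plain tanh update. The weights below are random and untrained, the output layer is left out, and all names and dimensions are assumptions for illustration.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_step(x, h_prev, w_xh, w_hh):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1})."""
    a = matvec(w_xh, x)
    b = matvec(w_hh, h_prev)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def encode(token_ids, embeddings, w_xh, w_hh, hidden_dim):
    """Run the recurrence over a sequence and return the final hidden state."""
    h = [0.0] * hidden_dim
    for t in token_ids:
        h = rnn_step(embeddings[t], h, w_xh, w_hh)
    return h

rng = random.Random(0)
embed_dim, hidden_dim, vocab_size = 3, 4, 5
embeddings = random_matrix(vocab_size, embed_dim, rng)
w_xh = random_matrix(hidden_dim, embed_dim, rng)
w_hh = random_matrix(hidden_dim, hidden_dim, rng)

# Two sequences that differ only in their first token.
state_a = encode([0, 2, 3], embeddings, w_xh, w_hh, hidden_dim)
state_b = encode([1, 2, 3], embeddings, w_xh, w_hh, hidden_dim)
```

The two final states differ even though the last two tokens are identical, so the state carries information about the whole prefix rather than a fixed window.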

Word2Vec (2013) – CBOW & Skip‑gram

Word2Vec learns word embeddings using shallow neural networks. CBOW predicts a target word from its context, while Skip‑gram predicts surrounding words from a target word. Training optimizations include hierarchical softmax and negative sampling.
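The difference between the two objectives is easiest to see in the training examples they generate. The sketch below only builds the examples (no training); the function names and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context_word) pair per neighbor in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
```

With a window of 1, the position of "cat" yields the skip-gram pairs ("cat", "the") and ("cat", "sat"), while CBOW yields the single example (["the", "sat"], "cat").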

Hierarchical Softmax

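Hierarchical softmax replaces the full softmax with a Huffman tree whose leaves are the vocabulary words. Each prediction evaluates and updates only the nodes along the path to the target word, reducing the cost from roughly O(V) to O(log V) for a vocabulary of size V.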
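A small sketch of the idea follows: build a Huffman tree from word frequencies, then compute a word's probability as a product of binary decisions along its path. The frequencies, the hidden vector and the inner-node vectors are hypothetical values, and the convention that bit 0 uses sigmoid(score) is an assumption of this sketch.

```python
import heapq
import itertools
import math

def build_huffman_paths(freqs):
    """Return {word: [(inner_node_id, bit), ...]} for each root-to-leaf path."""
    tie = itertools.count()  # tie-breaker so heap entries never compare nodes
    heap = [(f, next(tie), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    next_inner = 0
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        node = ("inner", next_inner, left, right)
        next_inner += 1
        heapq.heappush(heap, (f1 + f2, next(tie), node))
    paths = {}

    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, node_id, left, right = node
            walk(left, path + [(node_id, 0)])
            walk(right, path + [(node_id, 1)])

    walk(heap[0][2], [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_prob(word, hidden, node_vectors, paths):
    """Product of binary decisions at the inner nodes on the word's path."""
    prob = 1.0
    for node_id, bit in paths[word]:
        score = sum(h * v for h, v in zip(hidden, node_vectors[node_id]))
        prob *= sigmoid(score) if bit == 0 else 1.0 - sigmoid(score)
    return prob

freqs = {"the": 50, "cat": 10, "sat": 8, "zebra": 1}  # hypothetical counts
paths = build_huffman_paths(freqs)
hidden = [0.5, -0.2]                                   # hypothetical hidden vector
node_vectors = [[0.3, 0.1], [-0.4, 0.2], [0.1, 0.7]]   # one per inner node (V - 1)
total = sum(word_prob(w, hidden, node_vectors, paths) for w in freqs)
```

Frequent words sit close to the root, so they need fewer node evaluations, and the leaf probabilities sum to one without ever computing a score for every word.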

Negative Sampling

Negative sampling approximates the softmax by updating a small set of negative examples per training step, dramatically lowering computational cost.
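The per-pair objective can be sketched as below. The vectors and counts are toy values and the function names are hypothetical; the 0.75 exponent on the unigram counts follows the word2vec paper rather than anything stated in this article.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to a power, then normalized."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sample_negatives(dist, k, exclude, rng):
    """Draw k noise words, skipping the true context word."""
    words = [w for w in dist if w != exclude]
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

rng = random.Random(0)
dist = noise_distribution({"the": 50, "cat": 10, "sat": 8, "zebra": 1})
negatives = sample_negatives(dist, k=3, exclude="cat", rng=rng)

# The loss is low when the true pair is aligned and the negative is not.
good = negative_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = negative_sampling_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Each step touches only the true context vector and the k sampled negatives, instead of every output vector in the vocabulary.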

Word Vector Fine‑tuning

Experiments with CNN text classifiers show that fine‑tuning pretrained embeddings (the CNN‑non‑static setting) generally outperforms keeping them frozen (CNN‑static), while a multichannel approach that uses both a frozen and a fine‑tuned copy can further improve performance on small datasets.

GloVe (2014)

GloVe combines matrix factorization and sliding‑window approaches by factorizing a global co‑occurrence matrix, learning word vectors that capture both global statistics and local context.
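The weighted least-squares objective from the GloVe paper can be sketched as follows. The co-occurrence counts and vectors are toy values, and the defaults `x_max=100` and `alpha=0.75` are the values reported in that paper, not in this article.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooccurrence, word_vecs, context_vecs, word_bias, context_bias):
    """Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over non-zero X_ij."""
    loss = 0.0
    for (i, j), x_ij in cooccurrence.items():
        if x_ij <= 0:
            continue  # log is undefined for zero counts, so they are skipped
        dot = sum(a * b for a, b in zip(word_vecs[i], context_vecs[j]))
        err = dot + word_bias[i] + context_bias[j] - math.log(x_ij)
        loss += glove_weight(x_ij) * err ** 2
    return loss

# Hypothetical two-word example.
cooccurrence = {(0, 1): 10.0, (1, 0): 10.0}
word_vecs = [[1.0, 0.0], [0.0, 1.0]]
context_vecs = [[0.0, 1.0], [1.0, 0.0]]
loss = glove_loss(cooccurrence, word_vecs, context_vecs, [0.0, 0.0], [0.0, 0.0])
```

The loss reaches zero when each dot product plus biases equals the log of the corresponding co-occurrence count, which is the sense in which the vectors factorize the global matrix.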

ELMo (2018)

ELMo generates context‑dependent word representations using a bidirectional two‑layer LSTM trained on a language modeling objective, allowing the same word to have different vectors in different contexts.
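In the ELMo paper, the final representation of a token is a task-specific weighted sum of the token layer and the two LSTM layers. A minimal sketch of that mixing step, with hypothetical layer outputs and the function name `scalar_mix` chosen for illustration:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """gamma * sum_j s_j * h_j, where s = softmax(raw_weights)."""
    s = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layer_vectors))
        for d in range(dim)
    ]

# Hypothetical outputs for one token: token layer, LSTM layer 1, LSTM layer 2.
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mixed = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])
```

With equal raw weights the result is the plain average of the layers; a downstream task can learn different weights to favor the layer that helps it most.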

GPT (Generative Pre‑Training) (2018)

GPT is a unidirectional transformer‑based language model pre‑trained on large corpora and fine‑tuned for downstream tasks. It uses only left‑to‑right context.
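The left-to-right restriction is typically enforced with a causal mask inside self-attention. The sketch below shows only the masking and normalization of attention scores; the score matrix is hypothetical and the rest of the transformer is omitted.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is the raw score of position i attending to position j.

    Position i may only attend to positions 0..i; later positions get weight 0.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Hypothetical uniform scores for a three-token sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
```

The resulting matrix is lower triangular: the first token attends only to itself, and no token receives information from positions to its right.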

BERT (2018)

BERT improves on GPT by using a bidirectional transformer and two pre‑training tasks: masked word prediction and next‑sentence prediction. This yields superior performance on many NLP benchmarks.
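The masked word prediction task can be sketched as a corruption step applied to the input. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random word and the unchanged word follow the BERT paper; the function name, toy vocabulary and sentence are assumptions.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_inputs, labels); labels are None where nothing is predicted."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                    # the model must recover this word
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(tok)                # 10%: keep the original word
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(tokens, vocab, rng)
```

Because the model sees the words on both sides of a masked position, it can use bidirectional context, which a left-to-right objective cannot.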

GPT‑2 (2019)

GPT‑2 scales up GPT with higher‑quality, larger, and more diverse data, a 1.5‑billion‑parameter model, and architectural tweaks (layer‑norm placement, initialization). It demonstrates that language models are powerful unsupervised multitask learners.

Summary

Comparisons on the MSRA dataset show that BERT‑based models significantly outperform traditional LSTM‑CRF models for slot filling (F1 = 87.5%). For intent classification, BERT‑FST achieves 71.7% F1. Ongoing work explores joint multi‑task training with BERT.

About Us

We are the Ant Financial Wealth Dialogue Algorithm Team, focusing on cutting‑edge algorithms for intelligent dialogue systems. We are hiring NLP, recommendation, and user profiling experts (P6‑P9). Contact: [email protected].
