Tracing the Evolution of Language Models: From N‑grams to GPT‑2
This article reviews the historical development of language models in natural language processing: from expert rule‑based systems, through statistical n‑grams and smoothing techniques, to neural models such as NNLM, RNN, word2vec, GloVe, and ELMo, and on to the transformer‑based breakthroughs of GPT, BERT, and GPT‑2. It closes by summarizing their impact on modern NLP tasks.
Language Model Overview
Language models essentially answer one question: how plausible is a given sentence? Historically, they have progressed from expert grammar‑rule models (up to the 1980s), to statistical language models (around 2000), and finally to neural network language models (present).
Statistical Language Models
Statistical language models estimate the probability of a sentence from counts collected over large corpora. Conditioning each word on its entire history makes the parameter space enormous and the counts sparse, which motivates the n‑gram approximation.
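As a worked equation in standard notation (not taken from the original article), the chain rule factorizes a sentence probability exactly, and an n‑gram model approximates each factor by truncating the history to the previous n-1 words:

```latex
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
\approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

The exact form needs one parameter for every possible history, which is where the huge parameter space and the sparsity come from.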
n‑gram
n‑gram models estimate the probability of a word given the previous n‑1 words. Common variants include unigram, bigram, and trigram. While simple and interpretable, they struggle with long‑range dependencies and sparsity as n grows.
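A minimal sketch of a bigram model (n = 2) estimated by maximum likelihood may make this concrete. The toy corpus, the boundary markers `<s>` and `</s>`, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])            # contexts only: every token that has a successor
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(tokens, unigrams, bigrams):
    """Multiply the bigram probabilities of every adjacent pair in the sentence."""
    padded = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        p *= bigram_prob(prev, word, unigrams, bigrams)
    return p

# Hypothetical toy corpus
corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]
unigrams, bigrams = train_bigram(corpus)
p_seen = sentence_prob(["i", "like", "tea"], unigrams, bigrams)
p_unseen = sentence_prob(["tea", "like", "i"], unigrams, bigrams)
```

In this toy corpus the seen word order gets a positive probability, while the reversed order gets exactly zero because it contains a bigram that never occurred. That is the sparsity problem in miniature.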
Smoothing
Smoothing addresses zero‑probability issues caused by data sparsity. Common methods include Laplace (add‑one) smoothing, additive smoothing (generalized Laplace), Good‑Turing smoothing, and others.
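Additive smoothing can be sketched in a few lines. With k = 1 it reduces to Laplace (add‑one) smoothing. The toy counts and the function name `add_k_prob` are assumptions for illustration only.

```python
from collections import Counter

def add_k_prob(prev, word, unigrams, bigrams, vocab_size, k=1.0):
    """Additive smoothing: (count(prev, word) + k) / (count(prev) + k * |V|)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

# Hypothetical toy counts: "like" was seen 3 times, followed by "tea" twice and "coffee" once
bigrams = Counter({("like", "tea"): 2, ("like", "coffee"): 1})
unigrams = Counter({"like": 3})
vocab = ["like", "tea", "coffee", "milk"]

p_seen = add_k_prob("like", "tea", unigrams, bigrams, len(vocab))     # (2 + 1) / (3 + 4)
p_unseen = add_k_prob("like", "milk", unigrams, bigrams, len(vocab))  # (0 + 1) / (3 + 4)
total = sum(add_k_prob("like", w, unigrams, bigrams, len(vocab)) for w in vocab)
```

The unseen pair now receives a small nonzero probability, and the distribution over the vocabulary still sums to one. Smaller values of k shift less mass away from observed events.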
Neural Network Language Models (2003)
Neural network language models (NNLM) replace discrete word representations with continuous word embeddings, reducing dimensionality and capturing similarity between words. They compute conditional probabilities via feed‑forward or recurrent networks.
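The forward pass of a feed‑forward NNLM can be sketched as follows: look up an embedding for each context word, concatenate them, pass them through a tanh hidden layer, and apply a softmax over the vocabulary. This is an untrained toy in pure Python. The sizes, random weights, and omission of bias terms are simplifying assumptions, not the configuration of any published model.

```python
import math
import random

random.seed(0)
vocab = ["<s>", "the", "cat", "sat", "mat"]
V, D, H, CONTEXT = len(vocab), 4, 8, 2   # vocab size, embedding dim, hidden units, context length

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(V, D)               # embedding table: one D-dimensional vector per word
W_h = rand_matrix(H, CONTEXT * D)   # hidden-layer weights
W_o = rand_matrix(V, H)             # output-layer weights

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def softmax(scores):
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(context_words):
    """P(w | context) for every w in the vocabulary, using untrained random weights."""
    x = [value for w in context_words for value in C[vocab.index(w)]]  # concatenated embeddings
    hidden = [math.tanh(h) for h in matvec(W_h, x)]
    return softmax(matvec(W_o, hidden))

probs = next_word_distribution(["the", "cat"])
```

Because similar words end up with nearby rows in the embedding table after training, the network can assign reasonable probabilities to word sequences it never saw verbatim.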
RNN Language Models (2010)
RNNLMs capture longer context by maintaining a hidden state that summarizes all previous words. This improves over n‑gram models, but RNNs still suffer from the vanishing gradient problem.
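The hidden‑state idea can be sketched with a plain (Elman‑style) recurrence. The weights are random and untrained, and the sizes and names are assumptions for illustration.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "mat"]
D, H = 3, 5   # embedding size, hidden size

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(len(vocab), D)   # word embeddings
W_x = rand_matrix(H, D)          # input-to-hidden weights
W_h = rand_matrix(H, H)          # hidden-to-hidden (recurrent) weights

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def encode(words):
    """Fold a word sequence into one hidden state: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    h = [0.0] * H
    for w in words:
        x = E[vocab.index(w)]
        h = [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
    return h

h_forward = encode(["the", "cat", "sat"])
h_reordered = encode(["cat", "the", "sat"])
```

The two sequences end in the same word yet produce different hidden states, because the state carries information about the whole history. Each step also multiplies gradients by a tanh derivative and the recurrent weights, and that repeated multiplication is why gradients tend to shrink over long sequences.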
Word2Vec (2013) – CBOW & Skip‑gram
Word2Vec learns word embeddings using shallow neural networks. CBOW predicts a target word from its context, while Skip‑gram predicts surrounding words from a target word. Training optimizations include hierarchical softmax and negative sampling.
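The difference between the two objectives shows up in how training examples are built from a sliding window. A minimal sketch, with hypothetical function names and a toy sentence:

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: the neighbors jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "cat", "sat", "down"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

Skip‑gram yields one example per (center, neighbor) pair, while CBOW yields one example per position with all neighbors pooled into a single context.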
Hierarchical Softmax
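Hierarchical softmax replaces the flat softmax output layer with a binary Huffman tree whose leaves are the vocabulary words. The probability of a word becomes a product of binary decisions along its root‑to‑leaf path, so each training step updates only the inner nodes on that path, reducing the per‑word cost from O(V) to roughly O(log V) for a vocabulary of size V.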
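A sketch of the idea, assuming hypothetical word frequencies and random inner‑node vectors: build a Huffman tree over the vocabulary, then score a word as a product of sigmoid decisions along its path.

```python
import heapq
import itertools
import math
import random

def build_huffman_paths(freqs):
    """Return {word: [(inner_node_id, bit), ...]} describing each root-to-leaf path."""
    tie = itertools.count()   # tie-breaker so the heap never compares node payloads
    heap = [(f, next(tie), ("leaf", w)) for w, f in freqs.items()]
    heapq.heapify(heap)
    inner_ids = itertools.count()
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), ("inner", next(inner_ids), left, right)))
    paths = {}
    def walk(node, prefix):
        if node[0] == "leaf":
            paths[node[1]] = prefix
        else:
            _, node_id, left, right = node
            walk(left, prefix + [(node_id, 0)])
            walk(right, prefix + [(node_id, 1)])
    walk(heap[0][2], [])
    return paths

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def word_prob(word, hidden, paths, node_vectors):
    """P(word | hidden) as a product of binary left/right decisions along the path."""
    p = 1.0
    for node_id, bit in paths[word]:
        score = sum(h * v for h, v in zip(hidden, node_vectors[node_id]))
        p *= sigmoid(score) if bit == 0 else sigmoid(-score)
    return p

random.seed(0)
freqs = {"the": 50, "on": 20, "cat": 10, "sat": 8, "mat": 5}   # hypothetical counts
paths = build_huffman_paths(freqs)
node_vectors = {i: [random.uniform(-1, 1) for _ in range(3)] for i in range(len(freqs) - 1)}
hidden = [0.2, -0.4, 0.7]
total = sum(word_prob(w, hidden, paths, node_vectors) for w in freqs)
```

Frequent words sit near the root and have short paths, so they are the cheapest to score. Since the two branches at every inner node have probabilities that sum to one, the leaf probabilities form a valid distribution without ever normalizing over the whole vocabulary.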
Negative Sampling
Negative sampling avoids computing the full softmax: each training step updates only the observed context word and a small set of randomly sampled negative words, dramatically lowering computational cost.
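The per‑pair objective can be sketched as a handful of binary classifications. The 0.75 exponent on the unigram distribution follows the original word2vec paper rather than this article, and the vectors and counts below are made up for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """-log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c) for one training pair."""
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, center_vec)))
    return loss

def sample_negatives(freqs, k, exclude, rng):
    """Draw k words from the unigram distribution raised to the 0.75 power (an assumption)."""
    words = [w for w in freqs if w != exclude]
    weights = [freqs[w] ** 0.75 for w in words]
    return rng.choices(words, weights=weights, k=k)

center = [1.0, 0.0]
# True context aligned with the center, negative pointing away: low loss
loss_aligned = negative_sampling_loss(center, [1.0, 0.0], [[-1.0, 0.0]])
# True context pointing away, negative aligned: high loss
loss_misaligned = negative_sampling_loss(center, [-1.0, 0.0], [[1.0, 0.0]])
negatives = sample_negatives({"the": 100, "cat": 5, "sat": 3}, k=5, exclude="cat", rng=random.Random(0))
```

Only the vectors of the true context word and the k sampled negatives appear in the loss, so the cost per step is independent of vocabulary size.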
Word Vector Fine‑tuning
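In CNN text‑classification experiments, fine‑tuning pretrained embeddings during training (the CNN‑non‑static setting) generally outperforms keeping them frozen (static). A multichannel approach, which uses one frozen and one fine‑tuned copy of the embeddings, can further improve performance on small datasets.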
GloVe (2014)
GloVe combines matrix factorization and sliding‑window approaches by factorizing a global co‑occurrence matrix, learning word vectors that capture both global statistics and local context.
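A minimal sketch of the two ingredients: sliding‑window co‑occurrence counts, and the weighted least‑squares term GloVe minimizes for each word pair. The defaults `x_max=100` and `alpha=0.75` follow the GloVe paper rather than this article, and the unweighted counting here is a simplification (the original also down‑weights distant pairs).

```python
import math
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Symmetric co-occurrence counts X[(w_i, w_j)] within a sliding window."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(word, tokens[j])] += 1.0
            counts[(tokens[j], word)] += 1.0
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2 for one co-occurring pair."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2

counts = cooccurrence_counts(["the", "cat", "sat", "on", "the", "mat"], window=2)
# Vectors whose dot product already equals log X_ij incur zero loss
zero_loss = glove_pair_loss([1.0, 0.0], [math.log(5.0), 0.0], 0.0, 0.0, 5.0)
```

The counts supply the global statistics, while the window that produces them is the local‑context part. Training adjusts the vectors and biases until their dot products approximate the log counts.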
ELMo (2018)
ELMo generates context‑dependent word representations using a bidirectional two‑layer LSTM trained on a language modeling objective, allowing the same word to have different vectors in different contexts.
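One detail worth sketching is how the layers are turned into a single vector per token. In the ELMo paper (not described in this article), the token embedding and the two LSTM layer outputs are combined with softmax‑normalized, task‑specific weights and a scale factor. The numbers below are placeholders, and `elmo_mix` is a hypothetical name.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_mix(layer_vectors, layer_scores, gamma=1.0):
    """Weighted sum of per-layer vectors for one token: gamma * sum_j s_j * h_j."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [gamma * sum(w * layer[d] for w, layer in zip(weights, layer_vectors))
            for d in range(dim)]

# Placeholder vectors for one token: token embedding, LSTM layer 1, LSTM layer 2
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_mix(layers, layer_scores=[0.0, 0.0, 0.0])   # equal scores give a plain average
```

Because the LSTM layer outputs depend on the whole sentence, the mixed vector for a word changes with its context, unlike a static word2vec or GloVe embedding.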
GPT (Generative Pre‑Training) (2018)
GPT is a unidirectional transformer‑based language model pre‑trained on large corpora and fine‑tuned for downstream tasks. It uses only left‑to‑right context.
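The left‑to‑right constraint is enforced by a causal mask inside self‑attention. The sketch below is a deliberately stripped‑down single head with Q = K = V equal to the input and no learned projections. That is an assumption made for brevity, not GPT's actual layer.

```python
import math

def causal_self_attention(x):
    """Single-head attention with Q = K = V = x and a left-to-right (causal) mask."""
    n, d = len(x), len(x[0])
    out = []
    for t in range(n):
        # Position t may only attend to positions 0..t; later positions are masked out
        scores = [sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d) for s in range(t + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * x[s][k] for s, w in enumerate(weights)) for k in range(d)])
    return out

seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [-3.0, 2.0]]   # differs from seq_a only at the last position
out_a = causal_self_attention(seq_a)
out_b = causal_self_attention(seq_b)
```

Changing the last token leaves the outputs at earlier positions untouched, which is what allows the model to be trained to predict each next word without seeing it.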
BERT (2018)
BERT improves on GPT by using a bidirectional transformer and two pre‑training tasks: masked word prediction and next‑sentence prediction. This yields superior performance on many NLP benchmarks.
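The masked word prediction task can be sketched as a corruption step applied to the input. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token come from the BERT paper, not from this article. The function name and toy vocabulary are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """Return (corrupted inputs, labels); labels are None where no prediction is required."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                      # position not selected for prediction
        labels[i] = token                 # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"          # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab) # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

tokens = "the cat sat on the mat".split()
# A high mask_rate is used here only so the toy example is likely to select something
inputs, labels = mask_tokens(tokens, vocab=["dog", "ran", "rug"], rng=random.Random(0), mask_rate=0.5)
```

Since the target word is hidden from the input, the model can safely attend to context on both sides, which is what makes the bidirectional transformer trainable.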
GPT‑2 (2019)
GPT‑2 scales up GPT with higher‑quality, larger, and more diverse data, a 1.5‑billion‑parameter model, and architectural tweaks (layer‑norm placement, initialization). It demonstrates that language models are powerful unsupervised multitask learners.
Summary
Comparisons on the MSRA dataset show that BERT‑based models significantly outperform traditional LSTM‑CRF models for slot filling (F1 = 87.5%). For intent classification, BERT‑FST achieves 71.7% F1. Ongoing work explores joint multi‑task training with BERT.
About Us
We are the Ant Financial Wealth Dialogue Algorithm Team, focusing on cutting‑edge algorithms for intelligent dialogue systems. We are hiring NLP, recommendation, and user profiling experts (P6‑P9). Contact: [email protected].
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
