Illustrated Guide to GPT-2: Detailed Explanation of the Decoder‑Only Transformer Model
This article provides a comprehensive, illustrated walkthrough of OpenAI's GPT‑2 language model. It covers the decoder‑only Transformer architecture, self‑attention mechanisms, token processing, training data, differences from BERT, and applications beyond language modeling, enriched with visual diagrams and code snippets for deeper understanding.
Introduction
This translated article (originally from The Illustrated GPT‑2) offers a visual and textual deep dive into GPT‑2, the language model released by OpenAI in February 2019.
1. GPT‑2 and Language Models
1.1 What is a Language Model?
A language model predicts the next word given a preceding context. GPT‑2 can be seen as a vastly scaled‑up version of a smartphone keyboard's next‑word suggestion, trained on the 40 GB WebText dataset. The smallest GPT‑2 variant requires ~500 MB of storage, while the largest exceeds 6.5 GB.
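As a toy illustration of "predict the next word", the sketch below uses a bigram count model: for each word, it simply picks the most frequent successor seen in a tiny corpus. This is not GPT‑2's mechanism, just the simplest possible language model under that definition.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: for each word, how often each successor follows it.
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent successor of `word` in the toy corpus."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once
```

GPT‑2 replaces the count table with a deep neural network, but the interface is the same: context in, next‑token prediction out.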
1.2 Transformer Foundations
Original Transformers consist of an Encoder and a Decoder. GPT‑2 uses only the Decoder stack, making it a decoder‑only Transformer. Stacking many decoder layers and training on massive text corpora yields the impressive capabilities of GPT‑2.
1.3 Difference from BERT
GPT‑2 employs a masked self‑attention decoder, generating one token at a time (auto‑regressive), whereas BERT uses an encoder with bidirectional self‑attention and is not auto‑regressive.
2. Self‑Attention Mechanics
2.1 Standard Self‑Attention
Self‑attention computes three vectors for each token: a Query, a Key, and a Value. The Query of a token is dotted with all Keys to obtain scores, which are softmax‑normalized and used to weight the Values, producing a context‑aware representation.
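A minimal NumPy sketch of this computation (single head, with random matrices standing in for learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (seq_len x d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # dot every query with every key, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                         # weighted sum of values per token

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # each token gets a context-aware vector: (4, 8)
```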
2.2 Masked Self‑Attention
In language modeling, future tokens are masked. An attention mask adds a large negative value (e.g., –1e9) to disallowed positions before the softmax, ensuring a token cannot attend to tokens to its right.
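A sketch of how such a mask works; the raw scores are set to zero here purely so the resulting weights are easy to read:

```python
import numpy as np

def causal_mask(seq_len):
    """Large negative values above the diagonal disallow attending to future tokens."""
    return np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9

scores = np.zeros((4, 4))            # pretend raw attention scores
masked = scores + causal_mask(4)     # add mask before the softmax
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i attends uniformly over positions 0..i and puts ~0 weight on later tokens.
```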
2.3 GPT‑2 Specifics
GPT‑2 stores the Key and Value vectors of previously generated tokens, re‑using them in later steps to avoid recomputation. Each decoder layer performs the following steps:
Create the Query, Key, and Value matrices via learned weight matrices.
Score the current token’s Query against all stored Keys (masked for future positions).
Weight the stored Values by the scores and sum them.
Concatenate the results of all attention heads and project back to the model dimension.
Pass the output through a two‑layer feed‑forward network (first layer 4× the model size, second layer projects back to the model dimension).
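The attention steps above, including the Key/Value cache, can be sketched as a minimal single‑head decode step. Random matrices stand in for learned weights, and the feed‑forward network and multi‑head concatenation are omitted for brevity; masking is implicit because the cache only ever contains the current and earlier tokens.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                           # model dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
k_cache, v_cache = [], []                       # keys/values of tokens seen so far

def decode_step(x):
    """Masked self-attention for the newest token vector x, reusing cached K/V."""
    q = x @ Wq
    k_cache.append(x @ Wk)                      # cache this token's key and value
    v_cache.append(x @ Wv)                      # so later steps never recompute them
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)                 # score query against all stored keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over past-and-current positions
    return w @ V                                # weighted sum of stored values

for _ in range(3):                              # feed three tokens one at a time
    out = decode_step(rng.normal(size=d))
print(out.shape, len(k_cache))
```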
The model processes up to 1024 tokens per sequence, generating one token per iteration until an end‑of‑text token (<|endoftext|>) is produced.
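The outer generation loop can be illustrated with a toy deterministic "model", here just a hypothetical lookup table rather than a neural network: emit one token per iteration, stop at the end‑of‑text token or the context limit.

```python
END = "<|endoftext|>"

# Toy stand-in for the model: maps the last token to the next one (an assumption
# for illustration only; GPT-2 conditions on the whole context, not one token).
next_token = {"<s>": "Hello", "Hello": "world", "world": END}

def generate(max_len=1024):
    tokens = ["<s>"]
    while len(tokens) < max_len:                # respect the context limit
        nxt = next_token[tokens[-1]]            # one new token per iteration
        if nxt == END:                          # stop at the end-of-text token
            break
        tokens.append(nxt)
    return tokens[1:]

print(generate())  # ['Hello', 'world']
```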
3. Beyond Language Modeling
3.1 Machine Translation
Although traditional translation uses an Encoder‑Decoder, a decoder‑only Transformer can also perform translation by conditioning on a source sentence.
3.2 Summarization
GPT‑2 can be fine‑tuned to generate summaries of Wikipedia articles, learning to map long inputs to concise outputs.
3.3 Transfer Learning
Works such as "Sample Efficient Text Summarization Using a Single Pre‑Trained Transformer" demonstrate that a decoder‑only model pre‑trained on language modeling can be adapted to downstream tasks with strong results.
3.4 Music Generation
The Music Transformer applies the same decoder‑only architecture to sequences of MIDI events, modeling both pitch and velocity to generate expressive music.
Conclusion
The article concludes that understanding the inner workings of GPT‑2—its token embeddings, positional encodings, masked self‑attention, and feed‑forward layers—provides a solid foundation for exploring advanced Transformer‑based models across various domains.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.