Illustrated Guide to GPT-2: Detailed Explanation of the Decoder‑Only Transformer Model
This article provides a comprehensive, illustrated walkthrough of OpenAI's GPT‑2 language model. It covers the decoder‑only Transformer architecture, self‑attention mechanisms, token processing, training data, differences from BERT, and applications beyond language modeling, enriched with visual diagrams and code snippets for deeper understanding.
Introduction
This translated article (originally from The Illustrated GPT‑2) offers a visual and textual deep dive into GPT‑2, the language model released by OpenAI in February 2019.
1. GPT‑2 and Language Models
1.1 What is a Language Model?
A language model predicts the next word given a preceding context. GPT‑2 can be seen as a vastly scaled‑up version of a smartphone keyboard's next‑word suggestion, trained on the 40 GB WebText dataset. The smallest GPT‑2 variant requires ~500 MB of storage, while the largest exceeds 6.5 GB.
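As a toy illustration of "predict the next word", the sketch below uses a bigram count model: for each word, it simply picks the most frequent successor seen in a tiny corpus. This is not GPT‑2's mechanism, just the simplest possible language model under that definition.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams: for each word, how often each successor follows it.
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent successor of `word` in the toy corpus."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once
```

GPT‑2 replaces the count table with a deep neural network, but the interface is the same: context in, next‑token prediction out.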
1.2 Transformer Foundations
Original Transformers consist of an Encoder and a Decoder. GPT‑2 uses only the Decoder stack, making it a decoder‑only Transformer. Stacking many decoder layers and training on massive text corpora yields the impressive capabilities of GPT‑2.
1.3 Difference from BERT
GPT‑2 employs a masked self‑attention decoder, generating one token at a time (auto‑regressive), whereas BERT uses an encoder with bidirectional self‑attention and is not auto‑regressive.
2. Self‑Attention Mechanics
2.1 Standard Self‑Attention
Self‑attention computes three vectors for each token: a Query, a Key, and a Value. The Query of a token is dotted with all Keys to obtain scores, which are softmax‑normalized and used to weight the Values, producing a context‑aware representation.
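A minimal NumPy sketch of this computation (single head, with random matrices standing in for learned weights):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X (seq_len x d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens into query/key/value space
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # dot every query with every key, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                         # weighted sum of values per token

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # each token gets a context-aware vector: (4, 8)
```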
2.2 Masked Self‑Attention
In language modeling, future tokens are masked. An attention mask adds a large negative value (e.g., –1e9) to disallowed positions before the softmax, ensuring a token cannot attend to tokens to its right.
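A sketch of how such a mask works; the raw scores are set to zero here purely so the resulting weights are easy to read:

```python
import numpy as np

def causal_mask(seq_len):
    """Large negative values above the diagonal disallow attending to future tokens."""
    return np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9

scores = np.zeros((4, 4))            # pretend raw attention scores
masked = scores + causal_mask(4)     # add mask before the softmax
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i attends uniformly over positions 0..i and puts ~0 weight on later tokens.
```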
2.3 GPT‑2 Specifics
GPT‑2 stores the Key and Value vectors of previously generated tokens, re‑using them in later steps to avoid recomputation. Each decoder layer performs the following steps:
Create the Query, Key, and Value matrices via learned weight matrices.
Score the current token’s Query against all stored Keys (masked for future positions).
Weight the stored Values by the scores and sum them.
Concatenate the results of all attention heads and project back to the model dimension.
Pass the output through a two‑layer feed‑forward network (first layer 4× the model size, second layer projects back to the model dimension).
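The attention steps above, including the Key/Value cache, can be sketched as a minimal single‑head decode step. Random matrices stand in for learned weights, and the feed‑forward network and multi‑head concatenation are omitted for brevity; masking is implicit because the cache only ever contains the current and earlier tokens.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                           # model dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
k_cache, v_cache = [], []                       # keys/values of tokens seen so far

def decode_step(x):
    """Masked self-attention for the newest token vector x, reusing cached K/V."""
    q = x @ Wq
    k_cache.append(x @ Wk)                      # cache this token's key and value
    v_cache.append(x @ Wv)                      # so later steps never recompute them
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)                 # score query against all stored keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                                # softmax over past-and-current positions
    return w @ V                                # weighted sum of stored values

for _ in range(3):                              # feed three tokens one at a time
    out = decode_step(rng.normal(size=d))
print(out.shape, len(k_cache))
```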
The model processes up to 1024 tokens per sequence, generating one token per iteration until an end‑of‑text token (<|endoftext|>) is produced.
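The outer generation loop can be illustrated with a toy deterministic "model", here just a hypothetical lookup table rather than a neural network: emit one token per iteration, stop at the end‑of‑text token or the context limit.

```python
END = "<|endoftext|>"

# Toy stand-in for the model: maps the last token to the next one (an assumption
# for illustration only; GPT-2 conditions on the whole context, not one token).
next_token = {"<s>": "Hello", "Hello": "world", "world": END}

def generate(max_len=1024):
    tokens = ["<s>"]
    while len(tokens) < max_len:                # respect the context limit
        nxt = next_token[tokens[-1]]            # one new token per iteration
        if nxt == END:                          # stop at the end-of-text token
            break
        tokens.append(nxt)
    return tokens[1:]

print(generate())  # ['Hello', 'world']
```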
3. Beyond Language Modeling
3.1 Machine Translation
Although traditional translation uses an Encoder‑Decoder, a decoder‑only Transformer can also perform translation by conditioning on a source sentence.
3.2 Summarization
GPT‑2 can be fine‑tuned to generate summaries of Wikipedia articles, learning to map long inputs to concise outputs.
3.3 Transfer Learning
Works such as "Sample Efficient Text Summarization Using a Single Pre‑Trained Transformer" demonstrate that a decoder‑only model pre‑trained on language modeling can be adapted to downstream tasks with strong results.
3.4 Music Generation
The Music Transformer applies the same decoder‑only architecture to sequences of MIDI events, modeling both pitch and velocity to generate expressive music.
Conclusion
The article concludes that understanding the inner workings of GPT‑2—its token embeddings, positional encodings, masked self‑attention, and feed‑forward layers—provides a solid foundation for exploring advanced Transformer‑based models across various domains.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.