Unlocking ChatGPT: A Deep Dive into Transformers, Tokenization, and Self‑Attention
This tutorial walks through the fundamentals of ChatGPT by explaining language modeling, character‑level tokenization, data preprocessing pipelines, the evolution from simple bigram models to scaled dot‑product self‑attention, multi‑head mechanisms, full Transformer blocks, and how to train and generate Shakespeare‑style text with a GPT model.
1. The Technology Revolution Behind ChatGPT: Understanding the Transformer Language Model
When ChatGPT appeared, it generated text one token at a time, essentially playing an extremely sophisticated "word-chain" game.
Language modeling is about teaching a computer to predict the next character given a context, which requires learning statistical patterns from massive text corpora.
Language Modeling Basics: The Ultimate Word-Chain Game
# This is what language modeling does
input: "I am happy"
predict: " today"  # most likely next token

Context matters: the same preceding character "好" ("good") can lead to different predictions depending on the surrounding words.
Transformer: The 2017 Paper That Changed Everything
The 2017 paper "Attention is All You Need" introduced the Transformer, which lets every token "see" every other token through attention.
# Simplified illustration of token communication ("我今天很开心" = "I am very happy today")
[我] ←→ [今] ←→ [天] ←→ [很] ←→ [开] ←→ [心]

2. Data Preprocessing and Encoding Foundations
Computers only understand numbers, so raw text must be converted into numeric IDs (tokenization). Two common schemes are:
Character‑level tokenization (simple, small vocab, longer sequences)
Sub‑word tokenization (used by ChatGPT, larger vocab, shorter sequences)
Example of building a character‑to‑ID map from Shakespeare:
text = "To be or not to be, that is the question."
chars = sorted(list(set(text)))
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}

After tokenization, the data is split into training (90%) and validation (10%) sets, so that memorization of the training data shows up as poor validation loss.
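The mapping above extends naturally to `encode`/`decode` helpers and the 90/10 split described in the text; a minimal sketch (the helper names are illustrative, not from the original):

```python
import torch

text = "To be or not to be, that is the question."
chars = sorted(list(set(text)))
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # string -> list of integer token IDs
    return [char_to_int[c] for c in s]

def decode(ids):
    # list of integer token IDs -> string
    return ''.join(int_to_char[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))            # first 90% for training
train_data, val_data = data[:n], data[n:]
```

Encoding then decoding is lossless, which is easy to verify before training on a larger corpus.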
Batching and Context Windows
Because the dataset is huge, it is processed in small batches. Each batch contains batch_size independent sequences of block_size tokens, which the model processes in parallel.
batch_size = 4   # sequences processed in parallel
block_size = 8   # maximum context length
xb, yb = get_batch('train')
# xb shape: (4, 8)
# yb shape: (4, 8)

3. From Simple to Complex: Bigram Model to Self-Attention
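A minimal sketch of the `get_batch` helper used above, assuming the data splits are 1-D tensors of token IDs (the stand-in random data here is only for illustration):

```python
import torch

batch_size = 4
block_size = 8

# stand-in data; in the tutorial these come from the encoded Shakespeare text
train_data = torch.randint(0, 65, (1000,))
val_data = torch.randint(0, 65, (100,))

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # pick batch_size random starting offsets
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])        # inputs
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])    # targets: inputs shifted by one
    return x, y

xb, yb = get_batch('train')
```

The key detail is the one-token shift: position t in `y` is the "correct next token" for positions 0..t of `x`.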
Bigram Model – The Near‑Sighted AI
A bigram predicts the next token based only on the immediate previous token.
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

Although simple, the bigram model learns basic character co-occurrence patterns.
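Sampling from a bigram model is straightforward: take the logits for the last position, convert them to probabilities, and draw one token. A hedged sketch of such a `generate` loop (the method is illustrative, not part of the class as printed above):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

vocab_size = 65
torch.manual_seed(0)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.token_embedding_table(idx)  # (B, T, vocab_size) logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits = self(idx)[:, -1, :]             # logits for the last position only
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)  # append and continue
        return idx

model = BigramLanguageModel(vocab_size)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=10)
```

Untrained, this produces noise; the same loop produces recognizable text once the embedding table has been trained.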
Why Simple Aggregation Is Not Enough
We need every token to see the entire preceding context, not just the immediate neighbor. A naïve loop can compute an average of all previous tokens, but it is too slow.
def simple_communication(x):
    B, T, C = x.shape
    xbow = torch.zeros((B, T, C))  # "bag of words": running averages
    for b in range(B):
        for t in range(T):
            xprev = x[b, :t+1]               # all tokens up to and including t
            xbow[b, t] = torch.mean(xprev, 0)
    return xbow

Using a lower-triangular matrix and a single matrix multiplication achieves the same result orders of magnitude faster.
def matrix_communication(x):
    B, T, C = x.shape
    tril = torch.tril(torch.ones(T, T))
    wei = tril / tril.sum(1, keepdim=True)  # each row averages its prefix
    xbow2 = wei @ x                          # (T, T) @ (B, T, C) -> (B, T, C)
    return xbow2

Dynamic Weights: The Birth of Attention
Fixed uniform weights ignore the fact that some previous tokens are more relevant than others. Attention instead computes data-dependent weights from the similarity between queries and keys.
def attention_communication(x):
    B, T, C = x.shape
    queries = x  # raw features serve as both query and key in this illustration
    wei = torch.zeros(T, T)
    for i in range(T):
        for j in range(i+1):
            # similarity between position i and earlier position j
            # (computed on the first batch element only, for clarity)
            similarity = torch.dot(queries[0, i], queries[0, j])
            wei[i, j] = similarity
    tril = torch.tril(torch.ones(T, T))
    wei = wei.masked_fill(tril == 0, float('-inf'))
    wei = F.softmax(wei, dim=-1)
    xbow3 = wei @ x
    return xbow3, wei

Scaled Dot-Product Attention
The core formula is:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Scaling by √dₖ prevents the softmax from saturating.
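To see why the scaling matters: dot products of random dₖ-dimensional vectors have standard deviation ≈ √dₖ, so without scaling the softmax concentrates almost all mass on one token. A small numeric sketch:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 256
q = torch.randn(d_k)
keys = torch.randn(16, d_k)

scores = keys @ q  # raw dot products; std grows like sqrt(d_k)

raw = F.softmax(scores, dim=-1)                       # tends to collapse onto one key
scaled = F.softmax(scores / math.sqrt(d_k), dim=-1)   # much flatter distribution
```

The flatter scaled distribution keeps gradients flowing to many positions instead of just the single highest-scoring one.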
import math

def scaled_dot_product_attention(q, k, v, mask=None):
    scores = torch.matmul(q, k.transpose(-2, -1))
    scores = scores / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, v)
    return output, attention_weights

Causal Mask – Preventing the Model from Seeing the Future
def create_causal_mask(seq_len):
    mask = torch.tril(torch.ones(seq_len, seq_len))  # 1s on and below the diagonal
    return mask

Applying the mask ensures that token *i* can only attend to tokens at positions ≤ *i*.
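That claim is easy to check directly: build random scores, apply the mask, softmax, and verify every weight above the diagonal is exactly zero (a self-contained sketch):

```python
import torch
import torch.nn.functional as F

T = 5
torch.manual_seed(0)
scores = torch.randn(T, T)

mask = torch.tril(torch.ones(T, T))                     # the causal mask from above
masked = scores.masked_fill(mask == 0, float('-inf'))   # block future positions
wei = F.softmax(masked, dim=-1)                         # rows are valid distributions

future = torch.triu(torch.ones(T, T), diagonal=1).bool()  # strictly-above-diagonal entries
```

Because the masked positions were set to -inf before the softmax, they contribute exactly zero probability, while each row still sums to one.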
Multi‑Head Attention – Seeing from Many Perspectives
class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([SelfAttentionHead(head_size) for _ in range(n_head)])
        self.proj = nn.Linear(head_size * n_head, n_embd)  # n_embd is a module-level config

    def forward(self, x):
        head_outputs = [h(x) for h in self.heads]       # each head attends independently
        concatenated = torch.cat(head_outputs, dim=-1)  # (B, T, n_head * head_size)
        return self.proj(concatenated)

Each head learns a different type of relationship (syntactic, semantic, local, long-range, etc.).
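In practice, production implementations rarely loop over a ModuleList; all heads are computed in one batched matrix multiply by reshaping the tensor into a head dimension. A shapes-only sketch of that idea (projection weights and the causal mask are omitted here; the names are illustrative):

```python
import torch

B, T, n_head, head_size = 2, 8, 4, 16
n_embd = n_head * head_size
x = torch.randn(B, T, n_embd)

# split the embedding into heads: (B, T, n_embd) -> (B, n_head, T, head_size)
xh = x.view(B, T, n_head, head_size).transpose(1, 2)

# one attention matmul now covers every head at once: (B, n_head, T, T)
wei = torch.softmax(xh @ xh.transpose(-2, -1) / head_size**0.5, dim=-1)

# merge heads back: (B, n_head, T, head_size) -> (B, T, n_embd)
out = (wei @ xh).transpose(1, 2).reshape(B, T, n_embd)
```

The result is mathematically equivalent to running the heads in a loop, but it is a single large matmul, which GPUs execute far more efficiently.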
4. Self‑Attention Core Implementation: Query, Key, Value
class SelfAttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # causal mask, stored as a non-trainable buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        out, _ = scaled_dot_product_attention(q, k, v, mask=self.tril[:T, :T])
        return out

Transformer Block – The AI "Thinking Loop"
class TransformerBlock(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # communicate: attention, with a residual connection
        x = x + self.ffwd(self.ln2(x))  # compute: feed-forward, with a residual connection
        return x

Residual connections preserve the original signal and enable stable gradient flow; LayerNorm normalizes each token's features to keep training stable.
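LayerNorm's per-token behavior is easy to verify: after normalization, each token's feature vector has roughly zero mean and unit standard deviation, regardless of how the input was shifted or scaled (a small sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(32)
x = torch.randn(2, 8, 32) * 10 + 5   # deliberately shifted and scaled input

y = ln(x)
# each (batch, time) position is normalized over its own 32 features
per_token_mean = y.mean(dim=-1)
per_token_std = y.std(dim=-1, unbiased=False)
```

This is why LayerNorm (unlike BatchNorm) behaves identically at train and inference time: the statistics depend only on the individual token, not on the batch.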
Feed‑Forward Network – Deep Thinking per Token
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand to 4x width
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # project back down
            nn.Dropout(0.2),
        )

    def forward(self, x):
        return self.net(x)

5. Building and Training the Full GPT Model
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.apply(self._init_weights)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Training uses a standard loop with the AdamW optimizer, periodic evaluation on a held-out validation set, and gradient back-propagation.
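The periodic evaluation is usually an `estimate_loss` helper that averages the loss over a few batches from each split with gradients disabled. A hedged, self-contained sketch (the stand-in embedding model and random `get_batch` here only make the sketch runnable; in the tutorial these are the real GPT model and batching helper):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

eval_iters = 10
vocab_size, batch_size, block_size = 65, 4, 8
model = nn.Embedding(vocab_size, vocab_size)  # bigram-style stand-in model

def get_batch(split):
    data = torch.randint(0, vocab_size, (1000,))
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()  # disable dropout etc. during evaluation
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = model(xb)  # (B, T, vocab_size)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        out[split] = losses.mean().item()
    model.train()
    return out
```

Averaging over several batches gives a far less noisy loss estimate than the single-batch training loss printed inside the loop.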
model = GPTLanguageModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for iter in range(max_iters):
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if iter % eval_interval == 0:
        print(f"Step {iter}: train loss {loss.item():.4f}")

Text Generation
def generate_text(model, max_new_tokens=500):
    model.eval()
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    generated = []
    for _ in range(max_new_tokens):
        # crop to the last block_size tokens: the position embedding
        # table cannot index beyond the context window
        idx_cond = context[:, -block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)
        context = torch.cat((context, next_token), dim=1)
        generated.append(next_token.item())
    return decode(generated)

sample = generate_text(model, max_new_tokens=1000)
print(sample)

The resulting text mimics Shakespearean style, demonstrating that a relatively small model can learn coherent language patterns.
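The loop above always samples from the raw softmax. A common variant adds a temperature knob: dividing the logits by a value below 1.0 sharpens the distribution (more conservative text), above 1.0 flattens it (more diverse text). A hedged sketch of the idea, not part of the original code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 65)  # stand-in for the model's last-position logits

def sample_with_temperature(logits, temperature=1.0):
    # T < 1 sharpens the distribution, T > 1 flattens it
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

cold = F.softmax(logits / 0.5, dim=-1)  # sharper: top tokens dominate
hot = F.softmax(logits / 2.0, dim=-1)   # flatter: more tokens get sampled
```

Plugging `sample_with_temperature` into the generation loop in place of the plain softmax gives direct control over the creativity/coherence trade-off.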
From GPT to ChatGPT – Alignment
To turn a pure language model into a helpful assistant, four stages are required:
Pre‑training (already covered).
Supervised fine‑tuning on instruction‑response pairs.
Training a reward model to score answer quality.
Reinforcement learning (PPO) using the reward model to align the assistant with human preferences.
This pipeline yields a model that can answer questions, follow instructions, engage in dialogue, and refuse inappropriate requests.
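Stage 3's reward model is typically trained with a pairwise ranking objective: given two answers where humans preferred one, the loss pushes the reward of the chosen answer above the rejected one. A hedged sketch of that objective (the scalar rewards here are stand-ins; the real model computes them from text):

```python
import torch
import torch.nn.functional as F

# stand-in scalar rewards the reward model would assign to two answer pairs
r_chosen = torch.tensor([1.2, 0.3])    # rewards for the human-preferred answers
r_rejected = torch.tensor([0.4, 0.9])  # rewards for the rejected answers

# pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
# small when the chosen answer already scores higher, large otherwise
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
```

Minimizing this loss teaches the reward model to rank answers the way human labelers did, which is exactly the signal the PPO stage then optimizes against.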
Conclusion
From the simplest bigram to the full Transformer‑based GPT, we have explored how tokenization, attention, multi‑head mechanisms, residual connections, and layer normalization combine to create a powerful language model. Understanding each component demystifies the AI breakthroughs behind ChatGPT and equips you to build, train, and extend your own models.
MoonWebTeam
Official account of MoonWebTeam. All members are former front‑end engineers from Tencent, and the account shares valuable team tech insights, reflections, and other information.
