Unlocking ChatGPT: A Deep Dive into Transformers, Tokenization, and Self‑Attention

This tutorial walks through the fundamentals of ChatGPT by explaining language modeling, character‑level tokenization, data preprocessing pipelines, the evolution from simple bigram models to scaled dot‑product self‑attention, multi‑head mechanisms, full Transformer blocks, and how to train and generate Shakespeare‑style text with a GPT model.


1. The Technology Revolution Behind ChatGPT: Understanding the Transformer Language Model

When ChatGPT appeared, it generated fluent text one token at a time, essentially playing an extremely sophisticated "word‑chain" game.


Language modeling is about teaching a computer to predict the next token (here, the next character) given a context, which requires learning statistical patterns from massive text corpora.

Language Modeling Basics: The Highest‑Level Word Chain

# This is what language modeling does
input: "I am happy"
predict: " today"  # most likely next token

Context matters: the same preceding character "好" ("good") can lead to different predictions depending on the surrounding words.

Transformer: The 2017 Paper That Changed Everything

The 2017 paper "Attention is All You Need" introduced the Transformer, which lets every token "see" every other token through attention.

# Simplified illustration of token communication for the sentence
# "我今天很开心" ("I am very happy today"): each token can attend to every other
[我] ←→ [今] ←→ [天] ←→ [很] ←→ [开] ←→ [心]

2. Data Preprocessing and Encoding Foundations

Computers only understand numbers, so raw text must be converted into numeric IDs (tokenization). Two common schemes are:

Character‑level tokenization (simple, small vocab, longer sequences)

Sub‑word tokenization (used by ChatGPT, larger vocab, shorter sequences)
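
To make the difference concrete, here is a small comparison sketch; it assumes the third-party tiktoken package (which implements GPT-style BPE) is installed, and the exact counts are illustrative:

import tiktoken  # assumption: tiktoken is installed (pip install tiktoken)

text = "To be or not to be"

# Character-level: one ID per character, tiny vocabulary.
char_vocab = sorted(set(text))
char_ids = [char_vocab.index(c) for c in text]

# Sub-word level: GPT-2's BPE merges frequent chunks into single tokens.
enc = tiktoken.get_encoding("gpt2")
subword_ids = enc.encode(text)

print(len(char_ids))     # 18 IDs, one per character
print(len(subword_ids))  # far fewer IDs (about 6 for this phrase)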

Example of building a character‑to‑ID map from Shakespeare:

text = "To be or not to be, that is the question."
chars = sorted(list(set(text)))
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}
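
With these two maps, encoding and decoding are one-liners; the decode helper below is the same one the generation code calls later in the article:

def encode(s):
    # String -> list of integer IDs.
    return [char_to_int[c] for c in s]

def decode(ids):
    # List of integer IDs -> string.
    return ''.join(int_to_char[i] for i in ids)

print(decode(encode("to be")))  # "to be": round-trips exactly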

After tokenization, the data is split into training (90%) and validation (10%) sets; the held-out portion lets us detect whether the model is simply memorizing the training data.
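
A minimal sketch of that split, assuming the full corpus has already been read into text:

import torch

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))   # first 90% of the tokens
train_data = data[:n]
val_data = data[n:]        # held-out 10% for validation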

Batching and Context Windows

Because the dataset is huge, training examples are drawn in small random batches. Each batch contains batch_size sequences of block_size tokens each, which the model processes in parallel.

batch_size = 4
block_size = 8
xb, yb = get_batch('train')
# xb shape: (4, 8)
# yb shape: (4, 8)
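
get_batch itself is not shown above; a minimal version consistent with those shapes might look like this (the targets yb are simply the inputs shifted one position to the right, which is what turns raw text into next-token prediction):

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Sample batch_size random starting offsets into the data.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])      # inputs
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets, shifted by one
    return x, y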

3. From Simple to Complex: Bigram Model to Self‑Attention

Bigram Model – The Near‑Sighted AI

A bigram predicts the next token based only on the immediate previous token.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token
        # from a (vocab_size, vocab_size) lookup table.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

Although simple, the bigram learns basic character co‑occurrence patterns.
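
A quick sanity check on the untrained model: the full Shakespeare corpus has roughly 65 distinct characters, and an untrained bigram's predictions start out roughly uniform, so the cross-entropy should be near ln(65) ≈ 4.17. This assumes chars was built from the full corpus rather than the single quote above:

import math

vocab_size = len(chars)   # ~65 when chars comes from the full Shakespeare text
model = BigramLanguageModel(vocab_size)
xb, yb = get_batch('train')
logits, loss = model(xb, yb)
print(loss.item())        # expect roughly ln(65) ≈ 4.17 before any training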

Why Simple Aggregation Is Not Enough

We need every token to see the entire preceding context, not just the immediate neighbor. A naïve loop can compute an average of all previous tokens, but it is too slow.

def simple_communication(x):
    # For every position t, average the features of all tokens up to
    # and including t. Plain Python loops: correct but very slow.
    B, T, C = x.shape
    xbow = torch.zeros((B, T, C))
    for b in range(B):
        for t in range(T):
            xprev = x[b, :t+1]                 # (t+1, C)
            xbow[b, t] = torch.mean(xprev, 0)  # average over the prefix
    return xbow

Using a lower‑triangular matrix and matrix multiplication achieves the same result orders of magnitude faster, because the Python loops collapse into a single batched matrix multiply.

def matrix_communication(x):
    B, T, C = x.shape
    tril = torch.tril(torch.ones(T, T))     # lower-triangular ones
    wei = tril / tril.sum(1, keepdim=True)  # each row averages its prefix
    xbow2 = wei @ x                         # (T, T) @ (B, T, C) -> (B, T, C)
    return xbow2
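
The two implementations can be checked against each other on a toy tensor:

x = torch.randn(4, 8, 2)                       # (B=4, T=8, C=2) toy input
xbow = simple_communication(x)
xbow2 = matrix_communication(x)
print(torch.allclose(xbow, xbow2, atol=1e-6))  # True: same result, no loops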

Dynamic Weights: The Birth of Attention

Fixed uniform weights ignore the fact that some previous tokens are more relevant than others. Attention instead computes data‑dependent weights from the similarity between queries and keys.

def attention_communication(x):
    # Illustration only: the weights are computed from the first batch
    # element and shared across the whole batch.
    B, T, C = x.shape
    queries = x
    wei = torch.zeros(T, T)
    for i in range(T):
        for j in range(i+1):
            # Raw similarity between token i and token j (here each token's
            # own features stand in for its query and key).
            wei[i, j] = torch.dot(queries[0, i], queries[0, j])
    tril = torch.tril(torch.ones(T, T))
    wei = wei.masked_fill(tril == 0, float('-inf'))  # block future tokens
    wei = F.softmax(wei, dim=-1)                     # normalize each row to weights
    xbow3 = wei @ x
    return xbow3, wei

Scaled Dot‑Product Attention

The core formula is:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Scaling by √dₖ keeps the variance of the dot products near 1, which prevents the softmax from saturating into near one‑hot distributions.
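
A small experiment makes the effect visible: dot products of independent unit-variance vectors have a standard deviation of about √dₖ, and dividing by √dₖ brings them back to unit scale so the softmax stays soft:

import math
import torch

d_k = 64
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)
raw = (q * k).sum(-1)                        # unscaled dot products
print(raw.std().item())                      # ≈ sqrt(64) = 8: softmax would saturate
print((raw / math.sqrt(d_k)).std().item())   # ≈ 1: softmax stays well-behaved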

import math

def scaled_dot_product_attention(q, k, v, mask=None):
    scores = torch.matmul(q, k.transpose(-2, -1))  # (..., T, T) similarities
    scores = scores / math.sqrt(q.shape[-1])       # scale by sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = F.softmax(scores, dim=-1)  # each row sums to 1
    output = torch.matmul(attention_weights, v)    # weighted sum of values
    return output, attention_weights

Causal Mask – Preventing the Model from Seeing the Future

def create_causal_mask(seq_len):
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask

Applying the mask ensures that token *i* can only attend to tokens ≤ *i*.
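
Putting the mask and the attention function together on a toy example:

T, d = 8, 16
q = k = v = torch.randn(1, T, d)
mask = create_causal_mask(T)   # (T, T), broadcasts over the batch dimension
out, wei = scaled_dot_product_attention(q, k, v, mask=mask)
print(out.shape)   # torch.Size([1, 8, 16])
print(wei[0, 0])   # first token can only attend to itself: [1., 0., 0., ...]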

Multi‑Head Attention – Seeing from Many Perspectives

class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        # n_head independent attention heads, each with its own Q/K/V projections.
        self.heads = nn.ModuleList([SelfAttentionHead(head_size) for _ in range(n_head)])
        self.proj = nn.Linear(head_size * n_head, n_embd)

    def forward(self, x):
        # Run the heads in parallel, concatenate along the feature
        # dimension, and project back to the embedding size.
        head_outputs = [h(x) for h in self.heads]
        concatenated = torch.cat(head_outputs, dim=-1)
        return self.proj(concatenated)

Each head learns a different type of relationship (syntactic, semantic, local, long‑range, etc.).

4. Self‑Attention Core Implementation: Query, Key, Value

class SelfAttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Causal mask, precomputed up to the maximum context length.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        # Crop the mask to the current sequence length and keep only the output.
        out, _ = scaled_dot_product_attention(q, k, v, mask=self.tril[:T, :T])
        return out

Transformer Block – The AI "Thinking Loop"

class TransformerBlock(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

Residual connections preserve the original signal and enable stable gradient flow; LayerNorm normalizes each token's features to keep training stable.

Feed‑Forward Network – Deep Thinking per Token

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(0.2),
        )
    def forward(self, x):
        return self.net(x)

5. Building and Training the Full GPT Model

class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[TransformerBlock(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.apply(self._init_weights)
    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
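
The classes above reference several global hyperparameters (vocab_size, n_embd, n_head, n_layer, block_size, and so on). The values below are an illustrative configuration for a small character-level model, not ones taken from the original article:

# Illustrative hyperparameters for a small character-level GPT (assumed values).
batch_size = 64       # sequences per training batch
block_size = 256      # maximum context length
n_embd = 384          # embedding dimension
n_head = 6            # heads per block (head_size = 384 // 6 = 64)
n_layer = 6           # number of Transformer blocks
learning_rate = 3e-4
max_iters = 5000
eval_interval = 500
device = 'cuda' if torch.cuda.is_available() else 'cpu'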

Training uses a standard loop: an AdamW optimizer, back‑propagation of the cross‑entropy loss, and periodic evaluation on the held‑out validation set.

model = GPTLanguageModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
for iter in range(max_iters):
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if iter % eval_interval == 0:
        print(f"Step {iter}: train loss {loss.item():.4f}")

Text Generation

@torch.no_grad()
def generate_text(model, max_new_tokens=500):
    model.eval()
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    generated = []
    for _ in range(max_new_tokens):
        # Crop the context to the last block_size tokens so the position
        # embedding table is never indexed out of range.
        idx_cond = context[:, -block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                 # focus on the last time step
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample from the distribution
        context = torch.cat((context, next_token), dim=1)
        generated.append(next_token.item())
    return decode(generated)

sample = generate_text(model, max_new_tokens=1000)
print(sample)

The resulting text mimics Shakespearean style, demonstrating that a relatively small model can learn coherent language patterns.

From GPT to ChatGPT – Alignment

To turn a pure language model into a helpful assistant, four stages are required:

1. Pre‑training (already covered).

2. Supervised fine‑tuning on instruction‑response pairs.

3. Training a reward model to score answer quality.

4. Reinforcement learning (PPO) using the reward model to align the assistant with human preferences.

This pipeline yields a model that can answer questions, follow instructions, engage in dialogue, and refuse inappropriate requests.

Conclusion

From the simplest bigram to the full Transformer‑based GPT, we have explored how tokenization, attention, multi‑head mechanisms, residual connections, and layer normalization combine to create a powerful language model. Understanding each component demystifies the AI breakthroughs behind ChatGPT and equips you to build, train, and extend your own models.

Tags: Python, Transformer, ChatGPT, Self-attention, GPT, language modeling
Written by MoonWebTeam

Official account of MoonWebTeam. All members are former front‑end engineers from Tencent, and the account shares valuable team tech insights, reflections, and other information.
