Unpacking the Transformer: From Embeddings to Multi‑Head Attention
This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.
Transformer Architecture Overview
The Transformer follows an Encoder‑Decoder design, allowing input and output sequences of different lengths while removing recurrent connections.
Variable‑length sequences are processed easily.
Both Encoder and Decoder are built around attention mechanisms (plus feed‑forward layers), eliminating the long‑range dependency and serial‑computation issues of RNNs.
The Encoder uses Self‑Attention; the Decoder combines Self‑Attention and Cross‑Attention, forming a universal Seq2Seq framework.
Input Embedding
Natural‑language inputs are first converted into dense vectors via an embedding layer. Key concepts:
document : a line or sentence in a dataset, variable length.
token : the smallest textual unit (character, word, sub‑word).
tokenize : splitting text into tokens.
token ID : numeric identifier for each token.
embedding : mapping token IDs to dense vectors.
vocab_size : size of the vocabulary, often tens of thousands.
embedding_dim / hidden_size : dimensionality of token vectors.
batch_size : number of documents processed per training step.
seq_len : maximum length of a document after truncation.
head_size : here, the number of attention heads; each head's dimension is d_k = hidden_size / head_size. (Many codebases instead call the head count num_heads and reserve head_size for d_k.)
sample : a concatenated token stream sliced to uniform length for efficient batching.
The embedding pipeline consists of (a minimal sketch follows the list):
Truncate long texts to seq_len.
Tokenize (normalization, pre‑tokenization, the tokenization model itself, post‑processing).
Map tokens to IDs.
Lookup embeddings to produce an input tensor of shape [seq_len, batch_size, hidden_size].
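A minimal sketch of this pipeline in PyTorch; the sizes and random token IDs below are illustrative stand‑ins for real tokenizer output:

import torch
import torch.nn as nn

vocab_size, hidden_size = 30000, 512   # illustrative sizes
batch_size, seq_len = 2, 8

embedding = nn.Embedding(vocab_size, hidden_size)

# Pretend the tokenizer already produced IDs, truncated/padded to seq_len
token_ids = torch.randint(0, vocab_size, (seq_len, batch_size))

x = embedding(token_ids)               # [seq_len, batch_size, hidden_size]
print(x.shape)                         # torch.Size([8, 2, 512])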
Positional Encoding
Since attention lacks inherent order information, positional encodings inject sequence position into token embeddings. Three main types are:
Absolute encoding : unique sinusoidal vectors for each position.
Relative encoding : encodes pairwise distance during attention computation.
Rotary encoding (RoPE) : rotates query/key vectors so that relative positions are naturally captured; widely used in modern LLMs.
The original Transformer generates a sinusoidal (sin/cos) matrix and adds it to the token embeddings, yielding position‑aware input tensors.
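A sketch of the standard sinusoidal construction (max_len and shapes here are illustrative):

import math
import torch

def sinusoidal_encoding(max_len: int, hidden_size: int) -> torch.Tensor:
    """Build the [max_len, hidden_size] sin/cos position matrix."""
    position = torch.arange(max_len).unsqueeze(1)          # [max_len, 1]
    div_term = torch.exp(torch.arange(0, hidden_size, 2).float()
                         * (-math.log(10000.0) / hidden_size))
    pe = torch.zeros(max_len, hidden_size)
    pe[:, 0::2] = torch.sin(position * div_term)           # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)           # odd dims: cosine
    return pe

# Added (broadcast over the batch) to the token embeddings:
# x = x + sinusoidal_encoding(seq_len, hidden_size).unsqueeze(1)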
Encoder
Soft‑Alignment Attention Concept
Attention treats each token as a query (Q) that searches a set of keys (K) to retrieve values (V), analogous to an information‑retrieval process.
For example, a query might assign these attention weights to three keys:

{
    "apple": 0.6,
    "banana": 0.4,
    "chair": 0.0
}

If the values stored under these keys are 10, 5, and 2 respectively, the weighted sum yields the final value:

Value = 10*0.6 + 5*0.4 + 2*0.0 = 8

Scaled Dot‑Product Attention
All attention variants (Self, Cross, Masked) share this core algorithm. Shapes after embedding (batch dimension omitted for clarity):

Q: [seq_len, d_k]
K: [seq_len, d_k]
V: [seq_len, d_k]
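In formula form, this is the scaled dot‑product attention of the original paper:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the Softmax into regions with near‑zero gradients.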
import math
import torch

def attention(query, key, value, dropout=None):
    """Compute scaled dot-product attention.
    query: [seq_len, d_k]
    key:   [seq_len, d_k]
    value: [seq_len, d_k]
    """
    d_k = query.size(-1)
    # Scale by sqrt(d_k) so the dot products don't grow with dimension
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    # Weighted sum of values; the weights are also returned for inspection
    return torch.matmul(p_attn, value), p_attn

Self‑Attention
Self‑Attention processes the input sequence in parallel, allowing every token to attend to every other token. In the example sentence "I ate an apple; it was sweet", the token "it" attends most strongly to "apple".
No RNN‑style sequential bottleneck.
Fully parallelizable across sequence positions, making efficient use of GPUs/TPUs.
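In code, self‑attention is simply the attention function above applied with Q, K, and V all derived from the same sequence (the linear projections are omitted here for brevity):

import torch

x = torch.randn(8, 64)               # [seq_len, d_k] token representations
out, weights = attention(x, x, x)    # each token attends over the whole sequence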
Multi‑Head Attention (MHA)
Single‑head attention captures only one type of relationship. MHA splits Q/K/V into h heads, each learning distinct patterns (syntactic, semantic, positional, etc.). The process (sketched in code after the list):
Linear projections split Q/K/V into h sub‑matrices.
Each head performs independent scaled dot‑product attention.
Outputs are concatenated.
A final linear layer fuses the concatenated features back to hidden_size.
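A minimal sketch of the split/attend/concat flow, reusing the attention function defined above (batch‑first shapes; real implementations often fuse the three projections into a single matrix):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.d_k = hidden_size // num_heads
        self.num_heads = num_heads
        self.w_q = nn.Linear(hidden_size, hidden_size)
        self.w_k = nn.Linear(hidden_size, hidden_size)
        self.w_v = nn.Linear(hidden_size, hidden_size)
        self.w_o = nn.Linear(hidden_size, hidden_size)

    def forward(self, q, k, v):
        # [batch, seq_len, hidden] -> [batch, num_heads, seq_len, d_k]
        def split(x):
            b, t, _ = x.shape
            return x.view(b, t, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        out, _ = attention(q, k, v)             # per-head scaled dot-product
        b, h, t, d = out.shape
        out = out.transpose(1, 2).contiguous().view(b, t, h * d)  # concat heads
        return self.w_o(out)                    # fuse back to hidden_size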
Feed‑Forward Neural Network (FFN)
The FFN provides the Transformer's main position‑wise non‑linear transformation (the Softmax inside attention aside). It consists of:
Linear expansion: W1 ∈ [hidden_size, 4·hidden_size].
ReLU activation (or GELU in later models).
Linear compression: W2 ∈ [4·hidden_size, hidden_size].
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Feed-forward network used in Transformers."""
    def __init__(self, dim: int, hidden_dim: int, dropout: float):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)   # expansion
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)   # compression
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w2(F.relu(self.w1(x))))

Residual Connection & Layer Normalization
Each sub‑layer output is added to its input (residual) before applying LayerNorm, which normalizes across the feature dimension of each sample, stabilizing training for deep stacks.
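A sketch of this residual‑plus‑normalization wrapper in the post‑norm style just described; it uses the LayerNorm class defined below (pre‑norm variants normalize before the sub‑layer instead):

import torch.nn as nn

class SublayerConnection(nn.Module):
    """x + Sublayer(x), followed by LayerNorm (post-norm)."""
    def __init__(self, size: int, dropout: float):
        super().__init__()
        self.norm = LayerNorm(size)    # defined below
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Residual add, then normalize across the feature dimension
        return self.norm(x + self.dropout(sublayer(x)))

# usage: out = sublayer_conn(x, lambda t: mha(t, t, t))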
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer normalization over the feature dimension."""
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))   # learnable scale
        self.b_2 = nn.Parameter(torch.zeros(features))  # learnable shift
        self.eps = eps

    def forward(self, x):
        # Normalize over the last (feature) dimension of each sample
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

Decoder
Autoregressive Generation
The Decoder generates tokens one at a time, feeding each newly produced token back as input for the next step (a minimal decoding loop is sketched after the list):
Start with a <Begin> token.
Apply masked self‑attention on already generated tokens.
Apply cross‑attention using Encoder outputs as K/V.
Project to vocabulary logits, apply Softmax, sample the next token.
Repeat until <End> is produced.
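A minimal greedy‑decoding sketch of these steps; model.encode and model.decode are assumed helper methods (not a fixed API) wrapping the Encoder and the masked/cross‑attention Decoder:

import torch

def greedy_decode(model, src, begin_id: int, end_id: int, max_len: int = 50):
    """Generate token IDs one at a time until <End> or max_len.
    model.encode / model.decode are assumed helpers, not a fixed API."""
    memory = model.encode(src)                    # Encoder output, reused as K/V
    ys = torch.tensor([[begin_id]])               # start with <Begin>
    for _ in range(max_len):
        logits = model.decode(ys, memory)         # masked self-attn + cross-attn
        next_id = logits[:, -1].argmax(dim=-1)    # greedy pick; sampling also works
        ys = torch.cat([ys, next_id.unsqueeze(1)], dim=1)
        if next_id.item() == end_id:
            break
    return ys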
Masked Multi‑Head Attention
Masking prevents a token from attending to future positions, ensuring causal generation. The mask fills the entries above the diagonal (the strictly upper‑triangular part, i.e., future positions) with large negative values before Softmax, driving their attention weights to effectively zero.
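A sketch of building such a mask with torch.triu and applying it before Softmax:

import torch

seq_len = 5
# True above the diagonal marks the future positions to hide
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(seq_len, seq_len)             # stand-in for QK^T / sqrt(d_k)
scores = scores.masked_fill(mask, float("-inf"))   # large negative -> ~0 after softmax
p_attn = scores.softmax(dim=-1)                    # each row attends only to the past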
Attention Optimizations
Modern LLMs adopt several optimizations to reduce the O(n²) cost of vanilla self‑attention:
Sparse Attention : compute attention only for a subset of token pairs (e.g., local windows, dilated/strided patterns).
FlashAttention : tile the computation so intermediates stay in fast on‑chip SRAM, cutting traffic to slower global GPU memory.
PagedAttention : paginate the KV cache into fixed‑size blocks to mitigate memory fragmentation (vLLM).
RadixAttention : organize the KV cache in a radix tree so shared prefixes can be reused across requests (SGLang).
Variations of the multi‑head design trade KV‑cache size against quality (a GQA‑style sketch follows the list):
MHA : independent Q/K/V per head (standard).
MQA : shared K/V across heads, minimal KV cache.
GQA : groups of heads share K/V, balancing efficiency and performance.
MLA : compresses K/V into a shared low‑rank latent vector that is cached in place of the full per‑head K/V, shrinking the cache while retaining quality.
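As a sketch of the idea behind GQA (shapes purely illustrative): with num_heads query heads and num_kv_heads < num_heads key/value heads, each K/V head is shared by a whole group of query heads:

import torch

batch, seq_len, d_k = 1, 4, 8
num_heads, num_kv_heads = 8, 2                 # 4 query heads per K/V head

q = torch.randn(batch, num_heads, seq_len, d_k)
k = torch.randn(batch, num_kv_heads, seq_len, d_k)
v = torch.randn(batch, num_kv_heads, seq_len, d_k)

# Expand each K/V head to cover its group of query heads
group = num_heads // num_kv_heads
k = k.repeat_interleave(group, dim=1)          # [batch, num_heads, seq_len, d_k]
v = v.repeat_interleave(group, dim=1)
# ...then run ordinary scaled dot-product attention; only num_kv_heads
# K/V heads ever need to be stored in the KV cache.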
Conclusion
The Transformer’s combination of parallelizable attention, multi‑head diversity, feed‑forward non‑linearity, residual pathways, and layer normalization makes it a powerful universal sequence model. Ongoing research focuses on sparsifying attention, improving cache efficiency, and designing new head‑sharing schemes to scale LLMs to ever larger contexts.