Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.


Transformer Architecture Overview

The Transformer follows an Encoder‑Decoder design, allowing input and output sequences of different lengths while removing recurrent connections.

Variable‑length sequences are processed easily.

Both Encoder and Decoder are built from attention mechanisms and position‑wise feed‑forward layers, eliminating RNN‑related long‑range dependency and serial‑computation issues.

The Encoder uses Self‑Attention; the Decoder combines Self‑Attention and Cross‑Attention, forming a universal Seq2Seq framework.
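As a quick sanity check of the Seq2Seq claim, PyTorch's built‑in nn.Transformer accepts source and target sequences of different lengths; the sizes below are arbitrary illustrations:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
d_model, nhead = 32, 4
model = nn.Transformer(d_model=d_model, nhead=nhead,
                       num_encoder_layers=2, num_decoder_layers=2,
                       dim_feedforward=64, batch_first=True)

src = torch.randn(1, 10, d_model)  # source sequence, length 10
tgt = torch.randn(1, 7, d_model)   # target sequence, length 7

out = model(src, tgt)              # output follows the target length
```

The output shape tracks the target sequence, not the source, which is what lets input and output lengths differ.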

Input Embedding

Natural‑language inputs are first converted into dense vectors via an embedding layer. Key concepts:

document : a line or sentence in a dataset, variable length.

token : the smallest textual unit (character, word, sub‑word).

tokenize : splitting text into tokens.

token ID : numeric identifier for each token.

embedding : mapping token IDs to dense vectors.

vocab_size : size of the vocabulary, often tens of thousands.

embedding_dim / hidden_size : dimensionality of token vectors.

batch_size : number of documents processed per training step.

seq_len : maximum length of a document after truncation.

head_size : the number of attention heads (often called num_heads); each head's dimension is d_k = hidden_size / head_size.

sample : a concatenated token stream sliced to uniform length for efficient batching.

The embedding pipeline consists of:

Truncate long texts to seq_len.

Tokenize (normalization, pre‑tokenization, post‑processing).

Map tokens to IDs.

Lookup embeddings to produce an input tensor of shape [seq_len, batch_size, hidden_size].
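The pipeline above can be sketched with a toy vocabulary (the words, IDs, and sizes here are invented for illustration); a batch‑first layout is used for readability:

```python
import torch
import torch.nn as nn

# Toy vocabulary and sizes (illustrative, not from a real tokenizer).
vocab = {"<pad>": 0, "i": 1, "ate": 2, "an": 3, "apple": 4}
vocab_size, hidden_size, seq_len = len(vocab), 8, 6

tokens = "i ate an apple".split()                    # tokenize
ids = [vocab[t] for t in tokens]                     # map tokens to IDs
ids = (ids + [vocab["<pad>"]] * seq_len)[:seq_len]   # pad/truncate to seq_len

embedding = nn.Embedding(vocab_size, hidden_size)    # lookup table
x = embedding(torch.tensor([ids]))                   # [batch, seq_len, hidden_size]
```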

Positional Encoding

Since attention lacks inherent order information, positional encodings inject sequence position into token embeddings. Three main types are:

Absolute encoding : unique sinusoidal vectors for each position.

Relative encoding : encodes pairwise distance during attention computation.

Rotary encoding (RoPE) : rotates query/key vectors so that relative positions are naturally captured; widely used in modern LLMs.

Transformers generate a sinusoidal matrix (sin/cos) and add it to token embeddings, yielding position‑aware input tensors.
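A minimal sketch of the sinusoidal matrix from the original paper (the sizes are illustrative):

```python
import math
import torch

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the sin/cos position matrix: even dims get sin, odd dims get cos."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
# Position-aware input = token embeddings + pe (broadcast over the batch).
```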

Encoder

Soft‑Alignment Attention Concept

Attention treats each token as a query (Q) that searches a set of keys (K) to retrieve values (V), analogous to an information‑retrieval process.

{
    "apple": 0.6,
    "banana": 0.4,
    "chair": 0.0
}

Assuming stored values of 10, 5, and 2 for "apple", "banana", and "chair", the weighted sum yields the final value:

Value = 10*0.6 + 5*0.4 + 2*0.0 = 8
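The same soft retrieval in code, with the weights from the dictionary above and the stored values 10/5/2 taken as the illustrative numbers:

```python
# Soft retrieval analogy: weights come from query-key similarity,
# and every value contributes in proportion to its weight.
weights = {"apple": 0.6, "banana": 0.4, "chair": 0.0}
values = {"apple": 10, "banana": 5, "chair": 2}   # illustrative stored values

result = sum(weights[k] * values[k] for k in weights)
```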

Scaled Dot‑Product Attention

All attention variants (Self, Cross, Masked) share this core algorithm. Shapes after embedding:

Q: [seq_len, d_k]   K: [seq_len, d_k]   V: [seq_len, d_k]

import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    """Compute scaled dot-product attention.
    query: [seq_len, d_k]
    key:   [seq_len, d_k]
    value: [seq_len, d_k]
    mask:  optional; positions where mask == 0 are blocked
    """
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
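The routine can be exercised on random tensors; the snippet below recomputes the same scores inline so it runs standalone, and checks that each row of the attention matrix is a probability distribution:

```python
import math
import torch

torch.manual_seed(0)
q = torch.randn(5, 16)   # [seq_len, d_k]
k = torch.randn(5, 16)
v = torch.randn(5, 16)

# Same computation as the attention() function above, inlined.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
p_attn = scores.softmax(dim=-1)   # each row: a distribution over the keys
out = p_attn @ v                  # [seq_len, d_k]
```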

Self‑Attention

Self‑Attention processes the input sequence in parallel, allowing every token to attend to every other token. In the example sentence "我吃了苹果，它很甜" ("I ate an apple; it was sweet"), the token "它" ("it") attends most strongly to "苹果" ("apple").

No RNN‑style sequential bottleneck.

Fully parallelizable across multiple GPUs/TPUs.

Multi‑Head Attention (MHA)

Single‑head attention captures only one type of relationship. MHA splits Q/K/V into h heads, each learning distinct patterns (syntactic, semantic, positional, etc.). The process:

Linear projections split Q/K/V into h sub‑matrices.

Each head performs independent scaled dot‑product attention.

Outputs are concatenated.

A final linear layer fuses the concatenated features back to hidden_size.
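The split–attend–concatenate–fuse procedure can be sketched compactly (sizes are invented; for brevity the three Q/K/V projections are fused into one linear layer, whereas real implementations often keep them separate):

```python
import torch
import torch.nn as nn

batch, seq_len, hidden_size, h = 2, 6, 32, 4
d_k = hidden_size // h                       # per-head dimension

x = torch.randn(batch, seq_len, hidden_size)
qkv = nn.Linear(hidden_size, 3 * hidden_size)  # fused Q/K/V projection
w_o = nn.Linear(hidden_size, hidden_size)      # final fusing layer

q, k, v = qkv(x).chunk(3, dim=-1)
# Reshape to [batch, h, seq_len, d_k]: each head attends independently.
q = q.view(batch, seq_len, h, d_k).transpose(1, 2)
k = k.view(batch, seq_len, h, d_k).transpose(1, 2)
v = v.view(batch, seq_len, h, d_k).transpose(1, 2)

attn = (q @ k.transpose(-2, -1) / d_k ** 0.5).softmax(dim=-1)
heads = attn @ v
# Concatenate the heads and fuse back to hidden_size.
out = w_o(heads.transpose(1, 2).reshape(batch, seq_len, hidden_size))
```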

Feed‑Forward Neural Network (FFN)

The FFN provides the Transformer's position‑wise non‑linear transformation (apart from the Softmax inside attention, it is the only non‑linearity). It consists of:

Linear expansion: W1 ∈ [hidden_size, 4·hidden_size].

ReLU activation (or GELU in later models).

Linear compression: W2 ∈ [4·hidden_size, hidden_size].

import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Feed-forward network used in Transformers: expand, activate, compress."""
    def __init__(self, dim: int, hidden_dim: int, dropout: float):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # expansion
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # compression
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        return self.dropout(self.w2(F.relu(self.w1(x))))

Residual Connection & Layer Normalization

Each sub‑layer output is added to its input (residual) before applying LayerNorm, which normalizes across the feature dimension of each sample, stabilizing training for deep stacks.

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Layer-norm: normalize over the feature dimension, then apply a learned
    scale (a_2) and shift (b_2)."""
    def __init__(self, features, eps=1e-6):
        super().__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps
    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
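The normalization effect can be checked with PyTorch's built‑in nn.LayerNorm (equivalent up to how eps is applied): each sample's features come out with near‑zero mean and near‑unit spread regardless of the input's scale and shift:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 16) * 5 + 3   # arbitrary per-feature scale and shift
ln = nn.LayerNorm(16)            # built-in layer norm over the last dim
y = ln(x)                        # normalized, then affine (initially identity)
```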

Decoder

Autoregressive Generation

The Decoder generates tokens one by one, feeding each newly produced token back as input for the next step.

Start with a <Begin> token.

Apply masked self‑attention on already generated tokens.

Apply cross‑attention using Encoder outputs as K/V.

Project to vocabulary logits, apply Softmax, sample the next token.

Repeat until <End> is produced.
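The loop above can be sketched with a stand‑in decoder: a hypothetical function that returns fixed logits where a real model would run masked self‑attention and cross‑attention. The token IDs are arbitrary:

```python
import torch

BEGIN, END, VOCAB = 0, 1, 10   # illustrative special-token IDs / vocab size

def fake_decoder_logits(tokens: list) -> torch.Tensor:
    # Stand-in for the real decoder: deterministically emits (last + 2)
    # until the value reaches 8, then emits <End>.
    nxt = tokens[-1] + 2
    logits = torch.full((VOCAB,), -1e9)
    logits[END if nxt >= 8 else nxt] = 0.0
    return logits

tokens = [BEGIN]                               # start with <Begin>
while tokens[-1] != END and len(tokens) < 20:
    logits = fake_decoder_logits(tokens)       # project to vocabulary logits
    next_id = logits.softmax(-1).argmax().item()  # greedy pick from Softmax
    tokens.append(next_id)                     # feed back as next-step input
```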

Masked Multi‑Head Attention

Masking prevents a token from attending to future positions, ensuring causal generation. The mask adds large negative values to the strictly upper‑triangular score positions (the future tokens) before Softmax, driving their attention weights to zero.
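A minimal causal mask built with torch.triu (sizes illustrative); with uniform scores, row i attends evenly over positions 0..i and not beyond:

```python
import torch

seq_len = 4
# Strictly upper-triangular positions (future tokens) receive a large
# negative value so Softmax drives their weights to (near) zero.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1) * -1e9

scores = torch.zeros(seq_len, seq_len)   # stand-in attention scores
p_attn = (scores + mask).softmax(dim=-1)
```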

Attention Optimizations

Modern LLMs adopt several optimizations to reduce the O(n²) cost of vanilla self‑attention:

Sparse Attention : compute attention only for a subset of token pairs (e.g., local windows, atrous patterns).

Flash Attention : keep intermediate results in fast SRAM to cut global memory traffic.

Paged Attention : paginates the KV cache into fixed‑size blocks to mitigate memory fragmentation (vLLM).

Radix Attention : reorganize cache layout for higher hit rates (SGLang).

Variations of multi‑head designs include:

MHA : independent Q/K/V per head (standard).

MQA : shared K/V across heads, minimal KV cache.

GQA : groups of heads share K/V, balancing efficiency and performance.

MLA : each head has its own K/V but projects to a shared latent space, reducing cache while retaining quality.
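Back‑of‑envelope KV‑cache arithmetic makes the trade‑off concrete. The numbers below are assumptions, loosely modeled on a 7B‑class model (32 layers, 32 heads, head dimension 128, fp16 weights), not figures from any specific system:

```python
# Assumed model shape (illustrative): 32 heads, head dim 128, 32 layers, fp16.
n_heads, head_dim, n_layers, bytes_per = 32, 128, 32, 2

def kv_bytes_per_token(kv_heads: int) -> int:
    # 2x for K and V, stored at every layer.
    return 2 * kv_heads * head_dim * n_layers * bytes_per

mha = kv_bytes_per_token(32)  # every head keeps its own K/V
gqa = kv_bytes_per_token(8)   # 8 KV groups, each shared by 4 heads
mqa = kv_bytes_per_token(1)   # one K/V shared by all heads
```

Under these assumptions MQA shrinks the per‑token KV cache 32x versus MHA, and GQA with 8 groups sits in between at 4x, which is the efficiency/quality balance the bullet above describes.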

Attention optimization overview

Conclusion

The Transformer’s combination of parallelizable attention, multi‑head diversity, feed‑forward non‑linearity, residual pathways, and layer normalization makes it a powerful universal sequence model. Ongoing research focuses on sparsifying attention, improving cache efficiency, and designing new head‑sharing schemes to scale LLMs to ever larger contexts.

Tags: deep learning, Transformer, Attention, Positional Encoding, Self-attention, Multi-Head Attention, Feed-Forward Network
Written by AI Cyberspace

AI, big data, cloud computing, and networking.