Demystifying the Transformer: From Input Embedding to Multi‑Head Attention
This article breaks down the core components of the Transformer encoder: input embedding, positional encoding, multi‑head self‑attention, residual connections with layer normalization, position‑wise feed‑forward networks, and the rationale behind stacking multiple encoder layers, with clear explanations and short code sketches.
1. Input Embedding
Each token t_i from the input sequence is mapped to a dense vector using an embedding matrix E \in \mathbb{R}^{V \times d_{model}} (where V is the vocabulary size and d_{model}=512 in the original Transformer). The embedding for token t_i is x_i = E[t_i]; in the original paper these embeddings are additionally scaled by \sqrt{d_{model}}. This converts discrete symbols into continuous representations that the network can manipulate with standard linear‑algebra operations.
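As a minimal sketch of this lookup in PyTorch (the variable names and the toy vocabulary size are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # V (illustrative) and d_model
embedding = nn.Embedding(vocab_size, d_model)  # learnable E in R^{V x d_model}

token_ids = torch.tensor([[5, 42, 7]])         # toy batch of token indices t_i
x = embedding(token_ids)                       # x_i = E[t_i]; shape (1, 3, 512)
x = x * d_model ** 0.5                         # sqrt(d_model) scaling from the paper
```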
2. Positional Encoding
Because the self‑attention mechanism has no inherent notion of order, a deterministic positional vector is added to each token embedding. For position pos and dimension i:
PE_{pos,2i} = sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{pos,2i+1} = cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
The final input to the encoder is z_i = x_i + PE_{pos(i)}, allowing the model to distinguish token order.
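A small sketch of how this table could be built, assuming PyTorch; `sinusoidal_positional_encoding` is a hypothetical helper name:

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) table of the PE values defined above."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # positions
    i2 = torch.arange(0, d_model, 2, dtype=torch.float32)          # the 2i index
    denom = torch.pow(10_000.0, i2 / d_model)                      # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / denom)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(pos / denom)   # odd dimensions use cos
    return pe

# z_i = x_i + PE_pos(i): broadcast-add the table onto the embeddings
# z = x + sinusoidal_positional_encoding(x.size(1), x.size(2))
```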
3. Multi‑Head Self‑Attention
For each encoder layer the input matrix X \in \mathbb{R}^{L \times d_{model}} (where L is the sequence length) is linearly projected to queries, keys and values: Q = XW_Q, K = XW_K, V = XW_V where W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_k} and d_k = d_v = d_{model}/h with h=8 heads (so d_k = 64 in the original model).
For each head the attention scores are computed as scaled dot‑product:
Attention(Q,K,V) = softmax\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
The outputs of the h heads are concatenated and projected back to d_{model} dimensions: MultiHead(X) = Concat(head_1, …, head_h) W_O, where W_O \in \mathbb{R}^{hd_v \times d_{model}}. This allows the model to attend to information from multiple representation sub‑spaces simultaneously.
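Putting the projections, the scaled dot‑product, and the output projection together, a simplified PyTorch sketch might look like this (the class name is illustrative, and the per‑head projections are fused into one matrix per role, which is mathematically equivalent):

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of multi-head self-attention (h=8, d_model=512, so d_k=64)."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # One fused projection per role; equivalent to per-head W_Q, W_K, W_V
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)   # output projection

    def forward(self, X: torch.Tensor) -> torch.Tensor:  # X: (B, L, d_model)
        B, L, _ = X.shape
        # Project, then split into h heads: (B, h, L, d_k)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, L, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_Q(X)), split(self.W_K(X)), split(self.W_V(X))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)  # (B, h, L, L)
        attn = torch.softmax(scores, dim=-1)                    # attention weights
        out = attn @ V                                          # (B, h, L, d_k)
        out = out.transpose(1, 2).reshape(B, L, -1)             # concat heads
        return self.W_O(out)                                    # back to d_model
```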
4. Add & Norm (Residual Connection + Layer Normalization)
After the attention sub‑layer, a residual connection adds the original input X to the attention output, and layer normalization then stabilizes the distribution:
Z_1 = LayerNorm(X + MultiHead(X))
This pattern is repeated after the feed‑forward sub‑layer, helping gradients flow through deep stacks.
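A minimal sketch of this sub‑layer wrapper (post‑LN ordering, as in the original paper; dropout omitted for brevity, and the class name is illustrative):

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        return self.norm(x + sublayer_out)   # Z = LayerNorm(X + Sublayer(X))
```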
5. Position‑wise Feed‑Forward Network
The feed‑forward network is applied independently and identically to each position:
FFN(z) = max(0, zW_1 + b_1)W_2 + b_2
where W_1 \in \mathbb{R}^{d_{model} \times d_{ff}} with d_{ff}=2048, W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}, and max(0,·) denotes the ReLU activation. This expands the representation into a wider space, introduces non‑linearity, and then projects back to the original dimension; the ReLU zeroes out negative activations, which can be read as suppressing less useful features.
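As a sketch (class name illustrative):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(z) = max(0, z W1 + b1) W2 + b2, applied to each position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # W1, b1: expand to d_ff
        self.fc2 = nn.Linear(d_ff, d_model)   # W2, b2: project back to d_model
        self.relu = nn.ReLU()                 # max(0, .)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.relu(self.fc1(z)))
```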
6. Stacking Encoder Layers
An encoder layer consists of the four sub‑modules described above (self‑attention → Add&Norm → feed‑forward → Add&Norm). In the original paper six such layers are stacked sequentially:
X^{(0)} = input embeddings + positional encodings
for l = 1 … N:
    X^{(l)} = EncoderLayer(X^{(l-1)})
where N=6 in the baseline model. Modern large‑scale Transformers often use dozens or even hundreds of layers; the exact number is a hyper‑parameter chosen based on computational budget and empirical performance.
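Composing the sketches from the earlier sections into a full encoder stack might look like the following (again illustrative; dropout and other training details from the paper are omitted):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention -> Add&Norm -> FFN -> Add&Norm."""
    def __init__(self, d_model: int = 512, h: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, h)  # sketch from section 3
        self.ffn = PositionwiseFFN(d_model, d_ff)       # sketch from section 5
        self.addnorm1 = AddNorm(d_model)                # sketch from section 4
        self.addnorm2 = AddNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.addnorm1(x, self.attn(x))
        return self.addnorm2(x, self.ffn(x))

class Encoder(nn.Module):
    """N stacked encoder layers (N=6 in the baseline model)."""
    def __init__(self, N: int = 6, d_model: int = 512):
        super().__init__()
        self.layers = nn.ModuleList(EncoderLayer(d_model) for _ in range(N))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:        # X^(l) = EncoderLayer(X^(l-1))
            x = layer(x)
        return x
```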
In summary, the Transformer encoder converts raw token sequences into high‑dimensional vectors via learned embeddings and sinusoidal positional encodings, repeatedly refines these vectors through multi‑head self‑attention, residual connections, layer normalization, and position‑wise feed‑forward networks, and stacks several such layers to build deep contextual representations that capture complex semantic relationships.