Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

This article breaks down the core components of the Transformer architecture—including input embedding, positional encoding, multi‑head self‑attention, residual connections with layer normalization, position‑wise feed‑forward networks, and the rationale behind stacking multiple encoder layers—using clear explanations and illustrative diagrams.


1. Input Embedding

Each token t_i from the input sequence is mapped to a dense vector using an embedding matrix E \in \mathbb{R}^{V \times d_{model}} (where V is the vocabulary size and d_{model}=512 in the original Transformer). The embedding for token t_i is x_i = E[t_i]. This converts discrete symbols into continuous vector representations that standard linear-algebra operations can process.
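The lookup above can be sketched in a few lines of NumPy. The sizes here (V=10, d_model=8) are toy values for illustration; in practice E is a learned parameter, not a random draw.

```python
import numpy as np

# Toy sizes for illustration: vocabulary V=10, d_model=8.
V, d_model = 10, 8
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d_model))  # embedding matrix; learned during training

token_ids = np.array([3, 1, 7])    # token indices t_i for a 3-token sequence
X = E[token_ids]                   # row lookup: x_i = E[t_i]
print(X.shape)                     # one d_model-dimensional vector per token
```

The "embedding" is nothing more than this row selection, which is why it is differentiable with respect to E.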

[Figure: Input embedding illustration]

2. Positional Encoding

Because the self‑attention mechanism has no inherent notion of order, a deterministic positional vector is added to each token embedding. For position pos and dimension i:

PE_{pos,2i}   = sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
PE_{pos,2i+1} = cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)

The final input to the encoder is z_i = x_i + PE_{pos(i)}, allowing the model to distinguish token order.
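The two sinusoid formulas above can be implemented directly; a minimal sketch, assuming an even d_model:

```python
import numpy as np

def positional_encoding(L, d_model):
    """Sinusoidal positional encoding: even dims use sin, odd dims use cos."""
    pos = np.arange(L)[:, None]               # positions 0..L-1, shape (L, 1)
    two_i = np.arange(0, d_model, 2)[None, :] # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((L, d_model))
    pe[:, 0::2] = np.sin(angles)              # PE_{pos, 2i}
    pe[:, 1::2] = np.cos(angles)              # PE_{pos, 2i+1}
    return pe

pe = positional_encoding(50, 512)
# z_i = x_i + pe[i] is added to each embedding before the first encoder layer
```

Because the encoding is deterministic, it needs no training and extends to positions longer than those seen during training.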

[Figure: Positional encoding diagram]

3. Multi‑Head Self‑Attention

For each encoder layer the input matrix X \in \mathbb{R}^{L \times d_{model}} (where L is the sequence length) is linearly projected to queries, keys and values: Q = XW_Q, K = XW_K, V = XW_V where W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_k} and d_k = d_v = d_{model}/h with h=8 heads (so d_k = 64 in the original model).

For each head the attention scores are computed as scaled dot‑product:

Attention(Q,K,V) = softmax\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V

The outputs of the h heads are concatenated and projected back to d_{model} dimensions: MultiHead(X) = Concat(head_1, …, head_h) W_O where W_O \in \mathbb{R}^{hd_v \times d_{model}}. This allows the model to attend to information from multiple representation sub‑spaces simultaneously.
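The projection, per-head attention, and final concatenation can be sketched as follows. This is a simplified illustration, not the paper's implementation: the per-head matrices are modeled as column slices of full d_model × d_model projections, and all weight names and sizes are made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Scaled dot-product attention over h heads (illustrative sketch)."""
    L, d_model = X.shape
    d_k = d_model // h
    out_heads = []
    for head in range(h):
        sl = slice(head * d_k, (head + 1) * d_k)   # this head's column slice
        Q, K, V = X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]
        weights = softmax(Q @ K.T / np.sqrt(d_k))  # (L, L) attention weights
        out_heads.append(weights @ V)              # (L, d_k) head output
    # Concat(head_1, ..., head_h) W_O: back to (L, d_model)
    return np.concatenate(out_heads, axis=-1) @ W_O

rng = np.random.default_rng(1)
L, d_model, h = 5, 16, 4                           # toy sizes for illustration
X = rng.normal(size=(L, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(Y.shape)  # (5, 16)
```

Note the output has the same shape as the input, which is what makes the residual connection in the next section possible.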

[Figure: Multi‑head attention diagram]

4. Add & Norm (Residual Connection + Layer Normalization)

After the attention sub‑layer, a residual connection adds the original input X to the attention output, and layer normalization then stabilizes the distribution:

Z_1 = LayerNorm(X + MultiHead(X))

The same pattern is repeated after the feed‑forward sub‑layer, helping gradients flow through deep stacks.
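The Add & Norm step is short enough to write out directly. A minimal sketch, using a random array as a stand-in for the attention output and omitting the learned scale/shift (gamma, beta) parameters a real implementation would carry:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 16))             # sub-layer input
sublayer_out = rng.normal(size=(5, 16))  # stand-in for MultiHead(X)
Z1 = layer_norm(X + sublayer_out)        # Z_1 = LayerNorm(X + MultiHead(X))
```

Normalizing over the feature axis (not the batch) is what distinguishes layer normalization from batch normalization.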

5. Position‑wise Feed‑Forward Network

The feed‑forward network is applied independently to each position:

FFN(z) = max(0, zW_1 + b_1)W_2 + b_2

where W_1 \in \mathbb{R}^{d_{model} \times d_{ff}} with d_{ff}=2048, W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}, and max(0,·) denotes the ReLU activation. This expands the representation to a wider hidden dimension, introduces non‑linearity, and then projects back to the original dimension.
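The two matrix multiplications and the ReLU translate directly to code. Sizes below are toy values (the paper uses d_model=512, d_ff=2048), and the weight initialization is arbitrary:

```python
import numpy as np

def ffn(z, W1, b1, W2, b2):
    """FFN(z) = max(0, z W1 + b1) W2 + b2, applied per position."""
    return np.maximum(0.0, z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(3)
L, d_model, d_ff = 5, 16, 64            # toy sizes for illustration
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
out = ffn(rng.normal(size=(L, d_model)), W1, b1, W2, b2)
print(out.shape)  # (5, 16): same shape as the input, as the residual requires
```

Because z is multiplied row by row, the same weights are shared across all positions; no information flows between positions in this sub-layer.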

[Figure: Feed‑forward network diagram]

6. Stacking Encoder Layers

An encoder layer consists of the four sub‑modules described above (self‑attention → Add&Norm → feed‑forward → Add&Norm). In the original paper six such layers are stacked sequentially:

X^{(0)} = input embeddings + positional encodings
for l = 1 … N:
    X^{(l)} = EncoderLayer(X^{(l-1)})

where N=6 in the baseline model. Modern large‑scale Transformers often use dozens or even hundreds of layers; the exact number is a hyper‑parameter chosen based on computational budget and empirical performance.
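The stacking loop above can be made concrete. In this sketch each "layer" is deliberately reduced to a single linear map plus residual and normalization, so the X^{(l)} = EncoderLayer(X^{(l-1)}) pattern stays visible; a real encoder layer contains the full attention and feed-forward sub-layers described earlier, and every name here is illustrative.

```python
import numpy as np

def encoder_layer(X, W):
    """Toy stand-in for one encoder layer: linear map + residual + norm."""
    h = X + X @ W                                  # residual connection
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + 1e-5)          # layer norm (no affine)

rng = np.random.default_rng(4)
N, L, d_model = 6, 5, 16                           # N=6 layers, toy sizes
X = rng.normal(size=(L, d_model))                  # embeddings + positions
layers = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(N)]
for W in layers:                                   # X^{(l)} = EncoderLayer(X^{(l-1)})
    X = encoder_layer(X, W)
print(X.shape)  # (5, 16): shape is preserved through every layer
```

Shape preservation at every layer is the key structural property: it is what allows an arbitrary number of identical layers to be stacked.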

[Figure: Stacked encoder layers diagram]

In summary, the Transformer encoder converts raw token sequences into high‑dimensional vectors via learned embeddings and sinusoidal positional encodings, repeatedly refines these vectors through multi‑head self‑attention, residual connections, layer normalization, and position‑wise feed‑forward networks, and finally stacks multiple such layers to build deep contextual representations capable of capturing complex semantic relationships.

Tags: deep learning, Transformer, Positional Encoding, Multi-Head Attention, Add & Norm, Feed Forward, Input Embedding
Written by AI Architecture Hub

Focused on sharing high-quality AI content and practical implementation, helping readers learn with fewer missteps.