Understanding the Transformer Architecture: Encoder, Decoder, and Attention Mechanisms
This article explains the Transformer model, comparing it with RNNs, detailing its encoder‑decoder structure, multi‑head and scaled dot‑product attention, embedding layers, feed‑forward networks, and the final linear‑softmax output, supplemented with diagrams and code examples.
What Is a Transformer?
The Transformer is currently the most popular feature extractor in deep learning, having replaced RNNs in most scenarios due to its parallelism and constant token‑to‑token distance.
Transformer vs RNN
RNN: cannot be parallelized because each step depends on the previous hidden state, leading to slow training; it also suffers from vanishing/exploding gradients when token distances are long.
Transformer: processes all tokens in parallel and treats the distance between any two tokens as 1, which makes it both fast and effective at capturing long-range dependencies.
Understanding the Transformer
The model can be visualized as a box that receives an input, passes through an encoder‑decoder pair, and produces an output.
```mermaid
graph LR
    A[Input: Machine Learning] --> B["Model: Transformer (Encoder + Decoder)"] --> C[Output: Machine Learning]
```

Example: Chinese sentence → English translation.
```mermaid
graph LR
    A[我爱购物] --> B[Transformer Model] --> C[I love shopping]
```

The box expands into an encoder and a decoder, each composed of six identical layers.
Overall Architecture Diagram
The model follows an encoder‑decoder structure. The encoder receives an input sequence x = (x₁,…,xₙ) and outputs z = (z₁,…,zₙ). The decoder generates an output sequence y = (y₁,…,yₘ) one token at a time, using previously generated tokens as additional input.
Both encoder and decoder consist of self‑attention and feed‑forward sub‑layers. The original paper "Attention Is All You Need" shows that each side contains six identical blocks.
Encoder
Each encoder layer has two sub‑layers:
Multi‑Head Attention
Position‑wise Feed Forward Network (two linear transformations with a ReLU activation in between)
Each sub‑layer is wrapped with an Add&Norm (residual connection followed by layer normalization).
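The Add&Norm wrapper can be sketched in a few lines of NumPy. This is a simplified illustration (the learnable gain and bias parameters of layer normalization are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer_out)

# Example: a (seq_len=2, d_model=4) input with a zero sublayer output
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, 0.5, 0.5, 0.5]])
out = add_and_norm(x, np.zeros_like(x))
```

The residual path lets gradients flow around each sub-layer, which is part of why stacking six blocks trains stably.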
Decoder
The decoder adds a third sub‑layer to each block:
Masked Multi‑Head Attention (prevents a position from attending to future positions)
Cross Multi‑Head Attention (queries come from the decoder, keys and values come from the encoder output)
Add&Norm after each sub‑layer, similar to the encoder
Masked Self‑Attention
A mask is applied so that the model cannot look ahead during training; two types of masks are used: Padding Mask (applied in all scaled‑dot‑product attention) and Sequence Mask (used only in the decoder’s self‑attention).
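The two mask types can be sketched as boolean matrices, where `True` marks positions that may be attended to. This is an illustrative NumPy version; `pad_id=0` is an assumed convention:

```python
import numpy as np

def padding_mask(token_ids, pad_id=0):
    # True at real tokens, False at padding positions
    return token_ids != pad_id

def sequence_mask(size):
    # Lower-triangular matrix: position i may attend only to positions <= i,
    # preventing the decoder from looking ahead during training
    return np.tril(np.ones((size, size), dtype=bool))

m = sequence_mask(3)
p = padding_mask(np.array([5, 7, 0]))
```

In the decoder's self-attention, both masks are combined; masked positions are set to a large negative value before the softmax so their attention weights become effectively zero.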
Input Embeddings
Both encoder and decoder inputs are built from a token embedding plus a positional embedding, producing the final input vectors.
Word embeddings convert words to unique vectors, while positional embeddings encode the position of each token using sinusoidal functions.
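The sinusoidal scheme from the original paper can be sketched as follows: even dimensions use sine and odd dimensions use cosine, with wavelengths forming a geometric progression.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
# The model's input is the token embedding plus its positional encoding row
```

Because the encoding depends only on position and dimension, the same table works for any sequence up to `max_len`, with no learned parameters.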
Attention Mechanism
Attention maps a query vector and a set of key‑value pairs to an output, which is a weighted sum of the values. The weights are computed by a compatibility function between the query and each key.
Multi‑Head Attention
Both encoder and decoder contain multi‑head attention modules; each decoder layer has two such modules (masked self‑attention and encoder‑decoder cross‑attention).
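Multi-head attention projects the input, splits the model dimension into several subspaces, attends in each, and concatenates the results. A minimal single-sequence NumPy sketch (the projection matrices here are random stand-ins for the learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention per head
    d_k = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d_k)) @ v

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def project_and_split(W):
        # Project, then split d_model into num_heads subspaces of size d_head
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)
    heads = attention(q, k, v)                               # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                       # final output projection

rng = np.random.default_rng(0)
d_model, h = 8, 2
x = rng.normal(size=(4, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, h, Wq, Wk, Wv, Wo)
```

Each head can learn to focus on a different kind of relationship between tokens, which is the motivation for using several smaller heads instead of one large one.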
Scaled Dot‑Product Attention
Each head computes attention by taking the dot product of queries and keys, scaling by √dₖ, applying a softmax, and then weighting the values.
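The formula Attention(Q, K, V) = softmax(QKᵀ / √d_k) V translates directly into code. A small illustrative example, with an optional mask as used in the decoder:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score -> ~zero weight
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# One query attends over two keys; the matching key wins after softmax
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishingly small gradients.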
Feed‑Forward Network
After attention, each layer applies a fully‑connected feed‑forward network consisting of two linear transformations with a ReLU activation in between, applied independently to each position.
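The position-wise FFN is FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. A brief sketch showing that each position is transformed independently (dimensions here are illustrative; the paper uses d_model = 512 and d_ff = 2048):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
d_model, d_ff = 4, 16
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))       # 3 positions
y = feed_forward(x, W1, b1, W2, b2)
```

Because the same weights apply to every position with no interaction across positions, mixing information between tokens is left entirely to the attention sub-layers.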
Encoder Diagram
Decoder Diagram
The decoder mirrors the encoder but includes masked self‑attention, cross‑attention, and a final softmax layer that predicts the next token.
Linear and Softmax
The decoder output is passed through a linear layer that projects it to the vocabulary size, followed by a softmax to obtain a probability distribution over possible next words.
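This final step can be sketched as a single matrix multiply plus a softmax (the projection weights here are random stand-ins for the learned output matrix):

```python
import numpy as np

def project_to_vocab(decoder_out, W, b):
    # Linear projection to vocabulary size, then softmax over the vocabulary
    logits = decoder_out @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 100
probs = project_to_vocab(rng.normal(size=(1, d_model)),
                         rng.normal(size=(d_model, vocab_size)),
                         np.zeros(vocab_size))
next_token = probs.argmax(axis=-1)   # greedy choice of the next word
```

At inference time this is repeated autoregressively: the chosen token is appended to the decoder input and the process runs again until an end-of-sequence token is produced.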
References
https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
https://www.yiyibooks.cn/yiyibooks/Attention_Is_All_You_Need/index.html
https://www.u72.net/chengxu/show-105186.html