Understanding the Transformer Architecture: Encoder, Decoder, and Attention Mechanisms
This article explains the Transformer model, comparing it with RNNs, detailing its encoder‑decoder structure, multi‑head and scaled dot‑product attention, embedding layers, feed‑forward networks, and the final linear‑softmax output, supplemented with diagrams and code examples.
What Is a Transformer?
The Transformer is currently the most popular feature extractor in deep learning, having replaced RNNs in most scenarios due to its parallelism and constant token‑to‑token distance.
Transformer vs RNN
RNN: cannot be parallelized because each step depends on the previous hidden state, leading to slow training; it also suffers from vanishing/exploding gradients when token distances are long.
Transformer: processes all tokens in parallel and treats the distance between any two tokens as 1, which makes it both fast and effective at capturing long-range dependencies.
Understanding the Transformer
The model can be visualized as a box that receives an input, passes through an encoder‑decoder pair, and produces an output.
```mermaid
graph LR
    A[Input: Machine Learning] --> B["Model: Transformer (Encoder + Decoder)"] --> C[Output: Machine Learning]
```

Example: Chinese sentence → English translation.
```mermaid
graph LR
    A[我爱购物] --> B[Transformer Model] --> C[I love shopping]
```

The box expands into an encoder and a decoder, each composed of six identical layers.
Overall Architecture Diagram
The model follows an encoder‑decoder structure. The encoder receives an input sequence x = (x₁,…,xₙ) and outputs z = (z₁,…,zₙ). The decoder generates an output sequence y = (y₁,…,yₘ) one token at a time, using previously generated tokens as additional input.
Both encoder and decoder consist of self‑attention and feed‑forward sub‑layers. The original paper "Attention Is All You Need" shows that each side contains six identical blocks.
Encoder
Each encoder layer has two sub‑layers:
Multi‑Head Attention
Position‑wise Feed Forward Network (two linear transformations with a ReLU activation in between)
Each sub‑layer is wrapped with an Add&Norm (residual connection followed by layer normalization).
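The Add&Norm wrapper can be sketched in a few lines of NumPy. This is a simplified illustration (the learnable gain and bias parameters of layer normalization are omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer_out)

# Example: a (seq_len=2, d_model=4) input with a zero sublayer output
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, 0.5, 0.5, 0.5]])
out = add_and_norm(x, np.zeros_like(x))
```

The residual path lets gradients flow around each sub-layer, which is part of why stacking six blocks trains stably.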
Decoder
The decoder adds a third sub‑layer to each block:
Masked Multi‑Head Attention (prevents a position from attending to future positions)
Cross Multi‑Head Attention (queries come from the decoder, keys and values come from the encoder output)
Add&Norm after each sub‑layer, similar to the encoder
Masked Self‑Attention
A mask is applied so that the model cannot look ahead during training; two types of masks are used: Padding Mask (applied in all scaled‑dot‑product attention) and Sequence Mask (used only in the decoder’s self‑attention).
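The two mask types can be sketched as boolean matrices, where `True` marks positions that may be attended to. This is an illustrative NumPy version; `pad_id=0` is an assumed convention:

```python
import numpy as np

def padding_mask(token_ids, pad_id=0):
    # True at real tokens, False at padding positions
    return token_ids != pad_id

def sequence_mask(size):
    # Lower-triangular matrix: position i may attend only to positions <= i,
    # preventing the decoder from looking ahead during training
    return np.tril(np.ones((size, size), dtype=bool))

m = sequence_mask(3)
p = padding_mask(np.array([5, 7, 0]))
```

In the decoder's self-attention, both masks are combined; masked positions are set to a large negative value before the softmax so their attention weights become effectively zero.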
Input Embeddings
Both encoder and decoder inputs are built from a token embedding plus a positional embedding, producing the final input vectors.
Word embeddings convert words to unique vectors, while positional embeddings encode the position of each token using sinusoidal functions.
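The sinusoidal scheme from the original paper can be sketched as follows: even dimensions use sine and odd dimensions use cosine, with wavelengths forming a geometric progression.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
# The model's input is the token embedding plus its positional encoding row
```

Because the encoding depends only on position and dimension, the same table works for any sequence up to `max_len`, with no learned parameters.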
Attention Mechanism
Attention maps a query vector and a set of key‑value pairs to an output, which is a weighted sum of the values. The weights are computed by a compatibility function between the query and each key.
Multi‑Head Attention
Both encoder and decoder contain multi‑head attention modules; each decoder layer has two such modules (masked self‑attention and encoder‑decoder cross‑attention).
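Multi-head attention projects the input, splits the model dimension into several subspaces, attends in each, and concatenates the results. A minimal single-sequence NumPy sketch (the projection matrices here are random stand-ins for the learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention per head
    d_k = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d_k)) @ v

def multi_head_attention(x, num_heads, Wq, Wk, Wv, Wo):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def project_and_split(W):
        # Project, then split d_model into num_heads subspaces of size d_head
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)
    heads = attention(q, k, v)                               # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                       # final output projection

rng = np.random.default_rng(0)
d_model, h = 8, 2
x = rng.normal(size=(4, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, h, Wq, Wk, Wv, Wo)
```

Each head can learn to focus on a different kind of relationship between tokens, which is the motivation for using several smaller heads instead of one large one.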
Scaled Dot‑Product Attention
Each head computes attention by taking the dot product of queries and keys, scaling by √dₖ, applying a softmax, and then weighting the values.
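The formula Attention(Q, K, V) = softmax(QKᵀ / √d_k) V translates directly into code. A small illustrative example, with an optional mask as used in the decoder:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative score -> ~zero weight
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# One query attends over two keys; the matching key wins after softmax
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishingly small gradients.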
Feed‑Forward Network
After attention, each layer applies a fully‑connected feed‑forward network consisting of two linear transformations with a ReLU activation in between, applied independently to each position.
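The position-wise FFN is FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. A brief sketch showing that each position is transformed independently (dimensions here are illustrative; the paper uses d_model = 512 and d_ff = 2048):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
d_model, d_ff = 4, 16
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))       # 3 positions
y = feed_forward(x, W1, b1, W2, b2)
```

Because the same weights apply to every position with no interaction across positions, mixing information between tokens is left entirely to the attention sub-layers.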
Encoder Diagram
Decoder Diagram
The decoder mirrors the encoder but includes masked self‑attention, cross‑attention, and a final softmax layer that predicts the next token.
Linear and Softmax
The decoder output is passed through a linear layer that projects it to the vocabulary size, followed by a softmax to obtain a probability distribution over possible next words.
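This final step can be sketched as a single matrix multiply plus a softmax (the projection weights here are random stand-ins for the learned output matrix):

```python
import numpy as np

def project_to_vocab(decoder_out, W, b):
    # Linear projection to vocabulary size, then softmax over the vocabulary
    logits = decoder_out @ W + b
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 100
probs = project_to_vocab(rng.normal(size=(1, d_model)),
                         rng.normal(size=(d_model, vocab_size)),
                         np.zeros(vocab_size))
next_token = probs.argmax(axis=-1)   # greedy choice of the next word
```

At inference time this is repeated autoregressively: the chosen token is appended to the decoder input and the process runs again until an end-of-sequence token is produced.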
References
https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
https://www.yiyibooks.cn/yiyibooks/Attention_Is_All_You_Need/index.html
https://www.u72.net/chengxu/show-105186.html