Unveiling Transformer Internals: From Theory to PyTorch Code

This article examines the Transformer architecture by combining the principles of the original paper with the PyTorch source code. It covers encoder‑decoder design, positional encoding assumptions, core parameters, residual connections, attention mechanisms, and implementation snippets to help readers understand and reproduce the model.

Data Party THU

Introduction

When discussing large models, the Transformer architecture is a pivotal milestone. It is the backbone of most modern large models, and understanding its design is essential for grasping contemporary AI systems.

Overall Design

The Transformer consists of two main components: an Encoder on the left and a Decoder on the right. The encoder converts a complete source sequence into a rich semantic representation, while the decoder generates the target sequence token by token, using both the previously generated tokens and the encoder’s output.

Positional Encoding and Its Assumptions

Because the Transformer is inherently insensitive to token order, a Positional Encoding vector must be added to each input embedding. Three key assumptions guide its design:

Determinism : the encoding for a given position must be a fixed numeric value, identical across different sequences.

Relative‑distance consistency : the relative distance between any two positions should remain consistent across sentences.

Generalization : the encoding should extrapolate to longer, unseen sequences.

To satisfy these, the original paper uses a combination of sine and cosine functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

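With this formulation, for any fixed offset k the encoding of position pos+k is a linear function of the encoding of position pos. The paper's authors hypothesized that this makes it easy for the model to attend by relative position, and that it may help the model extrapolate to sequences longer than those seen in training.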
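The two formulas and the linear-shift property can be checked numerically. The following is a minimal sketch, not the code from any particular library; the helper names inverse_frequencies, sinusoidal_positional_encoding, and shift_encoding are hypothetical, and an even d_model is assumed.

```python
import math
import torch

def inverse_frequencies(d_model: int) -> torch.Tensor:
    """1 / 10000^(2i/d_model) for i = 0 .. d_model/2 - 1."""
    two_i = torch.arange(0, d_model, 2, dtype=torch.float64)
    return torch.exp(-math.log(10000.0) * two_i / d_model)

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table: even columns sin, odd columns cos."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    position = torch.arange(max_len, dtype=torch.float64).unsqueeze(1)  # (max_len, 1)
    angles = position * inverse_frequencies(d_model)                    # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

def shift_encoding(pe_row: torch.Tensor, k: int) -> torch.Tensor:
    """Compute PE(pos + k) from PE(pos) alone, via the angle-addition identities."""
    freqs = inverse_frequencies(pe_row.numel())
    sin_p, cos_p = pe_row[0::2], pe_row[1::2]
    sin_k, cos_k = torch.sin(k * freqs), torch.cos(k * freqs)
    shifted = torch.empty_like(pe_row)
    shifted[0::2] = sin_p * cos_k + cos_p * sin_k   # sin(a + b)
    shifted[1::2] = cos_p * cos_k - sin_p * sin_k   # cos(a + b)
    return shifted

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# Compare the row for position 8 with the row for position 3 shifted by k = 5.
print(torch.allclose(shift_encoding(pe[3], k=5), pe[8]))
```

The coefficients in shift_encoding depend only on the offset k, not on the starting position, which is what "linear function of the encoding of position pos" means in practice.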

Residual Connections Preserve Positional Information

Even though positional encodings are added only at the lowest layer, residual connections carry this information through all subsequent layers. For an N‑layer network with input x₀ (including positional encoding), and ignoring layer normalization, the output of layer i is:

x_i = x₀ + Σ_{j=1}^{i} F_j(x_{j‑1})

Thus the original positional signal x₀ remains present as an additive term at every depth.
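The residual identity can be verified on a toy stack. This is a sketch under simplifying assumptions: the sublayers F_j are stand-in nn.Linear modules rather than real attention or feed-forward blocks, and layer normalization is left out, so it illustrates the pure residual sum only.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, num_layers = 8, 4

# Hypothetical stand-ins for the sublayers F_j; any shape-preserving module works.
sublayers = [nn.Linear(d_model, d_model) for _ in range(num_layers)]

x0 = torch.randn(2, 5, d_model)   # stands for embedding + positional encoding
x = x0
updates = []
with torch.no_grad():
    for f in sublayers:
        delta = f(x)              # F_j(x_{j-1})
        updates.append(delta)
        x = x + delta             # x_j = x_{j-1} + F_j(x_{j-1})

# Unrolled form: x_N = x0 + sum_j F_j(x_{j-1})
reconstructed = x0 + torch.stack(updates).sum(dim=0)
print(torch.allclose(x, reconstructed, atol=1e-5))
```

With layer normalization applied after each residual sum, as in the post-norm layout, x₀ is rescaled at every layer, so the identity holds only approximately in spirit rather than term by term.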

PyTorch Implementation Overview

The PyTorch implementation resides in /pytorch/torch/nn/modules/transformer.py (v2.5.1). The top‑level class is torch.nn.Transformer, instantiated as:

# Example usage
import torch
from torch import nn

transformer_model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6)
src = torch.rand((10, 32, 512))   # (source_len, batch, d_model)
tgt = torch.rand((20, 32, 512))   # (target_len, batch, d_model)
out = transformer_model(src, tgt)  # (target_len, batch, d_model)

The Transformer class contains five core parameters:

d_model : feature dimension (default 512)

nhead : number of attention heads (default 8); d_model must be divisible by nhead

num_encoder_layers : number of encoder blocks (default 6)

num_decoder_layers : number of decoder blocks (default 6)

dim_feedforward : inner dimension of the feed‑forward network (default 2048)
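To see how these five parameters shape the model, one can build a small instance and inspect its submodules. This is an illustrative sketch; the sizes below are arbitrary small values chosen to keep it fast, not recommended settings.

```python
import torch
from torch import nn

model = nn.Transformer(
    d_model=64,            # feature dimension of every token vector
    nhead=4,               # 64 / 4 = 16 dimensions per head
    num_encoder_layers=2,
    num_decoder_layers=3,
    dim_feedforward=128,   # inner width of the feed-forward network
)

n_enc = len(model.encoder.layers)            # number of encoder blocks
n_dec = len(model.decoder.layers)            # number of decoder blocks
first_ffn = model.encoder.layers[0].linear1  # first feed-forward projection
heads = model.encoder.layers[0].self_attn.num_heads

src = torch.rand(10, 2, 64)   # (source_len, batch, d_model)
tgt = torch.rand(7, 2, 64)    # (target_len, batch, d_model)
out = model(src, tgt)
print(n_enc, n_dec, heads, first_ffn.in_features, first_ffn.out_features, tuple(out.shape))
```

The output always has the target length and the same d_model as the inputs, regardless of how many layers or heads are configured.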

Each encoder block (TransformerEncoderLayer) implements:

Multi‑head self‑attention

Residual connection + layer normalization

Position‑wise feed‑forward network (two linear layers: 512→2048→512)

Dropout for regularization

The decoder block (TransformerDecoderLayer) adds a second attention sub‑module (cross‑attention) that attends to the encoder’s memory.

Class Hierarchy

TransformerEncoderLayer : implements a single encoder layer

TransformerEncoder : stacks multiple encoder layers

TransformerDecoderLayer : implements a single decoder layer

TransformerDecoder : stacks multiple decoder layers
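The four classes can also be composed by hand instead of going through nn.Transformer. The sketch below assumes small illustrative sizes and the default sequence-first tensor layout.

```python
import torch
from torch import nn

# A single layer is the template; the stack class clones it num_layers times.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, dim_feedforward=128)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

src = torch.rand(10, 3, 64)   # (source_len, batch, d_model)
tgt = torch.rand(7, 3, 64)    # (target_len, batch, d_model)

memory = encoder(src)         # encoder output, same shape as src
out = decoder(tgt, memory)    # decoder output, same shape as tgt
print(tuple(memory.shape), tuple(out.shape))
```

Building the stacks separately is convenient when only one side is needed, for example an encoder-only model.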

Forward Pass Logic

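During forward, the encoder receives the source sequence and a padding mask (src_key_padding_mask in nn.Transformer.forward) that hides padding tokens. The decoder receives the target sequence, the encoder’s output (memory), a causal target mask (tgt_mask) that blocks the upper triangle so each position attends only to itself and earlier positions, and a padding mask over the memory (memory_key_padding_mask) derived from source lengths.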

Encoder flow:

Self‑attention block → residual → layer norm

Feed‑forward block → residual → layer norm
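The encoder flow above can be written out as a small module. This is a hypothetical re-implementation for illustration (the class name MiniEncoderLayer is made up here); it follows the post-norm order described in the text and is not the actual source of TransformerEncoderLayer.

```python
import torch
from torch import nn

class MiniEncoderLayer(nn.Module):
    """Hypothetical post-norm encoder layer: sublayer -> residual -> layer norm."""

    def __init__(self, d_model=512, nhead=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention block -> residual -> layer norm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward block -> residual -> layer norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

layer = MiniEncoderLayer(d_model=32, nhead=4, dim_feedforward=64).eval()
x = torch.randn(5, 2, 32)                          # (seq_len, batch, d_model)
pad = torch.tensor([[False] * 5, [False] * 3 + [True] * 2])  # True = padding
with torch.no_grad():
    y = layer(x, key_padding_mask=pad)
print(tuple(y.shape))
```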

Decoder flow (per layer):

Self‑attention on the target (masked) → residual → layer norm

Cross‑attention between decoder output and encoder memory → residual → layer norm

Feed‑forward → residual → layer norm
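The effect of the masks in this flow can be observed directly. The sketch below uses a tiny nn.Transformer with illustrative sizes; the masks are built by hand as boolean tensors (True means "blocked"), and the assumption that the second source sequence has real length 4 is made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=64, dropout=0.0)
model.eval()

src = torch.randn(6, 2, 32)   # (source_len, batch, d_model)
tgt = torch.randn(4, 2, 32)   # (target_len, batch, d_model)

# Padding mask, shape (batch, source_len): True marks a padded position.
src_padding = torch.tensor([[False] * 6, [False] * 4 + [True] * 2])

# Causal mask, shape (target_len, target_len): True above the diagonal
# blocks attention to future positions.
causal = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

def run(src, tgt):
    with torch.no_grad():
        return model(src, tgt, tgt_mask=causal,
                     src_key_padding_mask=src_padding,
                     memory_key_padding_mask=src_padding)

out = run(src, tgt)

# Perturb only the last target token and compare the earlier output positions.
tgt_changed = tgt.clone()
tgt_changed[-1] += 1.0
out_changed = run(src, tgt_changed)
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-5))
```

Because of the causal mask, a change to a later target token should leave the outputs at earlier positions untouched, which is what makes token-by-token generation consistent with training.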

Attention Mechanism Details

Both encoder and decoder use Scaled Dot‑Product Attention :

Attention(Q, K, V) = softmax((Q·Kᵀ) / √d_k) · V

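Dividing by √d_k keeps the dot products from growing with the key dimension; without it, large scores would push softmax into saturated regions where gradients are very small. The computation steps are: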

Compute the dot product between queries Q and keys K.

Divide by √d_k and apply a mask (e.g., padding or causal mask) by setting masked positions to a large negative value.

Apply softmax to obtain attention weights.

Weight the values V by these probabilities and sum.

Multi‑head attention runs several such attention heads in parallel, each with its own linear projections of Q, K, and V. The head outputs are concatenated and linearly projected back to d_model.
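The split, attend, concatenate, and project sequence can be sketched as follows. MiniMultiHeadAttention is a hypothetical teaching class, not PyTorch's nn.MultiheadAttention; it assumes batch-first inputs and omits masking and dropout for brevity.

```python
import math
import torch
from torch import nn

class MiniMultiHeadAttention(nn.Module):
    """Hypothetical minimal multi-head attention (batch-first, no mask, no dropout)."""

    def __init__(self, d_model: int, nhead: int):
        super().__init__()
        assert d_model % nhead == 0, "d_model must be divisible by nhead"
        self.nhead, self.d_k = nhead, d_model // nhead
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        b, t, _ = x.shape
        return x.view(b, t, self.nhead, self.d_k).transpose(1, 2)

    def forward(self, query, key, value):
        q = self._split(self.q_proj(query))
        k = self._split(self.k_proj(key))
        v = self._split(self.v_proj(value))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = torch.softmax(scores, dim=-1)
        heads = attn @ v                                   # (batch, heads, q_len, d_k)
        b, _, t, _ = heads.shape
        merged = heads.transpose(1, 2).reshape(b, t, self.nhead * self.d_k)
        return self.out_proj(merged)                       # back to d_model

mha = MiniMultiHeadAttention(d_model=32, nhead=4)
x = torch.randn(2, 5, 32)        # queries, e.g. decoder states
memory = torch.randn(2, 7, 32)   # keys and values, e.g. encoder output
out = mha(x, memory, memory)
print(tuple(out.shape))
```

Passing the same tensor three times gives self-attention; passing decoder states as the query and encoder output as key and value gives cross-attention.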

Code Example of Scaled Dot‑Product Attention

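import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k); mask broadcastable to (..., q_len, k_len), 0 = masked
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)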
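As a sanity check, the function above can be compared with PyTorch's built-in torch.nn.functional.scaled_dot_product_attention. This sketch assumes PyTorch 2.0 or later, where that function is available; the function body is repeated so the snippet runs on its own, and the tensor sizes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Same logic as the function above, repeated for self-containment.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

torch.manual_seed(0)
q = torch.randn(2, 4, 5, 8)   # (batch, heads, query_len, d_k)
k = torch.randn(2, 4, 5, 8)
v = torch.randn(2, 4, 5, 8)

# Lower-triangular causal mask: 1 = may attend, 0 = masked.
causal = torch.tril(torch.ones(5, 5))

ours = scaled_dot_product_attention(q, k, v, mask=causal)
# In the built-in function, a boolean mask uses True for "may attend".
reference = F.scaled_dot_product_attention(q, k, v, attn_mask=causal.bool())
print(torch.allclose(ours, reference, atol=1e-5))
```

Note that this mask convention (nonzero means "keep") is the opposite of the boolean masks accepted by nn.MultiheadAttention and nn.Transformer, where True means "blocked". One caveat of the hand-written version: a query row whose keys are all masked produces NaN, because softmax over a row of -inf is undefined.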

Key Takeaways

The Transformer’s strength lies in its ability to model long‑range dependencies via self‑attention.

Positional encodings inject order information while preserving the model’s ability to generalize to unseen sequence lengths.

Residual connections and layer normalization ensure stable gradient flow and preserve positional signals.

PyTorch’s modular implementation mirrors the paper’s design, making it straightforward to customize parameters or replace components.

Illustrative Diagrams

Transformer overview diagram
Positional encoding illustration
Residual connection diagram
TransformerEncoder class diagram
TransformerDecoder class diagram
Overall Transformer flow
Encoder attention diagram
TransformerEncoderLayer internals
Decoder cross‑attention diagram
Decoder multi‑head attention
Full decoder layer
Scaled dot‑product attention
Attention intuition diagram