How Transformers Work: From Tensor Basics to GPU Performance Analysis
This article gives an engineer-focused breakdown of the transformer architecture: tensor fundamentals, matrix multiplication, GPU theoretical compute, attention and FFN mechanics, quantitative parameter and FLOP analysis, performance metrics such as MFU, parallelism strategies, variant optimizations, and practice questions. The goal is clear intuition for large-model efficiency and scaling.
1. Fundamentals Review
In this section we introduce basic concepts such as tensors, matrix multiplication, and GPU theoretical compute power.
1.1 What is a Tensor
A tensor is a multi‑dimensional array: scalars are 0‑D, vectors are 1‑D, matrices are 2‑D, and higher‑dimensional arrays are called tensors. We denote tensors with brackets, e.g., [B, S, H] for batch, sequence length, and hidden size.
Weights are represented as a 2‑D tensor [H, H].
Activations are a 3‑D tensor [B, S, H].
Multi‑head attention uses a 4‑D tensor [B, S, h, d].
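The three shapes above can be sketched directly. This is a minimal illustration with arbitrary toy sizes (B, S, H, h, d are assumed values, not from any specific model):

```python
import numpy as np

# Illustrative shapes (assumed toy values):
B, S, H = 2, 16, 64   # batch, sequence length, hidden size
h, d = 8, 8           # number of heads, per-head dim; H = h * d

weights = np.zeros((H, H))          # 2-D weight tensor [H, H]
activations = np.zeros((B, S, H))   # 3-D activation tensor [B, S, H]
# Splitting H into h heads of size d gives the 4-D attention layout:
per_head = activations.reshape(B, S, h, d)

print(per_head.shape)  # (2, 16, 8, 8)
```

Note that the 4-D multi-head layout is just a reshape of the 3-D activation tensor; no data is moved.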
1.2 Matrix Multiplication and Tensor‑Matrix Multiplication
Multiplying an M×K matrix A with a K×N matrix B yields an M×N matrix. The total FLOPs are 2·M·K·N (one multiply and one add per inner-product element), and the memory traffic is (M·K + K·N + M·N)·sizeof(dtype). Extending to tensors, multiplying a [B, S, H] tensor by an [H, H] matrix is equivalent to flattening the tensor into a (B·S)×H matrix and performing one ordinary matrix multiplication, yielding a [B, S, H] output.
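The FLOP and traffic formulas are easy to package as a helper. A small sketch (the example sizes B, S, H are assumed, chosen only to be LLM-like):

```python
def matmul_cost(M, K, N, dtype_bytes=2):
    """FLOPs and memory traffic of an MxK @ KxN matmul:
    2*M*K*N FLOPs (multiply + add per element),
    (M*K + K*N + M*N) * dtype_bytes bytes moved."""
    flops = 2 * M * K * N
    bytes_moved = (M * K + K * N + M * N) * dtype_bytes
    return flops, bytes_moved

# A [B, S, H] x [H, H] product is a (B*S) x H by H x H matmul:
B, S, H = 8, 1024, 4096
flops, traffic = matmul_cost(B * S, H, H)
print(flops / 1e12, traffic / 1e6)  # TFLOPs and MB for this one layer-sized matmul
```

Dividing the two numbers gives the arithmetic intensity, which determines whether the kernel is compute- or memory-bound.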
1.3 GPU Theoretical Compute Power
Each NVIDIA A800 Tensor Core can execute an 8×4×8 FP16 matrix FMA per clock, i.e. 2·(8·4·8) = 512 FLOPs. With 108 SMs, 4 Tensor Cores per SM, and a 1410 MHz boost clock, the theoretical throughput is 311.8 TFLOPS ≈ 312 TFLOPS.
To achieve this, workloads must use Tensor Cores and keep all cores busy with 8×4×8 matrix multiplications.
In practice, even well-tuned GEMM-heavy workloads typically sustain only around 80 % of this theoretical peak.
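The quoted 312 TFLOPS figure can be reconstructed from the per-clock numbers above:

```python
# Rebuilding the A800 FP16 peak-FLOPS figure from its components:
ops_per_tc_per_clock = 2 * (8 * 4 * 8)   # 512 FLOPs per Tensor Core per clock
sms, tcs_per_sm = 108, 4                 # 108 SMs, 4 Tensor Cores each
clock_hz = 1410e6                        # 1410 MHz boost clock

peak_flops = ops_per_tc_per_clock * sms * tcs_per_sm * clock_hz
print(peak_flops / 1e12)  # ≈ 311.8 TFLOPS
```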
2. Quantitative Analysis of Transformer Architecture
2.1 Original Transformer
The original Transformer consists of an Encoder and a Decoder. The Encoder processes input through repeated Multi‑Head Attention and Feed‑Forward blocks with residual connections and layer normalization. The Decoder adds Cross‑Attention to incorporate encoder outputs.
Encoder‑Decoder models are used for translation and multimodal tasks.
Encoder‑Only models (e.g., BERT) are for classification and extraction.
Decoder‑Only models (e.g., Llama, GPT) are for generative tasks.
2.2 GPT‑2 Structure
During inference, input text is tokenized, embedded, and passed through stacked Decoder layers to produce hidden states, which are projected back to vocabulary logits via a transposed embedding matrix and softmax.
2.3 Pre‑processing
The tokenizer converts text to token IDs, which are looked up in the embedding matrix [V, H] to obtain a [B, S, H] tensor. This is a pure memory operation with essentially no FLOPs.
2.4 Post‑processing
Logits are computed as ([B, S, H] × [H, V]) resulting in [B, S, V]; softmax yields token probabilities.
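The pre- and post-processing steps above fit in a few lines. A minimal sketch with toy sizes (V, H, B, S are assumed values), using the tied-embedding scheme described in 2.2 where the output projection is the transposed embedding matrix:

```python
import numpy as np

V, H, B, S = 100, 16, 2, 5                # toy vocab, hidden, batch, seq sizes
rng = np.random.default_rng(0)
embedding = rng.standard_normal((V, H))   # embedding matrix [V, H]

token_ids = rng.integers(0, V, size=(B, S))
x = embedding[token_ids]                  # lookup: [B, S] -> [B, S, H], pure memory op

logits = x @ embedding.T                  # [B, S, H] x [H, V] -> [B, S, V]
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)     # softmax over the vocabulary axis

print(probs.shape)  # (2, 5, 100); each [b, s] row sums to 1
```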
2.5 Multi‑Layer Decoder Processing
Each Decoder layer contains Self‑Attention and a Feed‑Forward Network (FFN).
2.6 Single‑Head and Multi‑Head Attention
Attention computes Q, K, and V from the input tensor via linear projections, then performs Q·Kᵀ, a softmax, and a multiplication with V. For a single head with query length S, key/value length S′, and head dimension d, the FLOPs are 2·B·S·d·S′ (for Q·Kᵀ) plus 2·B·S·S′·d (for the product with V). Multi‑head attention repeats this for h heads, scaling the FLOPs by h.
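The single-head computation can be written out directly. A minimal numpy sketch (toy sizes assumed; the 1/√d scaling from the original Transformer is included):

```python
import numpy as np

def single_head_attention(q, k, v):
    """Scaled dot-product attention for one head.
    q: [B, S, d], k/v: [B, S', d] -> output [B, S, d]."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # [B, S, S']: 2*B*S*d*S' FLOPs
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over S'
    return weights @ v                               # [B, S, d]: 2*B*S*S'*d FLOPs

B, S, Sp, d = 2, 4, 6, 8
rng = np.random.default_rng(0)
out = single_head_attention(rng.standard_normal((B, S, d)),
                            rng.standard_normal((B, Sp, d)),
                            rng.standard_normal((B, Sp, d)))
print(out.shape)  # (2, 4, 8)
```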
2.7 Parameter and Compute Cost of Attention
Attention has 4·H·H parameters and 8·B·S·H·H FLOPs.
2.8 FFN (MLP) Structure
FFN performs two matrix multiplications ([H,4H] and [4H,H]) with a ReLU activation in between, totaling 8·H·H parameters and 16·B·S·H·H FLOPs.
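The classic FFN is two matmuls around a ReLU. A minimal sketch with toy sizes (H assumed):

```python
import numpy as np

def ffn(x, w1, w2):
    """Classic Transformer FFN: [*, H] -> [*, 4H] -> ReLU -> [*, H].
    Parameters: H*4H + 4H*H = 8*H^2; FLOPs: 2*B*S*8*H^2."""
    return np.maximum(x @ w1, 0.0) @ w2

H = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, H))          # [B, S, H]
out = ffn(x,
          rng.standard_normal((H, 4 * H)),  # up-projection [H, 4H]
          rng.standard_normal((4 * H, H)))  # down-projection [4H, H]
print(out.shape)  # (2, 3, 8)
```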
2.10 Parameter Analysis
For a model with L layers, total parameters N = L·12·H·H + V·H.
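This formula is easy to sanity-check against a known model. GPT-2 small has L = 12, H = 768, V = 50257 and a reported size of about 124 M parameters:

```python
def total_params(L, H, V):
    """N = L * 12 * H^2 (4*H^2 attention + 8*H^2 FFN per layer) + V*H embedding."""
    return 12 * L * H * H + V * H

# Sanity check against GPT-2 small (L=12, H=768, V=50257):
n = total_params(12, 768, 50257)
print(n / 1e6)  # ≈ 123.5M, close to the reported ~124M
```

The small remainder is layer norms and biases, which the 12·H·H approximation deliberately ignores.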
2.11 Inference Compute Analysis
Per‑layer inference FLOPs ≈ 2·B·S·4·H·H + 4·B·S·H·S' + 2·B·S·8·H·H. Summed over L layers.
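The three terms map directly onto the sub-blocks described above. A small helper (toy example sizes assumed):

```python
def layer_inference_flops(B, S, S_prime, H):
    """Per-layer decoder FLOPs, matching the three terms in the text."""
    proj = 2 * B * S * 4 * H * H       # Q, K, V, O projections (4 matrices of H x H)
    attn = 4 * B * S * H * S_prime     # Q.K^T and A.V, summed over heads (h*d = H)
    ffn  = 2 * B * S * 8 * H * H       # FFN up- and down-projection
    return proj + attn + ffn

# Decode step (S=1) with a 2048-token KV cache, H=4096:
print(layer_inference_flops(1, 1, 2048, 4096) / 1e9)  # GFLOPs for one layer
```

Note that for long contexts the S′ term grows while the weight terms stay fixed, which is why attention dominates at large sequence lengths.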
2.12 Inference Memory Consumption
Memory consists of the model parameters (N·sizeof(dtype)), the KV cache (B·L·S′·h·d·2·sizeof(dtype), storing both K and V per layer for every cached token of every request in the batch), and intermediate activations (B·S·H·c for some small constant c).
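A quick sketch of the two dominant terms, using assumed Llama-2-7B-like shapes (N = 7e9, L = 32, h = 32, d = 128, FP16):

```python
def inference_memory_bytes(N, L, S_prime, h, d, B=1, dtype_bytes=2):
    """Rough static inference memory: weights + per-batch KV cache.
    (Intermediate activations are omitted; they are comparatively small.)"""
    weights = N * dtype_bytes
    kv_cache = B * L * S_prime * h * d * 2 * dtype_bytes  # K and V, every layer
    return weights, kv_cache

# Assumed 7B-like shapes with a 4096-token context:
w, kv = inference_memory_bytes(7e9, 32, 4096, 32, 128)
print(w / 1e9, kv / 1e9)  # ≈ 14 GB weights, ≈ 2.1 GB KV cache
```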
2.13 From Inference to Training
Training adds a backward pass (≈2× the forward FLOPs, so ≈6N FLOPs per trained token in total) and optimizer state: with Adam in mixed precision, the fp16 weights and gradients plus the fp32 master weights and two moment estimates add up to ≈16 bytes per parameter, roughly 8× the fp16 weight memory alone.
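Both rules of thumb in one sketch, under the mixed-precision Adam assumption stated above (the 7e9 example size is illustrative):

```python
def training_flops_per_token(n_params):
    """Forward ~2N, backward ~2x forward -> ~6N FLOPs per trained token."""
    return 6 * n_params

def adam_static_bytes(n_params):
    """Mixed-precision Adam: fp16 weights (2) + fp16 grads (2) +
    fp32 master weights, m, and v (4 + 4 + 4) = 16 bytes per parameter."""
    return 16 * n_params

print(training_flops_per_token(7e9) / 1e9, adam_static_bytes(7e9) / 1e9)
# 42 GFLOPs per token, and 112 GB of static training state for a 7B model
```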
2.16 MFU Performance Metric
MFU (Model FLOPs Utilization) = FLOPs actually achieved per second ÷ theoretical peak FLOPs per second. For the A800, the practical ceiling is about 80 % of the 312 TFLOPS peak.
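In training, the achieved FLOPs are usually estimated from token throughput via the 6N-per-token rule. A sketch with hypothetical throughput numbers (the 1500 tokens/s/GPU figure is an assumption for illustration):

```python
def mfu(tokens_per_sec, n_params, peak_flops, flops_per_token_factor=6):
    """MFU = achieved FLOPs/s over peak FLOPs/s.
    Training needs ~6N FLOPs per token (2N forward + 4N backward)."""
    achieved = tokens_per_sec * flops_per_token_factor * n_params
    return achieved / peak_flops

# Hypothetical: 1500 tokens/s per GPU on a 7B model, A800 peak of 312 TFLOPS
print(mfu(1500, 7e9, 312e12))  # ≈ 0.20, i.e. 20% MFU
```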
2.17 Parallelism Strategies
Data parallelism splits batch B, sequence parallelism splits length S, pipeline parallelism splits layers L, and tensor parallelism splits large weight matrices.
3. Transformer Variants
3.1 Attention Optimizations
Multi‑Query Attention (MQA) reduces K and V heads to 1, saving memory. Grouped‑Query Attention (GQA) uses g < h heads. Sliding‑Window Attention limits the attention matrix to a fixed window, reducing compute and memory.
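The KV-cache saving from reducing KV heads is easy to quantify. A sketch with assumed 70B-class shapes (L = 80, d = 128, 64 query heads, FP16):

```python
def kv_cache_bytes(L, S, kv_heads, d, dtype_bytes=2):
    """Per-request KV-cache size: K and V, per layer, per cached token."""
    return 2 * L * S * kv_heads * d * dtype_bytes

# Assumed 70B-like shapes, 4096 cached tokens:
mha = kv_cache_bytes(80, 4096, 64, 128)  # full multi-head: 64 KV heads
gqa = kv_cache_bytes(80, 4096, 8, 128)   # grouped-query:    8 KV heads
print(mha / 1e9, gqa / 1e9)              # GQA shrinks the cache 8x here
```

MQA is the extreme case with kv_heads = 1; sliding-window attention instead caps S at the window size.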
3.2 FFN Optimizations
SwiGLU replaces ReLU with a gated SiLU activation, adding a third matrix multiplication. With the intermediate size held fixed this would grow FFN parameters by ~50 %, so implementations typically shrink the intermediate size to about ⅔ of 4H to keep the parameter count roughly constant.
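A minimal sketch of the gated FFN (toy sizes assumed; SiLU(x) = x·sigmoid(x)):

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU FFN: three matmuls instead of two, with a SiLU-gated product.
    x: [*, H]; w_gate, w_up: [H, F]; w_down: [F, H]."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down   # elementwise gate, then down-project

H, F = 8, 22                              # toy sizes; in practice F ≈ (2/3) * 4H
rng = np.random.default_rng(0)
out = swiglu_ffn(rng.standard_normal((2, H)),
                 rng.standard_normal((H, F)),
                 rng.standard_normal((H, F)),
                 rng.standard_normal((F, H)))
print(out.shape)  # (2, 8)
```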
3.3 Mixture‑of‑Experts (MoE)
MoE partitions the FFN into N experts with a routing network, allowing parameter scaling without proportional compute increase.
4. Practice Questions
4.1 Question 1
Using MFU: Llama 3 70B was trained on 15 T tokens in roughly 6.4 M H100 GPU‑hours (Meta's reported figure). At ≈6N FLOPs per token, this works out to roughly 28 % of the H100's 989 TFLOPS BF16 theoretical peak.
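The arithmetic, using Meta's reported figures (70e9 parameters, 15e12 tokens, ~6.4e6 H100 GPU-hours, 989 TFLOPS BF16 peak) and the 6N rule:

```python
# Question 1: MFU of the Llama 3 70B training run.
total_flops = 6 * 70e9 * 15e12        # ~6N FLOPs per token over 15T tokens
gpu_seconds = 6.4e6 * 3600            # reported GPU-hours -> GPU-seconds
achieved = total_flops / gpu_seconds  # FLOPs/s actually sustained per GPU
print(achieved / 989e12)              # ≈ 0.28
```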
4.2 Question 2
The ≈7 B parameter count comes from L = 32 layers, H = 4096, a SwiGLU intermediate size of 11008, and untied input/output embeddings. Per token, inference costs ≈2N ≈ 14 GFLOPs and training ≈6N ≈ 42 GFLOPs.
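Checking the count with Llama-2-7B-style shapes (L = 32, H = 4096, F = 11008, V = 32000, three-matrix SwiGLU FFN, untied embeddings):

```python
# Question 2: where the "7B" comes from.
L, H, F, V = 32, 4096, 11008, 32000
attn = 4 * H * H                     # Q, K, V, O projections
ffn = 3 * H * F                      # gate, up, and down matrices (SwiGLU)
total = L * (attn + ffn) + 2 * V * H  # + untied input and output embeddings
print(total / 1e9)  # ≈ 6.74, marketed as 7B (norm weights add a little more)
```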
4.3 Question 3
Eight A800 GPUs (312 TFLOPS each) cannot serve 16 concurrent 4000‑token requests within a 600–700 ms latency budget; even at full theoretical peak, the required computation takes about 3.6 s.
4.4 Question 4
With two A800 cards (≈152 GB usable memory) and a 70 B model (≈140 GB), about 12 GB remains for KV cache, allowing roughly 8 concurrent 4000‑token requests.
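A rough check of that budget, assuming 70B GQA shapes (L = 80, 8 KV heads, d = 128, FP16):

```python
# Question 4: how many 4000-token requests fit in the leftover memory.
free_bytes = 12e9                          # memory left after the 140 GB of weights
per_token = 2 * 80 * 8 * 128 * 2           # K+V bytes, per layer, per cached token
per_request = per_token * 4000             # one full 4000-token request
print(per_request / 1e9, free_bytes // per_request)
# ≈ 1.3 GB per request, so 8-9 requests; activation overhead pushes it toward 8
```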
Baidu Intelligent Cloud Tech Hub
