How Transformers Work: From Tensor Basics to GPU Performance Analysis
This article gives an engineer-focused breakdown of the transformer architecture: tensor fundamentals, matrix multiplication, GPU theoretical compute, attention and FFN mechanics, quantitative parameter and FLOP analysis, performance metrics such as MFU, parallelism strategies, variant optimizations, and practice questions. The goal is clear intuition for large-model efficiency and scaling.
1. Fundamentals Review
In this section we introduce basic concepts such as tensors, matrix multiplication, and GPU theoretical compute power.
1.1 What is a Tensor
A tensor is a multi‑dimensional array: scalars are 0‑D, vectors are 1‑D, matrices are 2‑D, and higher‑dimensional arrays are called tensors. We denote tensors with brackets, e.g., [B, S, H] for batch, sequence length, and hidden size.
Weights are represented as a 2‑D tensor [H, H].
Activations are a 3‑D tensor [B, S, H].
Multi‑head attention uses a 4‑D tensor [B, S, h, d].
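The three shapes above can be sketched directly. This is a minimal illustration with arbitrary toy sizes (B, S, H, h, d are assumed values, not from any specific model):

```python
import numpy as np

# Illustrative shapes (assumed toy values):
B, S, H = 2, 16, 64   # batch, sequence length, hidden size
h, d = 8, 8           # number of heads, per-head dim; H = h * d

weights = np.zeros((H, H))          # 2-D weight tensor [H, H]
activations = np.zeros((B, S, H))   # 3-D activation tensor [B, S, H]
# Splitting H into h heads of size d gives the 4-D attention layout:
per_head = activations.reshape(B, S, h, d)

print(per_head.shape)  # (2, 16, 8, 8)
```

Note that the 4-D multi-head layout is just a reshape of the 3-D activation tensor; no data is moved.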
1.2 Matrix Multiplication and Tensor‑Matrix Multiplication
Multiplying an M×K matrix A with a K×N matrix B yields an M×N matrix. The total FLOPs are 2·M·K·N (one multiply and one add per inner-product element), and the memory traffic is (M·K + K·N + M·N)·sizeof(dtype). Extending to tensors, multiplying a [B, S, H] tensor by an [H, H] matrix is equivalent to flattening the tensor into a (B·S)×H matrix and performing one ordinary matrix multiplication, yielding a [B, S, H] output.
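The FLOP and traffic formulas are easy to package as a helper. A small sketch (the example sizes B, S, H are assumed, chosen only to be LLM-like):

```python
def matmul_cost(M, K, N, dtype_bytes=2):
    """FLOPs and memory traffic of an MxK @ KxN matmul:
    2*M*K*N FLOPs (multiply + add per element),
    (M*K + K*N + M*N) * dtype_bytes bytes moved."""
    flops = 2 * M * K * N
    bytes_moved = (M * K + K * N + M * N) * dtype_bytes
    return flops, bytes_moved

# A [B, S, H] x [H, H] product is a (B*S) x H by H x H matmul:
B, S, H = 8, 1024, 4096
flops, traffic = matmul_cost(B * S, H, H)
print(flops / 1e12, traffic / 1e6)  # TFLOPs and MB for this one layer-sized matmul
```

Dividing the two numbers gives the arithmetic intensity, which determines whether the kernel is compute- or memory-bound.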
1.3 GPU Theoretical Compute Power
Each NVIDIA A800 Tensor Core can execute an 8×4×8 FP16 matrix FMA per clock, i.e. 2·(8·4·8) = 512 FLOPs. With 108 SMs, 4 Tensor Cores per SM, and a 1410 MHz boost clock, the theoretical throughput is 311.8 TFLOPS ≈ 312 TFLOPS.
To achieve this, workloads must use Tensor Cores and keep all cores busy with 8×4×8 matrix multiplications.
In practice, even well-tuned GEMM-heavy workloads typically sustain only around 80 % of this theoretical peak.
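The quoted 312 TFLOPS figure can be reconstructed from the per-clock numbers above:

```python
# Rebuilding the A800 FP16 peak-FLOPS figure from its components:
ops_per_tc_per_clock = 2 * (8 * 4 * 8)   # 512 FLOPs per Tensor Core per clock
sms, tcs_per_sm = 108, 4                 # 108 SMs, 4 Tensor Cores each
clock_hz = 1410e6                        # 1410 MHz boost clock

peak_flops = ops_per_tc_per_clock * sms * tcs_per_sm * clock_hz
print(peak_flops / 1e12)  # ≈ 311.8 TFLOPS
```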
2. Quantitative Analysis of Transformer Architecture
2.1 Original Transformer
The original Transformer consists of an Encoder and a Decoder. The Encoder processes input through repeated Multi‑Head Attention and Feed‑Forward blocks with residual connections and layer normalization. The Decoder adds Cross‑Attention to incorporate encoder outputs.
Encoder‑Decoder models are used for translation and multimodal tasks.
Encoder‑Only models (e.g., BERT) are for classification and extraction.
Decoder‑Only models (e.g., Llama, GPT) are for generative tasks.
2.2 GPT‑2 Structure
During inference, input text is tokenized, embedded, and passed through stacked Decoder layers to produce hidden states, which are projected back to vocabulary logits via a transposed embedding matrix and softmax.
2.3 Pre‑processing
The tokenizer converts text to token IDs, which are looked up in the embedding matrix [V, H] to obtain a [B, S, H] tensor. This is a pure memory operation with essentially no FLOPs.
2.4 Post‑processing
Logits are computed as ([B, S, H] × [H, V]) resulting in [B, S, V]; softmax yields token probabilities.
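The pre- and post-processing steps above fit in a few lines. A minimal sketch with toy sizes (V, H, B, S are assumed values), using the tied-embedding scheme described in 2.2 where the output projection is the transposed embedding matrix:

```python
import numpy as np

V, H, B, S = 100, 16, 2, 5                # toy vocab, hidden, batch, seq sizes
rng = np.random.default_rng(0)
embedding = rng.standard_normal((V, H))   # embedding matrix [V, H]

token_ids = rng.integers(0, V, size=(B, S))
x = embedding[token_ids]                  # lookup: [B, S] -> [B, S, H], pure memory op

logits = x @ embedding.T                  # [B, S, H] x [H, V] -> [B, S, V]
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)     # softmax over the vocabulary axis

print(probs.shape)  # (2, 5, 100); each [b, s] row sums to 1
```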
2.5 Multi‑Layer Decoder Processing
Each Decoder layer contains Self‑Attention and a Feed‑Forward Network (FFN).
2.6 Single‑Head and Multi‑Head Attention
Attention computes Q, K, and V from the input tensor via linear projections, then performs Q·Kᵀ, a softmax, and a multiplication with V. For a single head with query length S, key/value length S′, and head dimension d, the FLOPs are 2·B·S·d·S′ (for Q·Kᵀ) plus 2·B·S·S′·d (for the product with V). Multi‑head attention repeats this for h heads, scaling the FLOPs by h.
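The single-head computation can be written out directly. A minimal numpy sketch (toy sizes assumed; the 1/√d scaling from the original Transformer is included):

```python
import numpy as np

def single_head_attention(q, k, v):
    """Scaled dot-product attention for one head.
    q: [B, S, d], k/v: [B, S', d] -> output [B, S, d]."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # [B, S, S']: 2*B*S*d*S' FLOPs
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over S'
    return weights @ v                               # [B, S, d]: 2*B*S*S'*d FLOPs

B, S, Sp, d = 2, 4, 6, 8
rng = np.random.default_rng(0)
out = single_head_attention(rng.standard_normal((B, S, d)),
                            rng.standard_normal((B, Sp, d)),
                            rng.standard_normal((B, Sp, d)))
print(out.shape)  # (2, 4, 8)
```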
2.7 Parameter and Compute Cost of Attention
Attention has 4·H·H parameters and 8·B·S·H·H FLOPs.
2.8 FFN (MLP) Structure
FFN performs two matrix multiplications ([H,4H] and [4H,H]) with a ReLU activation in between, totaling 8·H·H parameters and 16·B·S·H·H FLOPs.
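The classic FFN is two matmuls around a ReLU. A minimal sketch with toy sizes (H assumed):

```python
import numpy as np

def ffn(x, w1, w2):
    """Classic Transformer FFN: [*, H] -> [*, 4H] -> ReLU -> [*, H].
    Parameters: H*4H + 4H*H = 8*H^2; FLOPs: 2*B*S*8*H^2."""
    return np.maximum(x @ w1, 0.0) @ w2

H = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, H))          # [B, S, H]
out = ffn(x,
          rng.standard_normal((H, 4 * H)),  # up-projection [H, 4H]
          rng.standard_normal((4 * H, H)))  # down-projection [4H, H]
print(out.shape)  # (2, 3, 8)
```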
2.10 Parameter Analysis
For a model with L layers, total parameters N = L·12·H·H + V·H.
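This formula is easy to sanity-check against a known model. GPT-2 small has L = 12, H = 768, V = 50257 and a reported size of about 124 M parameters:

```python
def total_params(L, H, V):
    """N = L * 12 * H^2 (4*H^2 attention + 8*H^2 FFN per layer) + V*H embedding."""
    return 12 * L * H * H + V * H

# Sanity check against GPT-2 small (L=12, H=768, V=50257):
n = total_params(12, 768, 50257)
print(n / 1e6)  # ≈ 123.5M, close to the reported ~124M
```

The small remainder is layer norms and biases, which the 12·H·H approximation deliberately ignores.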
2.11 Inference Compute Analysis
Per‑layer inference FLOPs ≈ 2·B·S·4·H·H + 4·B·S·H·S' + 2·B·S·8·H·H. Summed over L layers.
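The three terms map directly onto the sub-blocks described above. A small helper (toy example sizes assumed):

```python
def layer_inference_flops(B, S, S_prime, H):
    """Per-layer decoder FLOPs, matching the three terms in the text."""
    proj = 2 * B * S * 4 * H * H       # Q, K, V, O projections (4 matrices of H x H)
    attn = 4 * B * S * H * S_prime     # Q.K^T and A.V, summed over heads (h*d = H)
    ffn  = 2 * B * S * 8 * H * H       # FFN up- and down-projection
    return proj + attn + ffn

# Decode step (S=1) with a 2048-token KV cache, H=4096:
print(layer_inference_flops(1, 1, 2048, 4096) / 1e9)  # GFLOPs for one layer
```

Note that for long contexts the S′ term grows while the weight terms stay fixed, which is why attention dominates at large sequence lengths.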
2.12 Inference Memory Consumption
Memory consists of the model parameters (N·sizeof(dtype)), the KV cache (B·L·S′·h·d·2·sizeof(dtype), storing both K and V per layer for every cached token of every request in the batch), and intermediate activations (B·S·H·c for some small constant c).
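A quick sketch of the two dominant terms, using assumed Llama-2-7B-like shapes (N = 7e9, L = 32, h = 32, d = 128, FP16):

```python
def inference_memory_bytes(N, L, S_prime, h, d, B=1, dtype_bytes=2):
    """Rough static inference memory: weights + per-batch KV cache.
    (Intermediate activations are omitted; they are comparatively small.)"""
    weights = N * dtype_bytes
    kv_cache = B * L * S_prime * h * d * 2 * dtype_bytes  # K and V, every layer
    return weights, kv_cache

# Assumed 7B-like shapes with a 4096-token context:
w, kv = inference_memory_bytes(7e9, 32, 4096, 32, 128)
print(w / 1e9, kv / 1e9)  # ≈ 14 GB weights, ≈ 2.1 GB KV cache
```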
2.13 From Inference to Training
Training adds a backward pass (≈2× the forward FLOPs, so ≈6N FLOPs per trained token in total) and optimizer state: with Adam in mixed precision, the fp16 weights and gradients plus the fp32 master weights and two moment estimates add up to ≈16 bytes per parameter, roughly 8× the fp16 weight memory alone.
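Both rules of thumb in one sketch, under the mixed-precision Adam assumption stated above (the 7e9 example size is illustrative):

```python
def training_flops_per_token(n_params):
    """Forward ~2N, backward ~2x forward -> ~6N FLOPs per trained token."""
    return 6 * n_params

def adam_static_bytes(n_params):
    """Mixed-precision Adam: fp16 weights (2) + fp16 grads (2) +
    fp32 master weights, m, and v (4 + 4 + 4) = 16 bytes per parameter."""
    return 16 * n_params

print(training_flops_per_token(7e9) / 1e9, adam_static_bytes(7e9) / 1e9)
# 42 GFLOPs per token, and 112 GB of static training state for a 7B model
```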
2.16 MFU Performance Metric
MFU (Model FLOPs Utilization) = FLOPs actually achieved per second ÷ theoretical peak FLOPs per second. For the A800, the practical ceiling is about 80 % of the 312 TFLOPS peak.
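In training, the achieved FLOPs are usually estimated from token throughput via the 6N-per-token rule. A sketch with hypothetical throughput numbers (the 1500 tokens/s/GPU figure is an assumption for illustration):

```python
def mfu(tokens_per_sec, n_params, peak_flops, flops_per_token_factor=6):
    """MFU = achieved FLOPs/s over peak FLOPs/s.
    Training needs ~6N FLOPs per token (2N forward + 4N backward)."""
    achieved = tokens_per_sec * flops_per_token_factor * n_params
    return achieved / peak_flops

# Hypothetical: 1500 tokens/s per GPU on a 7B model, A800 peak of 312 TFLOPS
print(mfu(1500, 7e9, 312e12))  # ≈ 0.20, i.e. 20% MFU
```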
2.17 Parallelism Strategies
Data parallelism splits batch B, sequence parallelism splits length S, pipeline parallelism splits layers L, and tensor parallelism splits large weight matrices.
3. Transformer Variants
3.1 Attention Optimizations
Multi‑Query Attention (MQA) reduces K and V heads to 1, saving memory. Grouped‑Query Attention (GQA) uses g < h heads. Sliding‑Window Attention limits the attention matrix to a fixed window, reducing compute and memory.
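The KV-cache saving from reducing KV heads is easy to quantify. A sketch with assumed 70B-class shapes (L = 80, d = 128, 64 query heads, FP16):

```python
def kv_cache_bytes(L, S, kv_heads, d, dtype_bytes=2):
    """Per-request KV-cache size: K and V, per layer, per cached token."""
    return 2 * L * S * kv_heads * d * dtype_bytes

# Assumed 70B-like shapes, 4096 cached tokens:
mha = kv_cache_bytes(80, 4096, 64, 128)  # full multi-head: 64 KV heads
gqa = kv_cache_bytes(80, 4096, 8, 128)   # grouped-query:    8 KV heads
print(mha / 1e9, gqa / 1e9)              # GQA shrinks the cache 8x here
```

MQA is the extreme case with kv_heads = 1; sliding-window attention instead caps S at the window size.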
3.2 FFN Optimizations
SwiGLU replaces ReLU with a gated SiLU activation, adding a third matrix multiplication. With the intermediate size held fixed this would grow FFN parameters by ~50 %, so implementations typically shrink the intermediate size to about ⅔ of 4H to keep the parameter count roughly constant.
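A minimal sketch of the gated FFN (toy sizes assumed; SiLU(x) = x·sigmoid(x)):

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU FFN: three matmuls instead of two, with a SiLU-gated product.
    x: [*, H]; w_gate, w_up: [H, F]; w_down: [F, H]."""
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))   # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down   # elementwise gate, then down-project

H, F = 8, 22                              # toy sizes; in practice F ≈ (2/3) * 4H
rng = np.random.default_rng(0)
out = swiglu_ffn(rng.standard_normal((2, H)),
                 rng.standard_normal((H, F)),
                 rng.standard_normal((H, F)),
                 rng.standard_normal((F, H)))
print(out.shape)  # (2, 8)
```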
3.3 Mixture‑of‑Experts (MoE)
MoE partitions the FFN into N experts with a routing network, allowing parameter scaling without proportional compute increase.
4. Practice Questions
4.1 Question 1
Using MFU: Llama 3 70B was trained on 15 T tokens in roughly 6.4 M H100 GPU‑hours (Meta's reported figure). At ≈6N FLOPs per token, this works out to roughly 28 % of the H100's 989 TFLOPS BF16 theoretical peak.
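The arithmetic, using Meta's reported figures (70e9 parameters, 15e12 tokens, ~6.4e6 H100 GPU-hours, 989 TFLOPS BF16 peak) and the 6N rule:

```python
# Question 1: MFU of the Llama 3 70B training run.
total_flops = 6 * 70e9 * 15e12        # ~6N FLOPs per token over 15T tokens
gpu_seconds = 6.4e6 * 3600            # reported GPU-hours -> GPU-seconds
achieved = total_flops / gpu_seconds  # FLOPs/s actually sustained per GPU
print(achieved / 989e12)              # ≈ 0.28
```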
4.2 Question 2
The ≈7 B parameter count comes from L = 32 layers, H = 4096, a SwiGLU intermediate size of 11008, and untied input/output embeddings. Per token, inference costs ≈2N ≈ 14 GFLOPs and training ≈6N ≈ 42 GFLOPs.
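Checking the count with Llama-2-7B-style shapes (L = 32, H = 4096, F = 11008, V = 32000, three-matrix SwiGLU FFN, untied embeddings):

```python
# Question 2: where the "7B" comes from.
L, H, F, V = 32, 4096, 11008, 32000
attn = 4 * H * H                     # Q, K, V, O projections
ffn = 3 * H * F                      # gate, up, and down matrices (SwiGLU)
total = L * (attn + ffn) + 2 * V * H  # + untied input and output embeddings
print(total / 1e9)  # ≈ 6.74, marketed as 7B (norm weights add a little more)
```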
4.3 Question 3
Eight A800 GPUs (312 TFLOPS each) cannot serve 16 concurrent 4000‑token requests within a 600–700 ms latency budget; even at full theoretical peak, the required computation takes about 3.6 s.
4.4 Question 4
With two A800 cards (≈152 GB usable memory) and a 70 B model (≈140 GB), about 12 GB remains for KV cache, allowing roughly 8 concurrent 4000‑token requests.
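A rough check of that budget, assuming 70B GQA shapes (L = 80, 8 KV heads, d = 128, FP16):

```python
# Question 4: how many 4000-token requests fit in the leftover memory.
free_bytes = 12e9                          # memory left after the 140 GB of weights
per_token = 2 * 80 * 8 * 128 * 2           # K+V bytes, per layer, per cached token
per_request = per_token * 4000             # one full 4000-token request
print(per_request / 1e9, free_bytes // per_request)
# ≈ 1.3 GB per request, so 8-9 requests; activation overhead pushes it toward 8
```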
Baidu Intelligent Cloud Tech Hub
