How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting
This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.
When AI algorithms run on servers, a frequent question is "How many model parameters can a single GPU hold?" The answer depends on model architecture, framework, driver version, and GPU hardware. This guide focuses on large‑model training and inference, presenting a systematic way to calculate and optimise GPU memory usage.
1. Memory Composition in Training/Inference
GPU global memory is split between the AI framework and the system driver. The framework‑controlled portion is user‑controllable and is the focus of this analysis. Tools like nvidia‑smi can report per‑process memory usage.
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    1   N/A  N/A     67321      C   .../anaconda3/envs/py/bin/python          23646MiB |
|    1   N/A  N/A     71612      C   .../anaconda3/envs/py/bin/python            848MiB |
|    2   N/A  N/A     67321      C   .../anaconda3/envs/py/bin/python          25776MiB |
+---------------------------------------------------------------------------------------+
The framework‑side memory consists of several categories:
Model parameters (parameter)
Optimizer state (optimizer_state)
Activations (activation)
Gradients (gradient)
Input/Output data (input)
Temporary variables (temporary)
Autograd internals (autograd_detail)
Unknown/uncategorised data (unknown)
From a user perspective, these can be grouped into estimable values (parameter, optimizer_state, activation, gradient, input), unnamed data (temporary, unknown), and framework‑generated data (autograd_detail).
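The per-process figures that nvidia-smi prints can also be collected programmatically. The sketch below assumes the `--query-compute-apps` CSV query mode of nvidia-smi is available on the machine; the helper names `parse_compute_apps` and `query_gpu_processes` are illustrative, and the sample text simply mirrors the three rows of the table above.

```python
import subprocess

# nvidia-smi query that prints one CSV row per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Return {pid: total MiB across all GPUs} from the CSV rows."""
    usage = {}
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        if len(parts) < 3 or not parts[-1].isdigit():
            continue  # skip blank rows or rows without a numeric value
        pid, used_mib = int(parts[0]), int(parts[-1])
        usage[pid] = usage.get(pid, 0) + used_mib
    return usage


def query_gpu_processes():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


# Sample rows shaped like the table above (GPU index omitted by this query).
SAMPLE = """\
67321, .../anaconda3/envs/py/bin/python, 23646
71612, .../anaconda3/envs/py/bin/python, 848
67321, .../anaconda3/envs/py/bin/python, 25776
"""

if __name__ == "__main__":
    print(parse_compute_apps(SAMPLE))
```

Summing by PID shows how much memory one training process holds across all the GPUs it touches, which is the number to compare against the estimates in the next section.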
2. Estimation Formulas for Training Scenarios
2.1 Model Memory
The memory occupied by the model itself is proportional to the number of parameters and their data type (fp32, fp16/bf16, int8, fp8, etc.). For a 1‑billion‑parameter model stored in fp32, the checkpoint size is roughly 4 GB, implying a similar order of magnitude for GPU memory.
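As a quick check of the arithmetic, the sketch below multiplies the parameter count by the bytes per element of each datatype. The function name `param_memory_gib` and the lookup table are illustrative; results are in units of 1024³ bytes, matching the convention used in the formulas later in this article.

```python
# Bytes needed to store one parameter in each datatype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "fp8": 1}


def param_memory_gib(num_params, dtype="fp32"):
    """Memory for the raw weights only, in units of 1024**3 bytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = param_memory_gib(1_000_000_000, dtype)
        print(f"1B params in {dtype}: {size:.2f}")
```

A 1‑billion‑parameter model in fp32 is 4 × 10⁹ bytes, which is the "roughly 4 GB" quoted above, or about 3.73 when divided by 1024³.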
2.2 Optimizer State
For Adam, each parameter has a momentum (first‑moment) and a variance (second‑moment) value, and mixed‑precision training additionally keeps an fp32 master copy of the weights. The total memory (in GB) can be expressed as: (4 + 4 + 4) × params / (1024³), where the three 4‑byte terms correspond to the fp32 model copy, momentum, and variance.
2.3 Gradient Memory
Gradients share the same datatype as the model parameters, so their memory is calculated similarly.
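Sections 2.1 to 2.3 can be combined into a single estimate of the "static" training state, that is, everything except activations. The sketch below is a minimal illustration under two assumptions that are not spelled out in the text: the optimizer is Adam, and mixed precision means 16‑bit working weights and gradients alongside fp32 optimizer tensors. The function name `training_state_gib` is hypothetical.

```python
GIB = 1024**3


def training_state_gib(num_params, mixed_precision=True):
    """Per-category memory for weights, gradients and Adam state."""
    if mixed_precision:
        weights, grads = 2, 2   # fp16/bf16 working copy and its gradients
        adam = 4 + 4 + 4        # fp32 master copy, momentum, variance
    else:
        weights, grads = 4, 4   # fp32 weights and gradients
        adam = 4 + 4            # momentum, variance (weights are already fp32)
    bytes_per_param = {"weights": weights, "gradients": grads, "optimizer": adam}
    return {name: b * num_params / GIB for name, b in bytes_per_param.items()}


if __name__ == "__main__":
    breakdown = training_state_gib(1_000_000_000)
    for name, size in breakdown.items():
        print(f"{name:10s} {size:6.2f}")
    print(f"{'total':10s} {sum(breakdown.values()):6.2f}")
```

Under these assumptions both modes come to 16 bytes per parameter; mixed precision does not shrink the static state, it mainly reduces activation memory and speeds up compute.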
2.4 Activation Memory
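Activation size depends on hidden dimension, sequence length, micro‑batch size, number of attention heads, and parallelism settings. Following the Megatron‑LM paper on activation recomputation, the activation memory (GB) for fp16 activations with tensor parallelism and no recomputation is approximately:
L × s × b × h × (10 + 24 / t + 5 × a × s / (h × t)) × λ
s – sequence length (tokens)
b – micro‑batch size
h – hidden dimension size
a – number of attention heads
t – tensor‑parallel degree
L – number of transformer layers
λ – 1 / (1024³), converting bytes to GB (the byte counts in the formula assume fp16 activations)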
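To see how these variables interact, the sketch below evaluates the per-layer expression s × b × h × (10 + 24 / t + 5 × a × s / (h × t)) bytes from the Megatron‑LM activation‑recomputation paper, which assumes 16‑bit activations and tensor parallelism without sequence parallelism or recomputation. The function name and the example model shape (96 layers, hidden size 12288, 96 heads, 2048 tokens) are illustrative choices, not figures from this article.

```python
def activation_memory_gib(s, b, h, a, L, t=1):
    """Activation memory for L transformer layers, in 1024**3-byte units.

    s: sequence length, b: micro-batch size, h: hidden size,
    a: attention heads, L: number of layers, t: tensor-parallel degree.
    """
    per_layer_bytes = s * b * h * (10 + 24 / t + 5 * a * s / (h * t))
    return L * per_layer_bytes / 1024**3


if __name__ == "__main__":
    shape = dict(s=2048, b=1, h=12288, a=96, L=96)
    for t in (1, 2, 4, 8):
        print(f"t={t}: {activation_memory_gib(t=t, **shape):.2f}")
```

The attention term 5 × a × s / h grows with sequence length, so for long sequences it dominates the constant terms; this is the part that recomputation and kernels such as FlashAttention target.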
2.5 Parallelism Impact
Single‑GPU memory often cannot hold a large model, so parallel strategies are employed:
Tensor Parallel (TP)
Sequence Parallel (SP)
Pipeline Parallel (PP)
Zero Redundancy Optimizer (ZeRO stages 1/2/3)
Recomputation (checkpointing)
These techniques reduce the per‑GPU memory needed for parameters, optimizer state, activations, and gradients. For example, a rough per‑GPU estimate for a model trained with TP, PP, and ZeRO‑1 is:
(Model + Optimizer + Gradient + Activation) / (TP × PP) + Zero‑overhead
2.6 Inference Memory
Inference memory is simpler: total memory ≈ model parameters + activations (no optimizer state or gradients). A concise formula is provided in the original blog.
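A rough version of this estimate can be sketched as follows. One assumption goes beyond the text: for autoregressive decoding, the sketch treats the key/value cache as the dominant activation term and sizes it for standard multi‑head attention (two tensors of shape batch × sequence × hidden per layer). The function names and the example model shape are hypothetical.

```python
GIB = 1024**3


def weights_gib(num_params, bytes_per_param=2):
    """Memory for the weights alone (default: fp16/bf16)."""
    return num_params * bytes_per_param / GIB


def kv_cache_gib(num_layers, hidden, seq_len, batch, bytes_per_elem=2):
    """Key/value cache: 2 tensors (K and V) per layer, each
    of shape [batch, seq_len, hidden]. Assumes plain multi-head attention."""
    return 2 * num_layers * batch * seq_len * hidden * bytes_per_elem / GIB


def inference_gib(num_params, num_layers, hidden, seq_len, batch):
    """Weights plus KV cache; no optimizer state or gradients."""
    return weights_gib(num_params) + kv_cache_gib(num_layers, hidden, seq_len, batch)


if __name__ == "__main__":
    # Illustrative 7B-parameter shape: 32 layers, hidden size 4096.
    total = inference_gib(7_000_000_000, 32, 4096, seq_len=4096, batch=1)
    print(f"estimated inference memory: {total:.2f}")
```

The weights are a fixed cost, while the cache term scales linearly with both batch size and sequence length, so serving many long requests at once can exceed the weight memory.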
3. Practical Optimization Roadmap
Memory optimisation can be approached top‑down:
Apply multi‑GPU parallelism (TP/SP/PP/ZeRO).
Choose lower‑precision operators (fp16, int8, etc.).
Eliminate unnecessary framework copies.
Use memory‑management tricks (e.g., PyTorch's torch.cuda.empty_cache() and custom allocators).
Replace high‑memory kernels with efficient alternatives (e.g., FlashAttention).
Each step trades off compute, bandwidth, or latency against memory savings.
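As an illustration of the first roadmap step, the sketch below applies the per‑GPU formula from Section 2.5 literally and searches for the smallest tensor‑parallel degree that fits a given GPU capacity. The function names, the candidate degrees, and the component sizes in the example are hypothetical, and the formula's uniform division by TP × PP is a simplification, since activations and optimizer state do not shrink identically in practice.

```python
def per_gpu_training_gib(model, optimizer, gradient, activation,
                         tp=1, pp=1, zero_overhead=0.0):
    """(Model + Optimizer + Gradient + Activation) / (TP x PP) + overhead."""
    return (model + optimizer + gradient + activation) / (tp * pp) + zero_overhead


def smallest_tp_that_fits(components, capacity_gib, pp=1,
                          candidates=(1, 2, 4, 8)):
    """Return the first TP degree whose estimate fits, or None."""
    for tp in candidates:
        if per_gpu_training_gib(**components, tp=tp, pp=pp) <= capacity_gib:
            return tp
    return None


if __name__ == "__main__":
    # Made-up component sizes for illustration only.
    components = dict(model=26, optimizer=156, gradient=26, activation=40)
    for pp in (1, 2):
        tp = smallest_tp_that_fits(components, capacity_gib=80, pp=pp)
        print(f"pp={pp}: smallest tp that fits in 80 = {tp}")
```

In practice some headroom should be left below the nominal capacity for the temporary, autograd, and uncategorised allocations listed in Section 1.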
4. Conclusion
By breaking down GPU memory into its constituent parts and applying the presented formulas, engineers can accurately estimate memory requirements, identify the biggest contributors, and select appropriate parallelism or precision strategies to fit large models onto limited hardware.
