How Much GPU Memory Do LLMs Really Need? A Deep Dive into Training & Inference

This article breaks down the GPU memory requirements of large language models during training and inference, detailing the contributions of model weights, optimizer states, activations, KV cache, and activation recomputation, and provides concrete formulas, examples, and scaling insights for models like Qwen3 and DeepSeek V3.


Memory Usage During Training

During the forward pass each layer’s output (the activation) becomes the next layer’s input, so all intermediate activations must be kept in GPU memory until the backward pass. The total GPU memory required for training can be expressed as:

training_memory = model_weights + optimizer_state + activations + gradients

The optimizer state (for Adam-style optimizers, the momentum and variance terms) scales with the number of parameters, while activation memory grows linearly with batch_size and sequence length, making activations a primary memory bottleneck for large-language-model (LLM) training. Activation checkpointing (also called activation recomputation) reduces the activation term by discarding selected activations and recomputing them during back-propagation, at the cost of extra compute.
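As a rough illustration of the formula above, here is a minimal sketch of the per-parameter byte accounting under one common recipe, mixed-precision AdamW (bf16 weights and gradients, fp32 master weights and moments). The constants and the helper name training_memory_bytes are assumptions about that recipe, not figures from this article.

# Rough training-memory estimate in bytes; constants assume mixed-precision AdamW
# (bf16 weights and gradients, fp32 master weights + Adam momentum and variance).
def training_memory_bytes(params, activation_bytes):
    model_weights = params * 2              # bf16 weights
    gradients = params * 2                  # bf16 gradients
    optimizer_state = params * (4 + 4 + 4)  # fp32 master copy + momentum + variance
    return model_weights + optimizer_state + activation_bytes + gradients

# Example: a 0.6-billion-parameter model needs ~16 bytes per parameter (~9.6 GB)
# before a single activation is stored; activations and batch size add on top.
print(training_memory_bytes(0.6e9, 0) / 1e9, "GB")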

Memory Usage During Inference

Model‑weight memory

The memory needed to store model parameters depends on the number of parameters params and the data type size (dtype_size, in bytes). For fp16 or bf16, dtype_size = 2 bytes.

weight_memory = params * dtype_size  # bytes

Example: a Qwen3 0.6B model (≈0.6 billion parameters) in bf16 occupies about 1.2 GB of GPU memory.
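A minimal sketch of that calculation (the helper name weight_memory_bytes is hypothetical, and the parameter count is taken as exactly 0.6 billion for illustration):

# Bytes needed to hold the model weights alone
def weight_memory_bytes(params, dtype_size=2):
    return params * dtype_size

# Qwen3 0.6B in bf16 (2 bytes per parameter)
print(weight_memory_bytes(0.6e9) / 1e9, "GB")   # ~1.2 GB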

Intermediate‑activation memory

Unlike training, inference does not need to retain all forward activations. Only a few temporary tensors (e.g., the result of the most recent layer or the output of an operator) are kept, so the activation overhead is usually < 20 % of the weight memory and can be ignored for large models.

Peak activation memory is dominated by two components:

Self-attention: the attention-score matrix (of size batch_size × n_head × seq_len × seq_len) and the intermediate V matrix.

MLP (or SwiGLU) block: a traditional MLP stores the output of its first linear layer; Qwen3's SwiGLU block has three linear layers (gate, up, down) and must keep both the gate and up outputs simultaneously for the SiLU activation.

# Per-batch peak activation memory in bytes (bf16, so each element takes 2 bytes)
# Attention: the (batch, n_head, seq, seq) score matrix plus two (batch, seq, hidden) temporaries
memory_intermediate_attention = (
    2 * batch_size * n_head * seq_len**2
    + 4 * batch_size * seq_len * hidden_dim
)
# MLP: a traditional FFN keeps its 4*hidden-wide first-layer output; Qwen3's SwiGLU
# keeps both the gate and up outputs (intermediate size ~3*hidden, 2 bytes each)
memory_intermediate_mlp = {
    "traditional": 2 * batch_size * seq_len * 4 * hidden_dim,
    "Qwen3_SwiGLU": 2 * batch_size * seq_len * 3 * hidden_dim * 2,
}

In practice the attention term dominates for long sequences, while the MLP term dominates for short sequences.
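A quick numeric check of that crossover, using Qwen3-0.6B-like dimensions (n_head = 16, hidden_dim = 1024; treat these values as illustrative assumptions rather than figures from the article):

# Compare the two per-batch activation terms at several sequence lengths (bytes)
batch_size, n_head, hidden_dim = 1, 16, 1024

for seq_len in (128, 256, 1024, 8192):
    attention = 2 * batch_size * n_head * seq_len**2 + 4 * batch_size * seq_len * hidden_dim
    swiglu = 2 * batch_size * seq_len * 3 * hidden_dim * 2
    print(seq_len, attention, swiglu)
# For these dimensions the quadratic attention term overtakes the SwiGLU term
# once the sequence grows past a few hundred tokens.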

KV‑Cache memory

The KV‑cache stores key and value vectors for each token so that attention for previous positions can be reused. Its size grows linearly with sequence length and batch size.

Traditional multi-head attention (MHA): every layer stores a full key and value vector of width hidden_dim for each token.

kv_per_token = 2 * n_layers * hidden_dim * dtype_size  # bytes

Qwen3 (Grouped-Query Attention, GQA): the number of KV heads is smaller than the number of query heads, reducing the cache proportionally.

kv_per_token = 2 * n_layers * gqa_heads * head_dim * dtype_size  # bytes

DeepSeek V3 (Multi-Head Latent Attention, MLA): each layer stores only two compressed vectors per position, a 512-dimensional latent KV vector plus a small decoupled RoPE key, yielding a dramatically smaller cache.

kv_per_token_per_layer ≈ 1152  # bytes

Typical per-token KV sizes (summed over all layers):

Traditional MHA: 4 × n_layers × hidden_dim bytes in fp16/bf16, i.e., several hundred KB per token for a 13B-class model.

Qwen3 0.6 B (GQA): ≈112 KB per token.

DeepSeek V3 (MLA): ≈69 KB per token.
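These per-token figures can be reproduced with a small sketch. The configuration values (Qwen3 0.6B with 28 layers, 8 KV heads, and head_dim 128; DeepSeek V3 with 61 layers) are assumptions taken from the public model configs, not from this article.

# Per-token KV-cache size in bytes, summed over all layers
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_size=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_size

# Qwen3 0.6B (GQA): 28 layers, 8 KV heads, head_dim 128, bf16
print(kv_bytes_per_token(28, 8, 128) / 1024, "KB")   # ~112 KB

# DeepSeek V3 (MLA): 61 layers, one (512 + 64)-dim compressed vector per layer
print(61 * (512 + 64) * 2 / 1024, "KB")              # ~69 KB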

Total inference memory

The dominant terms are model weights and KV‑cache; activation memory is usually negligible. The overall memory estimate is:

total_inference_memory = weight_memory + kv_cache_memory
# kv_cache_memory = kv_per_token * batch_size * max_sequence_length
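Putting the pieces together as a small helper (a sketch only; kv_per_token is assumed to already include all layers, and the function name is hypothetical):

# Total inference memory in bytes for one serving configuration
def total_inference_memory(params, kv_per_token, batch_size, max_seq_len, dtype_size=2):
    weight_memory = params * dtype_size
    kv_cache_memory = kv_per_token * batch_size * max_seq_len
    return weight_memory + kv_cache_memory

# Qwen3 0.6B, bf16, batch 1, 2048-token context:
# ~1.2 GB of weights plus ~0.23 GB of KV cache, matching the example below
print(total_inference_memory(0.6e9, 112 * 1024, 1, 2048) / 1e9, "GB")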

Example calculations

Qwen3 0.6B (bf16):

Model weights: 1.192 GB

KV-cache (seq_len = 2048, batch = 1): ≈229 MB

Total: ≈1.4 GB

DeepSeek V3 (fp16):

Model weights (unquantized): ≈1.342 TB (requires model‑parallelism or quantization)

W8A8 quantized weights: ≈671 GB (fits on 16 × A100‑80 GB)

KV‑cache (seq_len = 2048): ≈141 MB per request

Batch-size impact: KV-cache scales linearly with batch_size. For LLaMA-13B the cache can be ~1.6× the model-parameter memory; for Qwen3 0.6B the cache quickly exceeds the 1.2 GB weight memory as batch size grows.
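A minimal sketch of that scaling for Qwen3 0.6B, reusing the ~112 KB/token figure from above (batch sizes are chosen purely for illustration):

# KV-cache grows linearly with batch size and quickly passes the weight memory
kv_per_token = 112 * 1024     # bytes per token, Qwen3 0.6B
weight_memory = 1.2e9         # bytes, Qwen3 0.6B bf16 weights
for batch_size in (1, 4, 8, 16):
    kv_cache = kv_per_token * batch_size * 2048
    print(batch_size, round(kv_cache / 1e9, 2), "GB KV cache,",
          "exceeds weights" if kv_cache > weight_memory else "below weights")
# Around batch 8 the cache (~1.9 GB) already outgrows the 1.2 GB of weights.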

Concurrency estimation

When the combined input + output length is large, KV‑cache dominates GPU memory. The per‑request memory is:

request_mem = kv_per_token * max_sequence_length

Example per‑request memory (max = 2048 tokens):

LLaMA‑13B: ≈2 GB → ~200 concurrent requests per 8‑card V100 server for 10 k total.

Qwen3 0.6 B: ≈229 MB → a single RTX 4090 (24 GB) can handle ~100 concurrent requests.

DeepSeek V3 (W8A8): ≈141 MB → 16 × A100‑80 GB can support ~4 300 concurrent requests.

These figures ignore latency constraints and framework overhead; real‑world concurrency is typically lower.
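A back-of-the-envelope sketch of those estimates (GPU counts and memory sizes as listed above; the helper is hypothetical and ignores activation scratch space and framework overhead):

# Rough upper bound on concurrency: whatever is left after weights goes to KV cache
def max_concurrency(gpu_mem_bytes, weight_bytes, kv_per_token, max_seq_len=2048):
    free_for_kv = gpu_mem_bytes - weight_bytes
    return int(free_for_kv // (kv_per_token * max_seq_len))

# Qwen3 0.6B on a single 24 GB RTX 4090
print(max_concurrency(24e9, 1.2e9, 112 * 1024))       # ~100 requests

# DeepSeek V3 (W8A8) on 16 x A100-80GB
print(max_concurrency(16 * 80e9, 671e9, 69 * 1024))   # ~4200 requests (article rounds to ~4300)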

Qualitative and quantitative conclusions

For short contexts the model‑weight memory dominates; for long contexts KV‑cache becomes the main consumer.

Per‑token memory consumption:

Traditional LLMs (e.g., LLaMA‑13B): ~1 MB/token.

Qwen3 0.6 B (GQA): ~112 KB/token.

DeepSeek V3 (MLA): ~69 KB/token.

Activation checkpointing can halve activation memory at the cost of extra compute, a useful trade-off on memory-constrained GPUs; a minimal sketch follows this list.

Smaller models (e.g., Qwen3 0.6 B) achieve dramatically lower deployment cost and higher concurrency compared with larger traditional models.
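As an illustration of that trade-off, here is a minimal PyTorch sketch of activation checkpointing. torch.utils.checkpoint is standard PyTorch, but the layer sizes below are arbitrary and the snippet is not taken from the original article.

# Minimal activation-checkpointing sketch: the block's intermediate activations
# are discarded after the forward pass and recomputed during backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 3072),
    torch.nn.SiLU(),
    torch.nn.Linear(3072, 1024),
)
x = torch.randn(8, 2048, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward without storing intermediates
y.sum().backward()                             # the block's forward is recomputed here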

Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
