Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues
This guide explains the five common sources of GPU memory consumption in large-model inference services, walks through a step-by-step diagnosis workflow that runs from static usage and KV-Cache analysis to concurrency and K8s scheduling, and offers concrete command-line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.
Overview
When a large‑model service reports CUDA out of memory, the instinctive reaction is to buy a bigger GPU, but the real cause often lies in five distinct memory consumers: static usage (model weights, compile cache, runtime reserve), KV‑Cache, request concurrency, context length, and framework overhead such as fragmentation or temporary buffers. Mis‑attributing the problem to model size alone leads to costly, ineffective fixes.
Memory Consumption Layers
Static usage : model weights, compilation cache, runtime reserve. Signal: high memory right after service start.
Dynamic usage : KV‑Cache, batch size, concurrent requests, long context. Signal: memory spikes as traffic rises.
Extra overhead : fragmentation, temporary buffers, CUDA graph. Signal: irregular memory fluctuations.
Applicable Scenarios
LLM inference services such as vLLM, TGI, SGLang encountering CUDA OOM.
Long‑context requests that keep GPU memory high while utilization stays low.
K8s deployments where scaling or pod restarts fail due to memory pressure.
Multi‑GPU setups where splitting does not eliminate KV‑Cache pressure.
First‑Round Diagnosis Chain
Confirm whether static memory is already near the GPU limit.
Check if KV‑Cache has been amplified by long context and concurrency.
Verify framework parameters and K8s scheduling boundaries.
Environment Matrix (Key Dimensions)
GPU model : A100, H100, L40S, A10, etc.
Deployment framework : vLLM, TGI, SGLang (different parameter names).
Model & precision : FP16, AWQ, GPTQ, FP8 (affects static usage).
Request profile : online chat, long‑context, batch processing (affects KV‑Cache).
Scheduling : single‑node, K8s, MIG, multi‑GPU (affects overall pressure).
Detailed Diagnostic Steps
2.1 Gather Basic Observations
nvidia-smi
nvidia-smi dmon -s mu -d 2
kubectl top pod -n llm
kubectl logs -n llm deploy/vllm-server --tail=200
2.2 Static Usage Breakdown
Static components include model weights, tensor‑parallel sharding, and framework reserve. Example metric extraction:
curl -s http://vllm-server:8000/metrics | grep -E 'gpu|kv|request' | head -40
If the service starts with 80-90% memory usage, any additional request will likely trigger OOM.
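A quick way to catch this before traffic arrives is to compare memory.used against memory.total right after startup. The sketch below is illustrative; the 80% threshold is an assumption you can tune.
static_usage_check.sh
#!/usr/bin/env bash
set -euo pipefail
# Flag GPUs whose baseline (post-start, pre-traffic) memory is already near the limit
threshold="${1:-80}"   # percent, assumed safety line
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits |
while IFS=', ' read -r idx used total; do
  pct=$(( used * 100 / total ))
  if (( pct >= threshold )); then
    echo "GPU ${idx}: ${used}/${total} MiB (${pct}%) - little headroom left for KV-Cache"
  else
    echo "GPU ${idx}: ${used}/${total} MiB (${pct}%)"
  fi
done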
2.3 KV‑Cache Impact
KV‑Cache grows with input_tokens × concurrency × per‑token KV cost. Approximate formula:
total_mem ≈ model_weights + KV_Cache + runtime_reserve + temp_buffer
KV_Cache ≈ concurrent_sequences × context_length × per_token_KV_cost
When both input length and concurrency increase, KV-Cache becomes the dominant pressure.
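To turn the formula into numbers, here is a minimal sketch; the model geometry (layers, KV heads, head dimension) and the FP16 cache assumption are illustrative placeholders, so substitute your model's actual config.
kv_cache_estimate.sh
#!/usr/bin/env bash
set -euo pipefail
# Rough KV-Cache sizing: per_token_KV_cost = 2 (K and V) x layers x kv_heads x head_dim x bytes_per_element
layers="${LAYERS:-32}"          # assumed transformer depth
kv_heads="${KV_HEADS:-8}"       # assumed number of KV heads (GQA)
head_dim="${HEAD_DIM:-128}"     # assumed head dimension
bytes="${BYTES:-2}"             # 2 bytes per element for an FP16/BF16 cache
concurrent_seqs="${1:-64}"
context_len="${2:-8192}"
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))
total=$(( concurrent_seqs * context_len * per_token ))
echo "per-token KV cost : ${per_token} bytes"
echo "KV-Cache estimate : $(( total / 1024 / 1024 / 1024 )) GiB for ${concurrent_seqs} seqs x ${context_len} tokens"
With these placeholder numbers the cache alone is roughly 64 GiB, which is why doubling either the context length or the concurrency can sink a GPU that looked comfortable at startup.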
2.4 Concurrency Effects
Key parameters to inspect:
--max-model-len=8192
--max-num-seqs=128
--gpu-memory-utilization=0.95
Combining a large max-model-len, high max-num-seqs, and aggressive memory utilization can cause OOM even if average request size is modest.
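A useful sanity check is to multiply the two limits out and compare against what the scheduler is actually holding; the metric names below are taken from recent vLLM builds and may differ by version.
echo "worst-case resident tokens: $(( 128 * 8192 ))"
curl -s http://vllm-server:8000/metrics | grep -E 'num_requests_(running|waiting)|gpu_cache_usage'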
2.5 Quantization & Tensor Parallelism
Quantization (FP16 → AWQ/GPTQ/FP8) reduces static weight size but does not eliminate KV-Cache pressure. For example, a 32B model drops from roughly 60 GiB of FP16 weights to under 20 GiB with 4-bit quantization, yet each resident token still costs the same KV memory unless the cache itself is also quantized. Multi-GPU sharding distributes weights but KV-Cache and concurrency remain.
python -m vllm.entrypoints.openai.api_server \
--model /models/Qwen2.5-32B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 4096 \
--max-num-seqs 16
2.6 K8s Scheduling Amplifiers
Common amplifiers include node mixing, overly strict probes, CPU‑only HPA, and MIG mis‑configuration. These can mask memory pressure or cause repeated restarts.
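As one concrete example, a startup or liveness probe that fires while a large model is still loading will restart the pod in a loop; a hedged patch sketch, assuming the deployment already defines a startupProbe on its first container:
kubectl patch deployment/vllm-server -n llm --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/startupProbe/failureThreshold","value":60}]'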
kubectl get pod -n llm -o wide
kubectl describe pod vllm-server-0 -n llm
kubectl get hpa -n llm
kubectl get pdb -n llm
2.7 Root-Cause Matrix & Recommended Actions
Weight too large / unsuitable quantization : switch quantization, downgrade model size, or upgrade GPU.
KV‑Cache amplification : split request pools, limit context length, reduce concurrency.
Parameters at the edge (e.g., gpu-memory-utilization set too high): lower it to leave a safety margin.
Improper multi‑GPU sharding : revisit tensor‑parallel settings.
K8s scheduling issues : adjust probes, HPA metrics, and node pool isolation.
Remediation Actions & Regression Verification
2.8 Immediate “stop‑bleed” Steps
kubectl scale deployment/vllm-server -n llm --replicas=0
# Replace the "server" container's args with conservative values (strategic merge patch keyed by container name)
kubectl patch deployment/vllm-server -n llm -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "server",
    "args": [
      "--model=/models/Qwen2.5-7B-Instruct",
      "--max-model-len=4096",
      "--max-num-seqs=32",
      "--gpu-memory-utilization=0.88"
    ]
  }]}}}
}'
kubectl scale deployment/vllm-server -n llm --replicas=2
Request Pooling Example (K8s manifest snippet)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-long-context
spec:
  template:
    spec:
      nodeSelector:
        pool: gpu-long
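For the nodeSelector above to match anything, the dedicated long-context nodes need the corresponding label; the node name below is a placeholder for your actual GPU node.
kubectl label node gpu-node-long-01 pool=gpu-long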
Verification Checklist
GPU memory no longer stays >95% for extended periods.
OOM and restart logs are zero.
TTFT and P99 latency return to SLA.
Long‑context requests no longer starve short‑query pool.
Sample Scripts
gpu_mem_snapshot.sh
#!/usr/bin/env bash
set -euo pipefail
count="${1:-60}"
for ((i=1;i<=count;i++)); do
echo "=== $(date '+%F %T') ==="
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv,noheader
sleep 2
done
llm_request_profile.sh
#!/usr/bin/env bash
set -euo pipefail
file="${1:?log file required}"
awk '{
  inp=""; outp=""                       # reset per log line ("in" is a reserved word in awk)
  for(i=1;i<=NF;i++){
    if($i ~ /^input_tokens=/){split($i,a,"="); inp=a[2]}
    if($i ~ /^output_tokens=/){split($i,b,"="); outp=b[2]}
  }
  if(inp != "" || outp != "") print inp, outp
}' "$file"
llm_metrics_capture.sh
#!/usr/bin/env bash
set -euo pipefail
url="${1:?metrics url required}"
out="/tmp/llm-metrics-$(date +%F-%H%M%S)"
mkdir -p "$out"
curl -s "$url" > "$out/metrics.txt"
nvidia-smi > "$out/nvidia-smi.txt"
Best Practices & Pitfalls
Perform a capacity budget: static usage + reserve + KV-Cache budget (a worked sketch follows this list).
Separate long‑context and short‑query traffic into distinct pools.
Maintain separate safe‑parameter and performance‑parameter templates; avoid on‑the‑fly edge tuning.
Monitor TTFT, queue depth, and memory together; single‑metric alerts miss KV‑Cache pressure.
Beware of common mis‑judgments: low GPU util ≠ no memory issue, quantization only reduces static weight, multi‑GPU does not eliminate dynamic pressure.
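A back-of-envelope budget, with every number below an illustrative assumption (an 80 GiB card, a hypothetical 32B FP16 model, a 6 GiB runtime reserve, and the 128 KiB-per-token KV cost from the estimate above):
capacity_budget.sh
#!/usr/bin/env bash
set -euo pipefail
gpu_gib=80            # total device memory (assumption)
weights_gib=60        # ~32B params x 2 bytes in FP16 (assumption)
reserve_gib=6         # runtime reserve + temp buffers (assumption)
per_token_kib=128     # per-token KV cost from kv_cache_estimate.sh (assumption)
kv_budget_gib=$(( gpu_gib - weights_gib - reserve_gib ))
tokens=$(( kv_budget_gib * 1024 * 1024 / per_token_kib ))
echo "KV budget: ${kv_budget_gib} GiB -> ~${tokens} resident tokens -> ~$(( tokens / 8192 )) concurrent 8K-token sequences"
If the planned max-num-seqs exceeds that last figure, the configuration is over-committed before the first request arrives.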
On‑Call Hand‑Over Checklist
1. Identify if the issue is static, KV‑Cache, or scheduling related.
2. Verify current max‑model‑len / max‑num‑seqs / gpu‑memory‑utilization.
3. Confirm request pooling (long vs short) is in place.
4. Record any temporary parameter changes for rollback.
5. Ensure OOM and reject counters are cleared after remediation.
Monitoring Recommendations
Collect and alert on the following dimensions:
GPU : memory usage, utilization, temperature – alert if >90% for 10 min (a minimal shell sketch follows this list).
Service : TTFT, queue depth, reject count – alert on high queue or reject bursts.
Requests : input/output token distribution – alert when long‑context bucket proportion spikes.
K8s : pod restarts, pending pods, HPA scaling – alert on GPU‑only scaling failures.
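The memory rule can be prototyped with a simple watchdog before it is ported to your metrics stack; the 30-second interval and single-GPU focus are assumptions, and in production a Prometheus/DCGM-exporter rule is the better home for this logic.
gpu_mem_alert.sh
#!/usr/bin/env bash
set -euo pipefail
threshold=90     # percent
breaches=0
while true; do
  pct=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits |
        awk -F', ' 'NR==1 {printf "%d", $1*100/$2}')
  if (( pct > threshold )); then breaches=$(( breaches + 1 )); else breaches=0; fi
  # 20 consecutive 30-second samples is roughly 10 minutes above the threshold
  if (( breaches >= 20 )); then echo "$(date '+%F %T') ALERT: GPU0 memory at ${pct}% for ~10 min"; breaches=0; fi
  sleep 30
done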
Conclusion
GPU memory problems in LLM serving are fundamentally a budgeting issue. By separating static and dynamic consumption, analyzing KV‑Cache growth, limiting concurrency and context length, and isolating traffic via request pools, most OOM incidents can be prevented without resorting to larger GPUs. Continuous monitoring of memory, queue, and token‑distribution metrics is essential for proactive remediation.