Why Your LLM Service Hits CUDA OOM and How to Diagnose GPU Memory Issues
This guide explains the five common sources of GPU memory consumption in large-model inference services, walks through a step-by-step diagnosis workflow that runs from static usage and KV-Cache analysis to concurrency and K8s scheduling, and offers concrete command-line checks, scripts, configuration examples, and actionable remediation and monitoring recommendations.
Overview
When a large‑model service reports CUDA out of memory, the instinctive reaction is to buy a bigger GPU, but the real cause often lies in five distinct memory consumers: static usage (model weights, compile cache, runtime reserve), KV‑Cache, request concurrency, context length, and framework overhead such as fragmentation or temporary buffers. Mis‑attributing the problem to model size alone leads to costly, ineffective fixes.
Memory Consumption Layers
Static usage : model weights, compilation cache, runtime reserve. Signal: high memory right after service start.
Dynamic usage : KV‑Cache, batch size, concurrent requests, long context. Signal: memory spikes as traffic rises.
Extra overhead : fragmentation, temporary buffers, CUDA graph. Signal: irregular memory fluctuations.
Applicable Scenarios
LLM inference services such as vLLM, TGI, SGLang encountering CUDA OOM.
Long‑context requests that keep GPU memory high while utilization stays low.
K8s deployments where scaling or pod restarts fail due to memory pressure.
Multi‑GPU setups where splitting does not eliminate KV‑Cache pressure.
First‑Round Diagnosis Chain
Confirm whether static memory is already near the GPU limit.
Check if KV‑Cache has been amplified by long context and concurrency.
Verify framework parameters and K8s scheduling boundaries.
Environment Matrix (Key Dimensions)
GPU model : A100, H100, L40S, A10, etc.
Deployment framework : vLLM, TGI, SGLang (different parameter names).
Model & precision : FP16, AWQ, GPTQ, FP8 (affects static usage).
Request profile : online chat, long‑context, batch processing (affects KV‑Cache).
Scheduling : single‑node, K8s, MIG, multi‑GPU (affects overall pressure).
Detailed Diagnostic Steps
2.1 Gather Basic Observations
nvidia-smi
nvidia-smi dmon -s mu -d 2
kubectl top pod -n llm
kubectl logs -n llm deploy/vllm-server --tail=200
2.2 Static Usage Breakdown
Static components include model weights, tensor‑parallel sharding, and framework reserve. Example metric extraction:
curl -s http://vllm-server:8000/metrics | grep -E 'gpu|kv|request' | head -40
If the service starts with 80-90% memory usage, any additional request will likely trigger OOM.
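A quick way to catch this before traffic arrives is to compare memory.used against memory.total right after startup. The sketch below is illustrative; the 80% threshold is an assumption you can tune.
static_usage_check.sh
#!/usr/bin/env bash
set -euo pipefail
# Flag GPUs whose baseline (post-start, pre-traffic) memory is already near the limit
threshold="${1:-80}"   # percent, assumed safety line
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits |
while IFS=', ' read -r idx used total; do
  pct=$(( used * 100 / total ))
  if (( pct >= threshold )); then
    echo "GPU ${idx}: ${used}/${total} MiB (${pct}%) - little headroom left for KV-Cache"
  else
    echo "GPU ${idx}: ${used}/${total} MiB (${pct}%)"
  fi
done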
2.3 KV‑Cache Impact
KV‑Cache grows with input_tokens × concurrency × per‑token KV cost. Approximate formula:
total_mem ≈ model_weights + KV_Cache + runtime_reserve + temp_buffer
KV_Cache ≈ concurrent_sequences × context_length × per_token_KV_cost
When both input length and concurrency increase, KV-Cache becomes the dominant pressure.
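To turn the formula into numbers, here is a minimal sketch; the model geometry (layers, KV heads, head dimension) and the FP16 cache assumption are illustrative placeholders, so substitute your model's actual config.
kv_cache_estimate.sh
#!/usr/bin/env bash
set -euo pipefail
# Rough KV-Cache sizing: per_token_KV_cost = 2 (K and V) x layers x kv_heads x head_dim x bytes_per_element
layers="${LAYERS:-32}"          # assumed transformer depth
kv_heads="${KV_HEADS:-8}"       # assumed number of KV heads (GQA)
head_dim="${HEAD_DIM:-128}"     # assumed head dimension
bytes="${BYTES:-2}"             # 2 bytes per element for an FP16/BF16 cache
concurrent_seqs="${1:-64}"
context_len="${2:-8192}"
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))
total=$(( concurrent_seqs * context_len * per_token ))
echo "per-token KV cost : ${per_token} bytes"
echo "KV-Cache estimate : $(( total / 1024 / 1024 / 1024 )) GiB for ${concurrent_seqs} seqs x ${context_len} tokens"
With these placeholder numbers the cache alone is roughly 64 GiB, which is why doubling either the context length or the concurrency can sink a GPU that looked comfortable at startup.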
2.4 Concurrency Effects
Key parameters to inspect:
--max-model-len=8192
--max-num-seqs=128
--gpu-memory-utilization=0.95
Combining a large max-model-len, high max-num-seqs, and aggressive memory utilization can cause OOM even if average request size is modest.
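A useful sanity check is to multiply the two limits out and compare against what the scheduler is actually holding; the metric names below are taken from recent vLLM builds and may differ by version.
echo "worst-case resident tokens: $(( 128 * 8192 ))"
curl -s http://vllm-server:8000/metrics | grep -E 'num_requests_(running|waiting)|gpu_cache_usage'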
2.5 Quantization & Tensor Parallelism
Quantization (FP16 → AWQ/GPTQ/FP8) reduces static weight size but does not eliminate KV-Cache pressure. For example, a 32B model drops from roughly 60 GiB of FP16 weights to under 20 GiB with 4-bit quantization, yet each resident token still costs the same KV memory unless the cache itself is also quantized. Multi-GPU sharding distributes weights but KV-Cache and concurrency remain.
python -m vllm.entrypoints.openai.api_server \
--model /models/Qwen2.5-32B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 4096 \
--max-num-seqs 16
2.6 K8s Scheduling Amplifiers
Common amplifiers include node mixing, overly strict probes, CPU‑only HPA, and MIG mis‑configuration. These can mask memory pressure or cause repeated restarts.
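As one concrete example, a startup or liveness probe that fires while a large model is still loading will restart the pod in a loop; a hedged patch sketch, assuming the deployment already defines a startupProbe on its first container:
kubectl patch deployment/vllm-server -n llm --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/startupProbe/failureThreshold","value":60}]'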
kubectl get pod -n llm -o wide
kubectl describe pod vllm-server-0 -n llm
kubectl get hpa -n llm
kubectl get pdb -n llm
2.7 Root-Cause Matrix & Recommended Actions
Weight too large / unsuitable quantization : switch quantization, downgrade model size, or upgrade GPU.
KV‑Cache amplification : split request pools, limit context length, reduce concurrency.
Parameters at the edge (e.g., gpu-memory-utilization set too high): lower it to leave a safety margin.
Improper multi‑GPU sharding : revisit tensor‑parallel settings.
K8s scheduling issues : adjust probes, HPA metrics, and node pool isolation.
Remediation Actions & Regression Verification
2.8 Immediate “stop‑bleed” Steps
kubectl scale deployment/vllm-server -n llm --replicas=0
# Replace the "server" container's args with conservative values (strategic merge patch keyed by container name)
kubectl patch deployment/vllm-server -n llm -p '{
  "spec": {"template": {"spec": {"containers": [{
    "name": "server",
    "args": [
      "--model=/models/Qwen2.5-7B-Instruct",
      "--max-model-len=4096",
      "--max-num-seqs=32",
      "--gpu-memory-utilization=0.88"
    ]
  }]}}}
}'
kubectl scale deployment/vllm-server -n llm --replicas=2
Request Pooling Example (K8s manifest snippet)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-long-context
spec:
  template:
    spec:
      nodeSelector:
        pool: gpu-long
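For the nodeSelector above to match anything, the dedicated long-context nodes need the corresponding label; the node name below is a placeholder for your actual GPU node.
kubectl label node gpu-node-long-01 pool=gpu-long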
Verification Checklist
GPU memory no longer stays >95% for extended periods.
OOM and restart logs are zero.
TTFT and P99 latency return to SLA.
Long‑context requests no longer starve short‑query pool.
Sample Scripts
gpu_mem_snapshot.sh
#!/usr/bin/env bash
set -euo pipefail
count="${1:-60}"
for ((i=1;i<=count;i++)); do
echo "=== $(date '+%F %T') ==="
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv,noheader
sleep 2
done
llm_request_profile.sh
#!/usr/bin/env bash
set -euo pipefail
file="${1:?log file required}"
awk '{
  inp=""; outp=""                       # reset per log line ("in" is a reserved word in awk)
  for(i=1;i<=NF;i++){
    if($i ~ /^input_tokens=/){split($i,a,"="); inp=a[2]}
    if($i ~ /^output_tokens=/){split($i,b,"="); outp=b[2]}
  }
  if(inp != "" || outp != "") print inp, outp
}' "$file"
llm_metrics_capture.sh
#!/usr/bin/env bash
set -euo pipefail
url="${1:?metrics url required}"
out="/tmp/llm-metrics-$(date +%F-%H%M%S)"
mkdir -p "$out"
curl -s "$url" > "$out/metrics.txt"
nvidia-smi > "$out/nvidia-smi.txt"
Best Practices & Pitfalls
Perform a capacity budget: static usage + reserve + KV-Cache budget (a worked sketch follows this list).
Separate long‑context and short‑query traffic into distinct pools.
Maintain separate safe‑parameter and performance‑parameter templates; avoid on‑the‑fly edge tuning.
Monitor TTFT, queue depth, and memory together; single‑metric alerts miss KV‑Cache pressure.
Beware of common mis‑judgments: low GPU util ≠ no memory issue, quantization only reduces static weight, multi‑GPU does not eliminate dynamic pressure.
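A back-of-envelope budget, with every number below an illustrative assumption (an 80 GiB card, a hypothetical 32B FP16 model, a 6 GiB runtime reserve, and the 128 KiB-per-token KV cost from the estimate above):
capacity_budget.sh
#!/usr/bin/env bash
set -euo pipefail
gpu_gib=80            # total device memory (assumption)
weights_gib=60        # ~32B params x 2 bytes in FP16 (assumption)
reserve_gib=6         # runtime reserve + temp buffers (assumption)
per_token_kib=128     # per-token KV cost from kv_cache_estimate.sh (assumption)
kv_budget_gib=$(( gpu_gib - weights_gib - reserve_gib ))
tokens=$(( kv_budget_gib * 1024 * 1024 / per_token_kib ))
echo "KV budget: ${kv_budget_gib} GiB -> ~${tokens} resident tokens -> ~$(( tokens / 8192 )) concurrent 8K-token sequences"
If the planned max-num-seqs exceeds that last figure, the configuration is over-committed before the first request arrives.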
On‑Call Hand‑Over Checklist
1. Identify if the issue is static, KV‑Cache, or scheduling related.
2. Verify current max‑model‑len / max‑num‑seqs / gpu‑memory‑utilization.
3. Confirm request pooling (long vs short) is in place.
4. Record any temporary parameter changes for rollback.
5. Ensure OOM and reject counters are cleared after remediation.
Monitoring Recommendations
Collect and alert on the following dimensions:
GPU : memory usage, utilization, temperature – alert if >90% for 10 min (a minimal shell sketch follows this list).
Service : TTFT, queue depth, reject count – alert on high queue or reject bursts.
Requests : input/output token distribution – alert when long‑context bucket proportion spikes.
K8s : pod restarts, pending pods, HPA scaling – alert on GPU‑only scaling failures.
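The memory rule can be prototyped with a simple watchdog before it is ported to your metrics stack; the 30-second interval and single-GPU focus are assumptions, and in production a Prometheus/DCGM-exporter rule is the better home for this logic.
gpu_mem_alert.sh
#!/usr/bin/env bash
set -euo pipefail
threshold=90     # percent
breaches=0
while true; do
  pct=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits |
        awk -F', ' 'NR==1 {printf "%d", $1*100/$2}')
  if (( pct > threshold )); then breaches=$(( breaches + 1 )); else breaches=0; fi
  # 20 consecutive 30-second samples is roughly 10 minutes above the threshold
  if (( breaches >= 20 )); then echo "$(date '+%F %T') ALERT: GPU0 memory at ${pct}% for ~10 min"; breaches=0; fi
  sleep 30
done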
Conclusion
GPU memory problems in LLM serving are fundamentally a budgeting issue. By separating static and dynamic consumption, analyzing KV‑Cache growth, limiting concurrency and context length, and isolating traffic via request pools, most OOM incidents can be prevented without resorting to larger GPUs. Continuous monitoring of memory, queue, and token‑distribution metrics is essential for proactive remediation.