Why Large‑Model Services Keep Running Out of GPU Memory: An Ops View from KV Cache to Concurrency
The article explains why large‑model inference services frequently hit GPU memory limits, breaks down static vs. dynamic memory consumption, shows how KV‑Cache, request length, and concurrency amplify usage, and provides a step‑by‑step troubleshooting and mitigation workflow for production environments.
Overview
When a large‑model service reports CUDA out of memory, the first instinct is to blame the model size and upgrade the GPU. In practice the memory pressure comes from at least five sources: static model weights, KV‑Cache, request concurrency, context length, and framework or scheduler overhead. A reliable diagnosis must separate static from dynamic usage, examine request profiles, and then verify framework and K8s limits.
Memory Consumption Layers
Static usage includes model weights, compilation caches, and runtime reservations. It is evident when the service starts with high memory consumption.
Dynamic usage consists of KV‑Cache, batch size, concurrent sequences, and long context buffers. These values rise sharply as traffic increases.
Extra overhead such as fragmentation, temporary buffers, and CUDA graphs can cause sudden spikes.
First‑Round Diagnosis Flow
Check whether static usage is already near the GPU limit.
Determine if KV‑Cache has been amplified by long contexts and high concurrency.
Verify that framework parameters (e.g., --gpu-memory-utilization) are not set too close to the limit.
Inspect K8s scheduling, MIG configuration, node sharing, and resource reclamation.
Key Commands for Observation
# Basic GPU info
nvidia-smi
# Real‑time monitoring
nvidia-smi dmon -s mu -d 2
# K8s pod and deployment details
kubectl top pod -n llm
kubectl logs -n llm deploy/vllm-server --tail=200
# Service metrics
curl -s http://vllm-server:8000/metrics | grep -E 'gpu|kv|request'
Static Usage Example
If nvidia-smi shows 80-90% memory usage immediately after startup, little headroom is left for KV‑Cache, so any increase in request length or concurrency will likely trigger OOM.
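To put a number on that startup check, the used-to-total ratio can be computed directly. This is a minimal sketch: `static_usage_pct` is a hypothetical helper name, and it assumes both arguments are in the same unit (bytes from a metrics endpoint, or MiB as printed by nvidia-smi).

```shell
#!/usr/bin/env bash
# Percentage of GPU memory already in use, e.g. right after service startup.
static_usage_pct() {
  # args: used total (same unit for both)
  awk -v u="$1" -v t="$2" 'BEGIN { if (t <= 0) exit 1; printf "%.1f\n", 100 * u / t }'
}
```

For the sample metric values that follow, `static_usage_pct 6.8e10 8.2e10` comes to roughly 83%, which already sits inside the 80-90% band.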
gpu_memory_used_bytes 6.8e+10
gpu_memory_total_bytes 8.2e+10
KV‑Cache as the Main Amplifier
KV‑Cache grows roughly as:
KV Cache ≈ concurrent_sequences × context_length × per_token_KV_cost
When both concurrency and context length increase, KV‑Cache becomes the dominant memory consumer.
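The relation above can be turned into a rough estimate. The sketch below assumes a standard transformer layout where the per-token KV cost is 2 (one K and one V tensor) × layers × KV heads × head dimension × bytes per element. The helper names and the model shape used in the example (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical and not taken from any specific model.

```shell
#!/usr/bin/env bash
# Per-token KV cost in bytes: 2 (K and V) x layers x kv_heads x head_dim x bytes per element.
kv_bytes_per_token() {
  # args: num_layers num_kv_heads head_dim bytes_per_element
  awk -v l="$1" -v h="$2" -v d="$3" -v b="$4" 'BEGIN { printf "%d\n", 2 * l * h * d * b }'
}

# Total KV cache in GiB for a given concurrency and context length.
kv_cache_gib() {
  # args: concurrent_sequences context_length per_token_kv_bytes
  awk -v s="$1" -v c="$2" -v p="$3" 'BEGIN { printf "%.1f\n", s * c * p / (1024 ^ 3) }'
}

# Hypothetical model shape, used only for illustration.
per_token="$(kv_bytes_per_token 32 8 128 2)"
kv_cache_gib 16 2048 "$per_token"    # light traffic: 16 sequences of 2048 tokens
kv_cache_gib 128 8192 "$per_token"   # heavy traffic: 128 sequences of 8192 tokens
```

Under these assumptions the light case needs about 4 GiB while the heavy case needs about 128 GiB: raising concurrency 8× and context length 4× multiplies the demand 32×.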
Dynamic Diagnosis Steps
Inspect request logs for input and output token counts (e.g., input_tokens=7420).
Bucket requests (S, M, L, XL) based on token ranges to see if long‑context traffic has surged.
Correlate queue depth and reject counters with memory usage.
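The bucketing step can be sketched with awk. This assumes log lines carry a whitespace-separated `input_tokens=N` field, as in the example above. The function name and the bucket boundaries (512, 2048, 4096 tokens) are hypothetical and should be adjusted to the service's real traffic profile.

```shell
#!/usr/bin/env bash
# Count requests per size bucket from log lines read on stdin.
bucket_requests() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^input_tokens=[0-9]+$/) {
        split($i, kv, "=")
        n = kv[2] + 0
        # Hypothetical boundaries: S <= 512 < M <= 2048 < L <= 4096 < XL
        if (n <= 512) b = "S"
        else if (n <= 2048) b = "M"
        else if (n <= 4096) b = "L"
        else b = "XL"
        count[b]++
      }
    }
  }
  END {
    printf "S %d\nM %d\nL %d\nXL %d\n", count["S"], count["M"], count["L"], count["XL"]
  }'
}
```

Running it over two time windows, for example `bucket_requests < before.log` and `bucket_requests < after.log` (file names are placeholders), shows whether the L and XL share has grown.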
Concurrency and Model Parameters
Typical aggressive settings that cause OOM:
--max-model-len=8192 (high)
--max-num-seqs=128 (high)
--gpu-memory-utilization=0.95 (edge)
Even if the average request stays below limits, a burst of long‑context requests can fill the KV‑Cache and crash the service.
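A worst-case bound makes the risk concrete: how many sequences at the full length limit fit into the memory left for KV‑Cache? The sketch below is an estimate under stated assumptions, not a description of any framework's allocator. The helper name, the 40 GiB budget, and the 128 KiB per-token KV cost are all hypothetical.

```shell
#!/usr/bin/env bash
# How many sequences of full --max-model-len length fit into a KV budget?
max_full_length_seqs() {
  # args: kv_budget_gib max_model_len per_token_kv_bytes
  awk -v g="$1" -v n="$2" -v p="$3" 'BEGIN { printf "%d\n", int(g * (1024 ^ 3) / (n * p)) }'
}

# Hypothetical numbers: 40 GiB KV budget, 8192-token limit, 128 KiB of KV per token.
max_full_length_seqs 40 8192 131072
```

With these numbers only 40 full-length sequences fit, well below a --max-num-seqs of 128. Frameworks that allocate KV memory on demand will not hit this bound under average traffic, which is why the problem tends to surface only during a burst of long requests.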
Quantization and Tensor Parallelism
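Switching from FP16 to AWQ/GPTQ reduces static weight memory but does not eliminate KV‑Cache pressure, because the KV‑Cache usually stays in 16-bit precision unless it is quantized separately. Multi‑GPU tensor parallelism shards the weights across devices (and, in most implementations, the KV‑Cache by attention head), but per-device KV‑Cache demand still grows with concurrency and context length, so OOM may be delayed rather than solved.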
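The weight-side saving is simple arithmetic: parameter count × bits per weight. The sketch below uses a hypothetical helper name and a hypothetical 7-billion-parameter model. It ignores quantization metadata such as scales and zero points, so real 4-bit checkpoints come out somewhat larger than the figure it prints.

```shell
#!/usr/bin/env bash
# Approximate weight memory in GiB for a given parameter count and precision.
weight_gib() {
  # args: params_in_billions bits_per_weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / (1024 ^ 3) }'
}

weight_gib 7 16   # FP16 weights
weight_gib 7 4    # 4-bit weights (AWQ/GPTQ style), metadata overhead ignored
```

The difference (about 13 GiB versus about 3.3 GiB here) is a one-time gain in static memory. It raises the KV‑Cache budget but does not change how fast that budget is consumed per token.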
K8s Scheduling Amplifiers
Common factors that amplify memory pressure:
GPU node sharing with other workloads.
Over‑strict liveness probes causing rapid restarts.
HPA that scales only on CPU, ignoring GPU memory.
MIG partitions that are too small for a single instance.
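Restart loops caused by probes or OOM kills show up in the RESTARTS column of `kubectl get pods`. The following sketch filters that table. The function name and the default threshold of 3 are hypothetical, and it assumes the default column layout (NAME READY STATUS RESTARTS AGE).

```shell
#!/usr/bin/env bash
# Print pods whose restart count is at or above a threshold.
flag_restarting_pods() {
  # arg: restart threshold (default 3); stdin: `kubectl get pods` table output
  awk -v min="${1:-3}" 'NR > 1 && ($4 + 0) >= min { print $1, $4 }'
}

# Usage against the namespace from the commands above:
#   kubectl get pods -n llm | flag_restarting_pods 3
```

A pod that appears here while GPU memory is near its limit is worth checking against probe timeouts first, since a model that is still loading weights can be killed before it ever serves traffic.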
Root‑Cause Matrix (Simplified)
Root Cause              Typical Signal                    Recommended Action
----------------------------------------------------------------------------------------------------------
Weight too large        Service starts near full memory   Switch quantization / smaller model
KV‑Cache blow‑up        Long context + high concurrency   Bucket traffic, limit length, reduce concurrency
Parameter edge          gpu-memory-utilization > 0.9      Add safety margin
Tensor‑Parallel misuse  High memory despite TP scaling    Review TP/shard strategy
K8s scheduling issues   Restarts, mixed‑node usage        Adjust pod affinity, probes, HPA metrics
Mitigation Checklist
Calculate static memory usage first, then estimate KV‑Cache budget.
Separate long‑context and short‑query traffic into different pools.
Maintain a separate safe‑parameter template for each pool (e.g., a lower --max-model-len for the short‑query pool).
Monitor GPU memory, queue depth, and TTFT together; do not rely on GPU utilization alone.
When OOM occurs, immediately reduce --max-num-seqs or --max-model-len as a stop‑gap.
Any permanent changes (quantization, TP, pool redesign) must be validated with load tests before deployment.
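The first checklist item can be sketched as a subtraction. This assumes a simplified model in which the framework may use `total × gpu-memory-utilization` and everything not taken by static usage is available to KV‑Cache. Real frameworks also reserve memory for activations and other buffers, so treat the result as an upper bound. The helper name and the 80 GiB / 60 GiB figures are hypothetical.

```shell
#!/usr/bin/env bash
# Upper bound on the KV-Cache budget in GiB under a simplified memory model.
kv_budget_gib() {
  # args: total_gib static_gib gpu_memory_utilization
  awk -v t="$1" -v s="$2" -v u="$3" 'BEGIN { printf "%.1f\n", t * u - s }'
}

# Hypothetical device: 80 GiB total, 60 GiB static usage after startup.
kv_budget_gib 80 60 0.90
kv_budget_gib 80 60 0.95
```

Here the two settings differ by only 4 GiB of KV budget (12 GiB versus 16 GiB), while the higher one leaves less slack for fragmentation and temporary buffers. A comparison of this kind is useful when picking a safety margin.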
Sample Scripts
#!/usr/bin/env bash
# gpu_mem_snapshot.sh – periodic GPU memory snapshot
# Usage: gpu_mem_snapshot.sh [count]   (default: 60 samples, 2 s apart)
set -euo pipefail
count="${1:-60}"
for ((i=1;i<=count;i++)); do
  echo "=== $(date '+%F %T') ==="
  nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv,noheader
  sleep 2
done

#!/usr/bin/env bash
# llm_request_profile.sh – summarize input/output token distribution
set -euo pipefail
file="${1:?log file required}"
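awk '{
  # Reset per line so a missing field is not filled from the previous request
  inp = ""; out = ""
  for (i = 1; i <= NF; i++) {
    # "in" is a reserved word in awk, so the input count is stored in "inp"
    if ($i ~ /^input_tokens=/)  { split($i, a, "="); inp = a[2] }
    if ($i ~ /^output_tokens=/) { split($i, b, "="); out = b[2] }
  }
  # Print only lines that carried at least one token field
  if (inp != "" || out != "") print inp, out
}' "$file"
Final Recommendations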
Memory exhaustion in LLM services is fundamentally a capacity‑budget problem. By first accounting for static weight usage, then sizing KV‑Cache based on realistic concurrency and context lengths, operators can avoid blind upgrades and instead apply targeted actions such as request bucketing, parameter safety margins, and proper K8s scheduling.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
