Why Large‑Model Services Keep Running Out of GPU Memory: An Ops View from KV Cache to Concurrency

The article explains why large‑model inference services frequently hit GPU memory limits, breaks down static vs. dynamic memory consumption, shows how KV‑Cache, request length, and concurrency amplify usage, and provides a step‑by‑step troubleshooting and mitigation workflow for production environments.

Raymond Ops

Overview

When a large‑model service reports CUDA out of memory, the first instinct is to blame the model size and upgrade the GPU. In practice the memory pressure comes from at least five sources: static model weights, KV‑Cache, request concurrency, context length, and framework or scheduler overhead. A reliable diagnosis must separate static from dynamic usage, examine request profiles, and then verify framework and K8s limits.

Memory Consumption Layers

Static usage includes model weights, compilation caches, and runtime reservations. It shows up as high memory consumption immediately after startup, before any traffic arrives.

Dynamic usage is driven by KV‑Cache, batch size, the number of concurrent sequences, and long‑context buffers. It rises sharply as traffic increases.

Extra overhead such as fragmentation, temporary buffers, and CUDA graphs can cause sudden spikes.

First‑Round Diagnosis Flow

Check whether static usage is already near the GPU limit.

Determine if KV‑Cache has been amplified by long contexts and high concurrency.

Verify that framework parameters (e.g., --gpu-memory-utilization) are not set too close to the limit.

Inspect K8s scheduling, MIG configuration, node sharing, and resource reclamation.

Key Commands for Observation

# Basic GPU info
nvidia-smi
# Real‑time monitoring
nvidia-smi dmon -s mu -d 2
# K8s pod and deployment details
kubectl top pod -n llm
kubectl logs -n llm deploy/vllm-server --tail=200
# Service metrics
curl -s http://vllm-server:8000/metrics | grep -E 'gpu|kv|request'

Static Usage Example

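If nvidia-smi shows 80 to 90% memory usage immediately after startup, little headroom remains for KV‑Cache, so any increase in request length or concurrency is likely to trigger OOM. In the sample metrics below, about 83% of the device is already in use: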

gpu_memory_used_bytes 6.8e+10
gpu_memory_total_bytes 8.2e+10
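
The same check can be scripted against nvidia-smi output taken right after startup. The following is a minimal sketch, not part of the original tooling: the function name static_headroom and the 80% warning threshold are assumptions chosen for illustration.

```shell
# static_headroom [THRESHOLD_PCT]: hypothetical helper.
# Reads "used, total" pairs in MiB on stdin, one line per GPU, as produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# The default threshold of 80 is an assumption for illustration.
static_headroom() {
  awk -F', *' -v t="${1:-80}" '
    NF >= 2 && $2 > 0 {
      pct = 100 * $1 / $2
      printf "gpu%d used=%.1f%% free_mib=%d %s\n", NR - 1, pct, $2 - $1, (pct >= t ? "WARN" : "OK")
    }'
}
```

A possible invocation before any traffic is admitted: nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | static_headroom 80. A WARN line at that point suggests the problem is static usage rather than request load.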

KV‑Cache as the Main Amplifier

KV‑Cache grows roughly as:

KV Cache ≈ concurrent_sequences × context_length × per_token_KV_cost

When both concurrency and context length increase, KV‑Cache becomes the dominant memory consumer.
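
To make per_token_KV_cost concrete: in a standard transformer decoder, each token stores one key vector and one value vector per layer, so the cost is roughly 2 × layers × KV heads × head dimension × bytes per element. The sketch below applies that arithmetic. The function names and the model dimensions (32 layers, 32 KV heads, head dimension 128, 2 bytes per element for FP16) are assumptions for illustration, not figures from the article.

```shell
# kv_bytes_per_token LAYERS KV_HEADS HEAD_DIM BYTES_PER_ELEM
# The factor 2 accounts for storing both a key and a value per layer.
kv_bytes_per_token() {
  echo $(( 2 * $1 * $2 * $3 * $4 ))
}

# kv_cache_bytes CONCURRENT_SEQS CONTEXT_LEN PER_TOKEN_BYTES
# Worst case: every sequence fills its whole context window.
kv_cache_bytes() {
  echo $(( $1 * $2 * $3 ))
}

# Example with assumed dimensions (illustrative only):
per_token=$(kv_bytes_per_token 32 32 128 2)      # bytes per token
total=$(kv_cache_bytes 128 8192 "$per_token")    # 128 sequences x 8192 tokens
echo "per token: $per_token bytes"
echo "worst case: $(( total / 1024 / 1024 / 1024 )) GiB"
```

With these assumed dimensions the script prints 524288 bytes per token and a worst case of 512 GiB. Real usage is lower because few requests fill the whole window, but the number shows how quickly concurrency multiplied by context length outgrows a single device.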

Dynamic Diagnosis Steps

Inspect request logs for input and output token counts (e.g., input_tokens=7420).

Bucket requests (S, M, L, XL) based on token ranges to see if long‑context traffic has surged.

Correlate queue depth and reject counters with memory usage.
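
The bucketing step can be sketched with awk over the request log. This is an illustrative helper, assuming log lines carry an input_tokens=N field as in the example above; the function name bucket_requests and the bucket boundaries (512, 2048, 8192) are hypothetical and should be adapted to the actual traffic.

```shell
# bucket_requests: hypothetical helper, reads log lines on stdin and counts
# requests per input-length bucket. Bucket boundaries are assumptions.
bucket_requests() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^input_tokens=/) {
          split($i, kv, "=")
          n = kv[2] + 0
          if      (n <= 512)  b = "S"
          else if (n <= 2048) b = "M"
          else if (n <= 8192) b = "L"
          else                b = "XL"
          count[b]++
        }
      }
    }
    END {
      # Fixed print order so output is stable across awk implementations.
      printf "S=%d M=%d L=%d XL=%d\n", count["S"], count["M"], count["L"], count["XL"]
    }'
}
```

Running it over a quiet window and over the window just before an OOM, then comparing the L and XL counts, is one way to see whether long‑context traffic has surged.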

Concurrency and Model Parameters

Typical aggressive settings that cause OOM:

--max-model-len=8192 (high)
--max-num-seqs=128 (high)
--gpu-memory-utilization=0.95 (edge)

Even if the average request stays below limits, a burst of long‑context requests can fill the KV‑Cache and crash the service.
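
A rough budget check shows why these settings are risky. The sketch below uses a deliberately simplified model, which is an assumption rather than any framework's exact accounting: KV budget = total memory × utilization fraction − weights, ignoring activation and framework overhead. The function name kv_budget_check and the hardware and model figures (80 GiB GPU, 14 GiB of weights, 524288 bytes of KV per token) are illustrative.

```shell
# kv_budget_check TOTAL_GIB UTIL WEIGHTS_GIB PER_TOKEN_BYTES MAX_MODEL_LEN MAX_NUM_SEQS
# Simplified model (assumption): KV budget = total * util - weights.
# Activation and framework overhead are ignored, so the real budget is smaller.
kv_budget_check() {
  awk -v total="$1" -v util="$2" -v weights="$3" \
      -v per_token="$4" -v len="$5" -v seqs="$6" 'BEGIN {
    gib = 1024 * 1024 * 1024
    budget_gib = total * util - weights
    per_seq_gib = per_token * len / gib          # one full-length sequence
    fit = int(budget_gib / per_seq_gib)          # full-length sequences that fit
    printf "kv_budget_gib=%.1f full_len_seqs_fit=%d configured=%d %s\n",
           budget_gib, fit, seqs, (fit >= seqs ? "OK" : "OVERCOMMITTED")
  }'
}

# Settings from the text, with assumed hardware and model figures:
# 80 GiB GPU, 14 GiB of weights, 524288 bytes of KV per token.
kv_budget_check 80 0.95 14 524288 8192 128
```

Under these assumptions the budget is 62 GiB and only 15 full‑length sequences fit, far fewer than the 128 configured, so a burst of roughly 16 long requests is enough to exhaust the cache.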

Quantization and Tensor Parallelism

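Switching from FP16 to AWQ/GPTQ reduces static weight memory but does not eliminate KV‑Cache pressure, because weight‑only quantization leaves the cache in its original precision unless KV‑Cache quantization is enabled separately. Multi‑GPU tensor parallelism shards the weights, and typically the KV‑Cache as well (by attention head), across devices, but total KV‑Cache demand still grows with concurrency and context length, so OOM may be delayed rather than solved.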

K8s Scheduling Amplifiers

Common factors that amplify memory pressure:

GPU node sharing with other workloads.

Over‑strict liveness probes causing rapid restarts.

HPA that scales only on CPU, ignoring GPU memory.

MIG partitions that are too small for a single instance.

Root‑Cause Matrix (Simplified)

Root Cause               Typical Signal                    Recommended Action
--------------------------------------------------------------------------------------------
Weight too large         Service starts near full memory   Switch quantization / smaller model
KV‑Cache blow‑up         Long context + high concurrency   Bucket traffic, limit length, reduce concurrency
Parameter edge           gpu-memory-utilization > 0.9      Add safety margin
Tensor‑Parallel misuse   High memory despite TP scaling    Review TP/shard strategy
K8s scheduling issues    Restarts, mixed‑node usage        Adjust pod affinity, probes, HPA metrics

Mitigation Checklist

Calculate static memory usage first, then estimate KV‑Cache budget.

Separate long‑context and short‑query traffic into different pools.

Maintain separate safe‑parameter templates (e.g., a lower --max-model-len for the short pool).

Monitor GPU memory, queue depth, and time to first token (TTFT) together; do not rely on GPU utilization alone.

When OOM occurs, immediately reduce --max-num-seqs or --max-model-len as a stop‑gap.

Any permanent changes (quantization, TP, pool redesign) must be validated with load tests before deployment.
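
The per‑pool templates from the checklist can be kept as a small helper so that each pool always launches with its own limits. This is only a sketch: the function name pool_flags and every numeric value are hypothetical, and real values should come from the static‑usage and KV‑Cache budget calculations for the deployed model.

```shell
# pool_flags POOL: hypothetical helper that prints the launch flags for a pool.
# All numeric values are illustrative assumptions, not recommendations.
pool_flags() {
  case "$1" in
    short) echo "--max-model-len=2048 --max-num-seqs=64 --gpu-memory-utilization=0.88" ;;
    long)  echo "--max-model-len=8192 --max-num-seqs=8 --gpu-memory-utilization=0.88" ;;
    *)     echo "unknown pool: $1" >&2; return 1 ;;
  esac
}
```

The output can be appended to the server's launch command for each deployment. The short pool trades context length for concurrency and the long pool does the opposite, which keeps the worst‑case KV‑Cache demand of each pool bounded.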

Sample Scripts

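#!/usr/bin/env bash
# gpu_mem_snapshot.sh: periodic GPU memory snapshot
# Usage: ./gpu_mem_snapshot.sh [count]   (default: 60 samples, 2 s apart)
set -euo pipefail
count="${1:-60}"
for ((i = 1; i <= count; i++)); do
  echo "=== $(date '+%F %T') ==="
  nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv,noheader
  sleep 2
done

#!/usr/bin/env bash
# llm_request_profile.sh: print input/output token counts per request
# Usage: ./llm_request_profile.sh <log file>
set -euo pipefail
file="${1:?log file required}"
awk '{
  # Reset per line so a value never leaks from a previous log line.
  in_tok = ""; out_tok = ""
  for (i = 1; i <= NF; i++) {
    # "in" is a reserved word in awk, so the variables are named in_tok/out_tok.
    if ($i ~ /^input_tokens=/)  { split($i, a, "="); in_tok = a[2] }
    if ($i ~ /^output_tokens=/) { split($i, b, "="); out_tok = b[2] }
  }
  # Print only lines that carry both counters.
  if (in_tok != "" && out_tok != "") print in_tok, out_tok
}' "$file"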

Final Recommendations

Memory exhaustion in LLM services is fundamentally a capacity‑budget problem. By first accounting for static weight usage, then sizing KV‑Cache based on realistic concurrency and context lengths, operators can avoid blind upgrades and instead apply targeted actions such as request bucketing, parameter safety margins, and proper K8s scheduling.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
