Taming vLLM OOM: Real‑World Causes and Proven Fixes for Production
This article examines why vLLM experiences out‑of‑memory errors in production, explains memory fragmentation caused by PagedAttention, outlines four typical OOM scenarios with concrete command‑line solutions, and provides deep analysis, configuration scripts, dynamic tuning, troubleshooting flowcharts, monitoring alerts, and best‑practice recommendations.
Overview
vLLM introduced PagedAttention in 2023, which partitions KV cache into fixed‑size blocks to reduce memory waste. In production, however, fragmentation and OOM become serious problems, especially under high concurrency and variable‑length requests.
Why fragmentation occurs
Traditional KV cache management pre‑allocates
max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size bytes per request (the factor of 2 covers the K and V tensors). If the actual token count is far smaller, that memory is wasted, and heterogeneous request sizes make batching difficult.
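To make the waste concrete, here is a minimal sketch of the formula above; the model dimensions are illustrative (roughly a 72B‑class model using grouped‑query attention, where num_heads means the KV‑head count), not taken from any specific config file:

```python
def kv_cache_bytes(max_seq_len, num_layers, num_heads, head_dim, dtype_size=2):
    # 2x for the K and V tensors; dtype_size=2 assumes fp16/bf16
    return max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size

# Illustrative dimensions: 80 layers, 8 KV heads, head_dim 128, 32k context
per_request = kv_cache_bytes(32768, 80, 8, 128)
print(f"{per_request / 1024**3:.1f} GB pre-allocated per request")  # 10.0 GB
```

If the request actually produces only a few hundred tokens, nearly all of that reservation sits idle, which is exactly the waste PagedAttention targets.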
PagedAttention splits KV cache into 16‑token blocks and allocates them dynamically, but two fragmentation patterns appear:
Internal fragmentation: a request that uses only part of a block leaves most of that block unused (e.g., 17 tokens need two blocks, and the second block is 93.75 % empty).
External fragmentation: mixing short and long requests can leave gaps that prevent a new long request from finding a contiguous set of blocks.
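The 93.75 % figure can be verified in a couple of lines (block size 16 is vLLM's default):

```python
import math

def block_usage(num_tokens, block_size=16):
    # Blocks needed, and the unused fraction of the final block
    blocks = math.ceil(num_tokens / block_size)
    tokens_in_last = num_tokens - (blocks - 1) * block_size
    waste_in_last = 1 - tokens_in_last / block_size
    return blocks, waste_in_last

blocks, waste = block_usage(17)
print(blocks, f"{waste:.2%}")  # 2 blocks, second block 93.75% empty
```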
Environment
Component | Version
----------|--------
vLLM      | 0.5.4 / 0.6.x
CUDA      | 12.1
PyTorch   | 2.3.0
GPU       | A100 80GB / A800 80GB
Model     | Qwen2-72B-Instruct

Four Typical OOM Scenarios
1. Startup OOM
vLLM loads model weights and pre‑allocates the KV‑cache block pool. The default --gpu-memory-utilization of 0.90 can fill the GPU, leaving insufficient space for the cache.
Solution:
# Option 1: limit GPU memory utilization
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.85 # safe value
# Option 2: CPU offload (reduces performance)
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--cpu-offload-gb 20

2. Runtime burst OOM
Sudden influx of long prompts, many prefill requests, or a few extremely long contexts can exceed the KV‑cache pool.
Example log:
2024-03-15 14:32:45 ERROR vllm.worker: CUDA out of memory during forward pass
2024-03-15 14:32:45 ERROR vllm.engine: Request req_12345 failed: OOM

Solution: limit the number of tokens processed per batch.
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--max-num-seqs 32 \
--max-model-len 32768 \
--max-num-batched-tokens 16384

The flag --max-num-batched-tokens caps the token count per forward pass, preventing OOM at the cost of reduced throughput. Recommended values based on GPU memory:
GPU Memory | Recommended max-num-batched-tokens
-----------|-----------------------------------
24 GB      | 4096
40 GB      | 8192
80 GB      | 16384-24576

3. Chronic OOM in long-running services
After days of continuous operation, the PyTorch CUDA allocator itself can fragment, even though vLLM manages the KV cache at the block level.
Diagnostic script (prints fragmentation metrics):
import torch

def print_cuda_memory_stats():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:")
        print(f"  Allocated: {torch.cuda.memory_allocated(i)/1024**3:.2f} GB")
        print(f"  Reserved:  {torch.cuda.memory_reserved(i)/1024**3:.2f} GB")
        print(f"  Free:      {(torch.cuda.memory_reserved(i)-torch.cuda.memory_allocated(i))/1024**3:.2f} GB")
        stats = torch.cuda.memory_stats(i)
        # Fragmentation = share of reserved memory that is not actively used
        active = stats.get('active_bytes.all.peak', 0)
        reserved = max(stats.get('reserved_bytes.all.peak', 1), 1)
        frag = 1 - active / reserved
        print(f"  Fragmentation: {frag:.2%}")

If fragmentation exceeds roughly 30 % (i.e., less than ~70 % of reserved memory is actively used), restart the service or adjust the allocator:
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--enable-prefix-caching \
...

4. Prefix-caching OOM
Prefix caching keeps KV blocks for identical prompts. Low hit‑rate with many distinct prefixes can exhaust memory.
Inspect cache usage:
import requests

metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if "prefix_cache" in line:
        print(line)

Typical output:
vllm_prefix_cache_blocks_used 12345
vllm_prefix_cache_blocks_total 20000
vllm_prefix_cache_hit_rate 0.23

Solutions:
Disable prefix caching when hit‑rate < 0.3.
Limit cache size via --max-num-seqs or avoid enabling the feature.
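The hit‑rate check can be automated with a small helper (the function names are hypothetical; the metric name follows the /metrics output shown above):

```python
def prefix_cache_hit_rate(metrics_text):
    """Extract vllm_prefix_cache_hit_rate from Prometheus-style metrics text."""
    for line in metrics_text.splitlines():
        if line.startswith("vllm_prefix_cache_hit_rate"):
            return float(line.split()[-1])
    return None  # metric absent: prefix caching likely not enabled

def should_disable_prefix_cache(metrics_text, min_hit_rate=0.3):
    rate = prefix_cache_hit_rate(metrics_text)
    return rate is not None and rate < min_hit_rate

sample = "vllm_prefix_cache_blocks_used 12345\nvllm_prefix_cache_hit_rate 0.23"
print(should_disable_prefix_cache(sample))  # True -- 0.23 < 0.3
```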
Deep Dive into Fragmentation
Block table structure
Request A: [0, 1, 5, 9] # uses 4 blocks
Request B: [2, 3, 4, 6, 7, 8] # uses 6 blocks
Request C: [10, 11]          # uses 2 blocks

When Request A finishes, its blocks become free but are non-contiguous, which can affect CUDA copy efficiency, L2 cache hit rate, and TLB misses.
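A toy free-list model (illustrative only, not vLLM's actual allocator) shows how completed requests punch non-contiguous holes into the block pool:

```python
free = set(range(12))          # a 12-block KV-cache pool
tables = {}                    # request -> list of block ids

def alloc(req, n):
    blocks = sorted(free)[:n]  # take the lowest-numbered free blocks
    free.difference_update(blocks)
    tables[req] = blocks

def release(req):
    free.update(tables.pop(req))

alloc("A", 4)   # A -> [0, 1, 2, 3]
alloc("B", 6)   # B -> [4, 5, 6, 7, 8, 9]
alloc("C", 2)   # C -> [10, 11]
release("A")
release("C")
print(sorted(free))  # [0, 1, 2, 3, 10, 11] -- free, but split around B's live blocks
```

Because the block table adds a level of indirection, a new request can still use these scattered blocks; the non-locality is what degrades copy efficiency and cache behavior.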
Quantifying fragmentation
import requests

def parse_metrics(text):
    # Parse Prometheus-style "name value" lines into a dict
    d = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            parts = line.split()
            if len(parts) >= 2:
                try:
                    d[parts[0]] = float(parts[1])
                except ValueError:
                    pass
    return d

def analyze_fragmentation(vllm_url="http://localhost:8000"):
    metrics_dict = parse_metrics(requests.get(f"{vllm_url}/metrics").text)
    total = metrics_dict.get("vllm_num_gpu_blocks", 0)
    used = metrics_dict.get("vllm_num_gpu_blocks_used", 0)
    free = total - used
    running = metrics_dict.get("vllm_num_requests_running", 0)
    avg_blocks = used / max(running, 1)
    potential = free / max(avg_blocks, 1)
    print(f"Total blocks: {total}")
    print(f"Used blocks: {used}")
    print(f"Free blocks: {free}")
    print(f"Running requests: {running}")
    print(f"Avg blocks/request: {avg_blocks:.1f}")
    print(f"Potential new requests: {potential:.1f}")
    print(f"Memory utilization: {used/max(total, 1)*100:.1f}%")
    return {"total_blocks": total, "used_blocks": used,
            "free_blocks": free, "utilization": used/max(total, 1)}

Choosing block size
Default block size = 16 tokens. Trade‑offs:
Block Size | Advantages | Disadvantages
-----------|---------------------|-----------------
8 | Less internal waste| Larger block table, higher overhead
16 | Balanced | —
32         | Lower management cost| More internal waste

Guidelines:
Use small block size (8) when request lengths vary widely (e.g., 100 → 10 000 tokens).
Use default (16) or larger (32) when lengths are relatively uniform.
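The trade-off can be quantified for a hypothetical mixed workload (the request lengths below are assumptions for illustration):

```python
import math

# Internal fragmentation: unused tokens inside partially filled final blocks
def internal_waste(lengths, block_size):
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)

request_lengths = [100, 350, 1200, 4000, 10000]  # assumed token counts
for bs in (8, 16, 32):
    print(f"block_size={bs:>2}: {internal_waste(request_lengths, bs)} wasted tokens")
```

Smaller blocks waste fewer tokens but multiply the number of block-table entries the scheduler must track, which is the management overhead the table above refers to.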
Example to reduce internal fragmentation:
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--block-size 8 # less internal waste
...

Remember to adjust --gpu-memory-utilization accordingly, because block size changes the total KV-cache capacity.
Production‑Ready Configuration
Startup script
#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
MODEL_PATH="/data/models/Qwen2-72B-Instruct"
TP_SIZE=4
PORT=8000
GPU_MEMORY_UTILIZATION=0.88 # A100 80GB
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=64
MAX_NUM_BATCHED_TOKENS=24576
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--tensor-parallel-size $TP_SIZE \
--port $PORT \
--host 0.0.0.0 \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
--max-model-len $MAX_MODEL_LEN \
--max-num-seqs $MAX_NUM_SEQS \
--max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
--block-size 16 \
--swap-space 8 \
--disable-log-requests \
--enable-chunked-prefill \
--max-num-on-the-fly-batches 4 \
--dtype bfloat16 \
--trust-remote-code \
2>&1 | tee -a /var/log/vllm/vllm_$(date +%Y%m%d).log

Key parameters explained
--swap-space 8 : Enables swapping unused KV blocks to CPU (8 GB).
--enable-chunked-prefill : Splits long prompts into chunks to avoid a single huge prefill.
--max-num-on-the-fly-batches 4 : Controls concurrent GPU batches; higher values increase memory pressure.
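The idea behind chunked prefill can be sketched in a few lines; the chunk size here is an arbitrary illustration (vLLM bounds each chunk by its batched-token budget rather than a fixed constant):

```python
def chunk_prefill(prompt_tokens, chunk_size):
    # Process a long prompt as several bounded forward passes instead of one
    for start in range(0, len(prompt_tokens), chunk_size):
        yield prompt_tokens[start:start + chunk_size]

prompt = list(range(30000))            # a 30k-token prompt
sizes = [len(c) for c in chunk_prefill(prompt, 8192)]
print(sizes)  # [8192, 8192, 8192, 5424]
```

Each chunk's activation and KV memory stays bounded, so a single 30k-token prompt no longer produces one huge allocation spike.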
Dynamic tuning sidecar
import requests, subprocess, time

class VLLMTuner:
    def __init__(self, url="http://localhost:8000"):
        self.url = url
        self.history = []

    def get_metrics(self):
        try:
            r = requests.get(f"{self.url}/metrics", timeout=5)
            return self.parse_metrics(r.text)
        except requests.RequestException:
            return None

    def parse_metrics(self, txt):
        d = {}
        for line in txt.splitlines():
            if line and not line.startswith("#"):
                parts = line.split()
                if len(parts) >= 2:
                    d[parts[0]] = float(parts[1])
        return d

    def should_restart(self, m):
        # Restart when 3+ pressure events accumulate within 5 minutes
        if not m:
            return False
        if m.get("vllm_gpu_cache_usage_perc", 0) > 0.95:
            self.history.append(("high_memory", time.time()))
        if m.get("vllm_num_requests_waiting", 0) > m.get("vllm_num_requests_running", 0) * 2:
            self.history.append(("queue_backlog", time.time()))
        recent = [h for h in self.history if time.time() - h[1] < 300]
        return len(recent) >= 3

    def graceful_restart(self):
        print("Initiating graceful restart...")
        for _ in range(60):  # wait up to 60 s for in-flight requests to drain
            m = self.get_metrics()
            if m and m.get("vllm_num_requests_running", 0) == 0:
                break
            time.sleep(1)
        subprocess.run(["systemctl", "restart", "vllm"])
        print("Restart completed")

    def log_status(self, m):
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] "
              f"Running:{m.get('vllm_num_requests_running', 0)} "
              f"Waiting:{m.get('vllm_num_requests_waiting', 0)} "
              f"GPU Cache:{m.get('vllm_gpu_cache_usage_perc', 0)*100:.1f}%")

    def run(self):
        while True:
            m = self.get_metrics()
            if m:
                self.log_status(m)
                if self.should_restart(m):
                    self.graceful_restart()
                    self.history = []
            time.sleep(30)

if __name__ == "__main__":
    VLLMTuner().run()

Troubleshooting Flowchart
OOM occurs
│
├─→ Startup OOM?
│ ├─ Yes → lower --gpu-memory-utilization
│ └─ No ↓
│
├─→ Burst OOM?
│ ├─ Yes → check for ultra‑long requests, lower --max-num-batched-tokens
│ └─ No ↓
│
├─→ Chronic OOM?
│ ├─ Yes → inspect CUDA fragmentation, consider periodic restart
│ └─ No ↓
│
└─→ Prefix-caching OOM? → evaluate hit-rate, disable or cap the cache

Common diagnostic commands
nvidia-smi pmon -s um -d 1 – per-process GPU memory usage.
python -c "import torch; print(torch.cuda.memory_summary())" – detailed CUDA allocator stats.
curl -s http://localhost:8000/metrics | grep -E "(vllm_num|vllm_gpu)" – vLLM metrics.
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv' – real-time GPU monitor.
grep -i "oom\|memory\|cuda" /var/log/vllm/vllm_*.log | tail -50 – recent OOM-related log entries.
Emergency recovery
# Kill stray vLLM processes
pkill -9 -f "vllm.entrypoints"
# Reset GPU memory (use with caution)
nvidia-smi --gpu-reset -i 0,1,2,3
# Clear CUDA cache
python -c "import torch; torch.cuda.empty_cache()"
# Restart with conservative config
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.75 \
--max-num-seqs 16 \
--max-num-batched-tokens 8192

Monitoring & Alerting
Prometheus alerts (example)
groups:
- name: vllm
rules:
- alert: VLLMHighMemoryUsage
expr: vllm_gpu_cache_usage_perc > 0.90
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM GPU cache usage is high"
description: "GPU cache usage is {{ $value | humanizePercentage }}"
- alert: VLLMCriticalMemoryUsage
expr: vllm_gpu_cache_usage_perc > 0.95
for: 2m
labels:
severity: critical
annotations:
summary: "vLLM GPU cache usage is critical"
- alert: VLLMQueueBacklog
expr: vllm_num_requests_waiting > vllm_num_requests_running * 3
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM request queue is backing up"
- alert: VLLMNoRequests
expr: rate(vllm_request_success_total[5m]) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "vLLM is not processing any requests"

Grafana dashboard (core panels)
{
"title": "vLLM Performance",
"panels": [
{"title":"GPU Memory Usage","type":"gauge","targets":[{"expr":"vllm_gpu_cache_usage_perc * 100"}]},
{"title":"Request Queue","type":"timeseries","targets":[{"expr":"vllm_num_requests_running","legendFormat":"Running"},{"expr":"vllm_num_requests_waiting","legendFormat":"Waiting"}]},
{"title":"Token Throughput","type":"timeseries","targets":[{"expr":"rate(vllm_generation_tokens_total[1m])","legendFormat":"tokens/s"}]},
{"title":"Request Latency P99","type":"timeseries","targets":[{"expr":"histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m]))","legendFormat":"P99"}]}
]
}

Conclusion
PagedAttention reduces KV-cache waste but introduces internal and external fragmentation that can trigger OOM in production. Preventive measures include tuning --gpu-memory-utilization and --max-num-batched-tokens, monitoring GPU cache usage and fragmentation, and applying the concrete configuration and restart strategies described above. Continuous profiling and adjustment are required because optimal settings depend on workload characteristics and hardware.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.