vLLM Production Pitfalls: The Ultimate Fix for PagedAttention Memory Fragmentation and OOM
This article analyzes why vLLM's PagedAttention can cause GPU memory fragmentation and out‑of‑memory errors in production, presents four typical OOM scenarios, and provides concrete diagnostics, configuration tweaks, code examples, and monitoring strategies to eliminate the problem.
Overview
vLLM introduced PagedAttention to reduce KV cache memory waste by allocating the cache in fixed-size blocks (16 tokens by default). In production, however, memory fragmentation and out-of-memory (OOM) errors still appear.
Why fragmentation occurs
A traditional KV cache pre-allocates
max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size
bytes per request, causing waste whenever the actual token count is lower. PagedAttention splits the KV cache into blocks instead, but this introduces two kinds of fragmentation (a sizing sketch follows the list below):
Internal fragmentation: a request of 17 tokens needs two 16-token blocks; the second block uses only 1 of its 16 slots (93.75 % wasted).
External fragmentation: mixing short and long requests leaves gaps that can prevent a new request from obtaining the required number of blocks, even though the attention kernel can handle non-contiguous memory.
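To make this concrete, here is a quick sizing sketch that applies the formula above. The model dimensions are assumptions for illustration (roughly a Llama-7B-class model); substitute your own model's values.
import math

# Illustrative model dimensions (assumed); adjust for your model.
num_layers = 32
num_kv_heads = 32
head_dim = 128
dtype_size = 2        # bytes per element for fp16/bf16
block_size = 16       # tokens per PagedAttention block

# KV cache bytes per token: keys and values for every layer.
bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * dtype_size
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")   # ~512 KiB

# A 17-token request rounds up to 2 blocks (32 token slots).
tokens = 17
blocks = math.ceil(tokens / block_size)
unused = blocks * block_size - tokens
print(f"blocks={blocks}, unused slots={unused}, "
      f"waste in last block={(block_size - tokens % block_size) / block_size:.2%}")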
Four typical OOM scenarios
Startup OOM
vLLM pre-allocates as many KV cache blocks as possible before serving any request; if the model weights or other processes leave too little headroom, this pre-allocation alone exhausts GPU memory. Example error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 79.35 GiB total capacity; 75.21 GiB already allocated; 1.83 GiB free; 76.50 GiB reserved)
Solution: limit --gpu-memory-utilization (e.g., 0.85) or enable CPU offload.
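The same limits can be exercised from the offline Python API before rolling out a server change. A minimal sketch, assuming the model path and tensor parallelism used in the startup script later in this article:
from vllm import LLM

# Sketch: start the engine with conservative memory settings.
# gpu_memory_utilization caps the fraction of GPU memory vLLM may claim
# (weights + activations + KV cache); swap_space reserves CPU memory (GiB)
# so preempted requests can be swapped out instead of triggering OOM.
llm = LLM(
    model="/data/models/Qwen2-72B-Instruct",   # assumed path, reused from the startup script below
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85,
    swap_space=8,
    max_model_len=32768,
)
print(llm.generate(["Hello"])[0].outputs[0].text)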
Sudden runtime OOM
Occurs when a burst of long prompts or many concurrent prefill requests exceeds the available KV cache capacity. Example log:
2024-03-15 14:32:45 ERROR vllm.worker: CUDA out of memory during forward pass
2024-03-15 14:32:45 ERROR vllm.engine: Request req_12345 failed: OOM
Solution: limit --max-num-batched-tokens (e.g., 16384) and adjust --max-num-seqs to control batch size.
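To pick sane values for --max-num-batched-tokens and --max-num-seqs, it helps to estimate how many tokens the KV cache can actually hold. A back-of-the-envelope sketch; all numbers below are assumptions for illustration:
# Rough KV cache capacity estimate (illustrative numbers).
gpu_total_gib = 80                # e.g. an 80 GiB GPU
weights_gib = 40                  # memory taken by model weights on this GPU
utilization = 0.88                # --gpu-memory-utilization
kv_bytes_per_token = 512 * 1024   # from the sizing sketch above

kv_budget_bytes = (gpu_total_gib * utilization - weights_gib) * 1024**3
capacity_tokens = int(kv_budget_bytes / kv_bytes_per_token)
print(f"KV cache capacity: ~{capacity_tokens:,} tokens")

# With ~2,000 live tokens per request on average, this bounds safe concurrency:
avg_tokens_per_req = 2000
print(f"~{capacity_tokens // avg_tokens_per_req} concurrent requests before the cache is full")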
Chronic OOM after days
Long-running services accumulate fragmentation in PyTorch's CUDA caching allocator, on which vLLM relies. A diagnostic script can print torch.cuda.memory_summary for each GPU and compute a fragmentation percentage from allocated versus reserved memory; values below 70 % indicate severe fragmentation (see "Quantifying fragmentation" below).
import torch

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_summary(i)}")
Solutions:
Periodic service restart (e.g., daily CronJob).
Set PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" to allow expandable segments.
Enable vLLM’s built‑in memory compaction (v0.5.0+).
Prefix caching OOM
Prefix caching retains KV cache blocks for repeated prompt prefixes. With many distinct prefixes, cached blocks keep accumulating and squeeze out space for active requests. Example metrics query:
import requests

response = requests.get("http://localhost:8000/metrics")
for line in response.text.split("\n"):
    if "prefix_cache" in line:
        print(line)
Typical output:
vllm_prefix_cache_blocks_used 12345
vllm_prefix_cache_blocks_total 20000
vllm_prefix_cache_hit_rate 0.23
Solutions:
Disable prefix caching when hit rate < 0.3 (omit --enable-prefix-caching).
Limit cache size indirectly by reducing --max-num-seqs.
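A small helper along these lines can turn that hit-rate rule of thumb into an automated check. The metric names match the sample output above and are otherwise an assumption:
import requests

def prefix_cache_report(url="http://localhost:8000/metrics", min_hit_rate=0.3):
    """Parse prefix-cache metrics and flag a low hit rate (metric names assumed)."""
    stats = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith("vllm_prefix_cache"):
            name, value = line.rsplit(" ", 1)
            stats[name] = float(value)
    hit_rate = stats.get("vllm_prefix_cache_hit_rate", 0.0)
    used = stats.get("vllm_prefix_cache_blocks_used", 0)
    total = stats.get("vllm_prefix_cache_blocks_total", 1)
    print(f"hit rate {hit_rate:.2f}, blocks {used:.0f}/{total:.0f}")
    if hit_rate < min_hit_rate:
        print("Hit rate is low; consider dropping --enable-prefix-caching")

prefix_cache_report()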
Deep analysis of memory fragmentation
Block table structure
Request A: [0, 1, 5, 9] # 4 blocks
Request B: [2, 3, 4, 6, 7, 8] # 6 blocks
Request C: [10, 11]             # 2 blocks
When a request finishes, its blocks become free but may be non-contiguous, affecting cache efficiency.
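The effect is easy to see with a toy free-list simulation. This is not vLLM's actual allocator, just an illustration of how freed block IDs end up scattered:
# Toy illustration (not vLLM's allocator): blocks are handed out in order,
# and freeing a finished request leaves non-contiguous holes in the pool.
free_blocks = list(range(12))           # 12 physical blocks, all free
block_tables = {}

def allocate(request_id, n_blocks):
    block_tables[request_id] = [free_blocks.pop(0) for _ in range(n_blocks)]

def release(request_id):
    free_blocks.extend(block_tables.pop(request_id))

allocate("A", 4)   # A -> [0, 1, 2, 3]
allocate("B", 6)   # B -> [4, 5, 6, 7, 8, 9]
allocate("C", 2)   # C -> [10, 11]
release("A")       # blocks 0-3 return to the pool
allocate("D", 3)   # D reuses part of A's freed range: [0, 1, 2]
release("C")       # free blocks are now 3, 10, 11 -- non-contiguous
print(block_tables)            # {'B': [4, 5, 6, 7, 8, 9], 'D': [0, 1, 2]}
print(sorted(free_blocks))     # [3, 10, 11]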
Impact of non‑contiguous blocks
Reduced CUDA memory copy efficiency.
Lower L2 cache hit rate.
Increased TLB misses.
Quantifying fragmentation
The script print_cuda_memory_stats prints allocated, reserved, and free memory for each GPU and derives a fragmentation percentage from them. Example snippet:
import torch

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_summary(i)}")
If the fragmentation metric (allocated / reserved) is below 70 %, the allocator is heavily fragmented.
Choosing block size
8-token blocks: minimal internal fragmentation, but a larger block table and higher management overhead.
16-token blocks (default): a balanced trade-off.
32-token blocks: lower management overhead, but more internal fragmentation.
Recommendation: use 8‑token blocks for workloads with highly variable context lengths; otherwise keep 16 or increase to 32 for uniform workloads.
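The trade-off can be sanity-checked against your own traffic. A sketch that estimates average waste for a sample of request lengths (the sample below is made up):
import math

def avg_waste(lengths, block_size):
    """Average fraction of allocated KV slots left unused for a set of request lengths."""
    wasted = sum(math.ceil(n / block_size) * block_size - n for n in lengths)
    allocated = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return wasted / allocated

# Made-up sample of total tokens (prompt + output) per request.
sample_lengths = [120, 350, 47, 2900, 610, 95, 1800, 33]
for bs in (8, 16, 32):
    print(f"block_size={bs:>2}: {avg_waste(sample_lengths, bs):.2%} of allocated slots wasted")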
Production‑ready configuration
Startup script
#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
MODEL_PATH="/data/models/Qwen2-72B-Instruct"
TP_SIZE=4
PORT=8000
GPU_MEMORY_UTILIZATION=0.88
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=64
MAX_NUM_BATCHED_TOKENS=24576
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--tensor-parallel-size $TP_SIZE \
--port $PORT \
--host 0.0.0.0 \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
--max-model-len $MAX_MODEL_LEN \
--max-num-seqs $MAX_NUM_SEQS \
--max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
--block-size 16 \
--swap-space 8 \
--disable-log-requests \
--enable-chunked-prefill \
--max-num-on-the-fly-batches 4 \
--dtype bfloat16 \
--trust-remote-code \
2>&1 | tee -a /var/log/vllm/vllm_$(date +%Y%m%d).log
Key parameters explained
--swap-space 8: allocates an 8 GB CPU swap for KV cache, allowing low-priority requests to be swapped out.
--enable-chunked-prefill: splits long prompts into chunks, preventing a single prefill from exhausting GPU memory (introduced in v0.4.0).
--max-num-on-the-fly-batches 4: limits the number of concurrent GPU batches; higher values increase memory pressure.
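For intuition on how --enable-chunked-prefill and --max-num-batched-tokens work together, a tiny sketch of how a long prompt is spread across scheduler steps (numbers are illustrative; decode requests also share the per-step token budget in practice):
import math

# Illustrative only: with chunked prefill, one scheduler step processes at most
# max_num_batched_tokens tokens, so a long prompt is prefilled over several steps
# instead of causing one huge allocation spike.
max_num_batched_tokens = 24576
prompt_tokens = 30000

steps = math.ceil(prompt_tokens / max_num_batched_tokens)
print(f"A {prompt_tokens}-token prompt is prefilled over {steps} steps "
      f"of at most {max_num_batched_tokens} tokens each")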
Dynamic configuration sidecar
A Python sidecar monitors vLLM metrics and triggers a graceful restart when GPU cache usage stays above 95 % or the request backlog exceeds twice the running count for more than five minutes.
import requests, time, subprocess

class VLLMTuner:
    def __init__(self, url="http://localhost:8000"):
        self.url = url
        self.history = []
    def get_metrics(self):
        try:
            r = requests.get(f"{self.url}/metrics", timeout=5)
            return self.parse_metrics(r.text)
        except requests.RequestException:
            return None
    def parse_metrics(self, text):
        metrics = {}
        for line in text.split("\n"):
            if line and not line.startswith("#"):
                parts = line.split()
                if len(parts) >= 2:
                    metrics[parts[0]] = float(parts[1])
        return metrics
    def should_restart(self, m):
        if not m:
            return False
        gpu = m.get("vllm_gpu_cache_usage_perc", 0)
        if gpu > 0.95:
            self.history.append(("high_memory", time.time()))
        waiting = m.get("vllm_num_requests_waiting", 0)
        running = m.get("vllm_num_requests_running", 0)
        if waiting > running * 2:
            self.history.append(("queue_backlog", time.time()))
        recent = [h for h in self.history if time.time() - h[1] < 300]
        return len(recent) >= 3
    def graceful_restart(self):
        print("Initiating graceful restart...")
        # Drain traffic via load balancer (implementation-specific)
        for _ in range(60):
            m = self.get_metrics()
            if m and m.get("vllm_num_requests_running", 0) == 0:
                break
            time.sleep(1)
        subprocess.run(["systemctl", "restart", "vllm"])
        print("Restart completed")
    def run(self):
        while True:
            m = self.get_metrics()
            if m:
                self.log_status(m)
            if self.should_restart(m):
                self.graceful_restart()
                self.history = []
            time.sleep(30)
    def log_status(self, m):
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] "
              f"Running:{m.get('vllm_num_requests_running', 0)} "
              f"Waiting:{m.get('vllm_num_requests_waiting', 0)} "
              f"GPU Cache:{m.get('vllm_gpu_cache_usage_perc', 0) * 100:.1f}%")

if __name__ == "__main__":
    VLLMTuner().run()
Monitoring and alerting
Prometheus alerts
# High GPU cache usage
groups:
  - name: vllm
    rules:
      - alert: VLLMHighMemoryUsage
        expr: vllm_gpu_cache_usage_perc > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM GPU cache usage is high"
          description: "GPU cache usage is {{ $value | humanizePercentage }}"
      - alert: VLLMCriticalMemoryUsage
        expr: vllm_gpu_cache_usage_perc > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "vLLM GPU cache usage is critical"
      - alert: VLLMQueueBacklog
        expr: vllm_num_requests_waiting > vllm_num_requests_running * 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue is backing up"
      - alert: VLLMNoRequests
        expr: rate(vllm_request_success_total[5m]) == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "vLLM is not processing any requests"
Grafana dashboard (JSON snippet)
{
"title": "vLLM Performance",
"panels": [
{
"title": "GPU Memory Usage",
"type": "gauge",
"targets": [{"expr": "vllm_gpu_cache_usage_perc * 100"}],
"thresholds": {"steps": [{"value": 0, "color": "green"}, {"value": 80, "color": "yellow"}, {"value": 90, "color": "red"}]}
},
{
"title": "Request Queue",
"type": "timeseries",
"targets": [
{"expr": "vllm_num_requests_running", "legendFormat": "Running"},
{"expr": "vllm_num_requests_waiting", "legendFormat": "Waiting"}
]
}
]
}
Conclusion
PagedAttention is a powerful design, yet production deployments must address both internal and external memory fragmentation. Core mitigation strategies are:
Set conservative --gpu-memory-utilization and --max-num-batched-tokens values.
Continuously monitor GPU cache usage, request queue length, and fragmentation metrics.
Perform graceful restarts when thresholds are breached.
Adjust block size and swap space based on workload characteristics.
Ongoing tuning based on actual request patterns is essential for stable, high‑throughput LLM serving.
References
vLLM official documentation: https://docs.vllm.ai/
PagedAttention paper: https://arxiv.org/abs/2309.06180
PyTorch CUDA memory management: https://pytorch.org/docs/stable/notes/cuda.html
vLLM GitHub issues on OOM: https://github.com/vllm-project/vllm/issues
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.