vLLM Production Pitfalls: The Ultimate Fix for PagedAttention Memory Fragmentation and OOM

This article analyzes why vLLM's PagedAttention can cause GPU memory fragmentation and out‑of‑memory errors in production, presents four typical OOM scenarios, and provides concrete diagnostics, configuration tweaks, code examples, and monitoring strategies to eliminate the problem.

Raymond Ops

Overview

vLLM introduced PagedAttention to reduce KV cache memory waste by managing the cache in fixed‑size blocks (16 tokens by default). In production, however, memory fragmentation and out‑of‑memory (OOM) errors still appear; the sections below explain why and how to mitigate them.

Why fragmentation occurs

Traditional KV cache pre‑allocates

max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size

per request, wasting memory whenever the actual token count is lower. PagedAttention instead splits the KV cache into blocks, but this introduces two kinds of fragmentation:

Internal fragmentation: a request of 17 tokens needs two 16‑token blocks, but the second block holds only 1 of its 16 slots, so 93.75 % of that block is wasted (see the sketch below).

External fragmentation: mixing short and long requests leaves gaps that prevent a new request from obtaining the required number of blocks, even though the kernel can handle non‑contiguous memory.
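
To make the arithmetic concrete, the following sketch evaluates the pre‑allocation formula and the 17‑token example above. The model dimensions are illustrative assumptions, not tied to any particular model.

import math

# Illustrative assumptions: 4K context, 32 layers, 32 heads, head_dim 128, bf16 (2 bytes).
max_seq_len, num_layers, num_heads, head_dim, dtype_size = 4096, 32, 32, 128, 2
block_size = 16  # tokens per PagedAttention block (vLLM default)

# Traditional pre-allocation: reserve keys and values for max_seq_len tokens per request.
per_request = max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size
print(f"Pre-allocated per request: {per_request / 1024**3:.2f} GiB")

# PagedAttention: a 17-token request occupies ceil(17 / 16) = 2 blocks.
tokens = 17
blocks = math.ceil(tokens / block_size)
unused = blocks * block_size - tokens  # 15 unused slots in the second block
print(f"{blocks} blocks, {unused} unused slots ({unused / block_size:.2%} of the last block wasted)")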

Four typical OOM scenarios

Startup OOM

At startup, vLLM pre‑allocates as many KV cache blocks as --gpu-memory-utilization allows, before any request arrives. If the target is too aggressive, or other processes already hold GPU memory, this allocation itself triggers OOM. Example error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 79.35 GiB total capacity; 75.21 GiB already allocated; 1.83 GiB free; 76.50 GiB reserved)

Solution: limit --gpu-memory-utilization (e.g., 0.85) or enable CPU offload.
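
A minimal sketch of the same setting through vLLM's offline Python API; the model path is a placeholder, and swap_space is the CPU swap option also used in the startup script later in this article.

from vllm import LLM

# Leave headroom for weights, activations and other processes by capping the
# fraction of GPU memory vLLM may claim; swap_space (GiB of CPU memory) gives
# the scheduler somewhere to evict KV cache blocks instead of failing outright.
llm = LLM(
    model="/data/models/your-model",  # placeholder path
    gpu_memory_utilization=0.85,
    swap_space=4,
)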

Sudden runtime OOM

Occurs when a burst of long prompts or many concurrent prefill requests exceeds the available KV cache capacity. Example log:

2024-03-15 14:32:45 ERROR vllm.worker: CUDA out of memory during forward pass
2024-03-15 14:32:45 ERROR vllm.engine: Request req_12345 failed: OOM

Solution: limit --max-num-batched-tokens (e.g., 16384) and adjust --max-num-seqs to control batch size.
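
A rough sanity check for these limits is to estimate how much KV cache a fully packed batch needs; the model dimensions below are illustrative assumptions (a 70B‑class model with grouped‑query attention).

# Illustrative dimensions: 80 layers, 8 KV heads, head_dim 128, bf16 (2 bytes).
num_layers, num_kv_heads, head_dim, dtype_size = 80, 8, 128, 2
max_num_batched_tokens = 16384

# Each token stores one key and one value vector per layer.
kv_bytes_per_token = num_layers * num_kv_heads * head_dim * 2 * dtype_size
batch_kv_gib = max_num_batched_tokens * kv_bytes_per_token / 1024**3
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB, "
      f"worst-case batch: {batch_kv_gib:.1f} GiB")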

Chronic OOM after days

Long‑running services accumulate CUDA allocator fragmentation because vLLM relies on PyTorch's caching allocator. A quick diagnostic is to print torch.cuda.memory_summary() for each GPU and compute a fragmentation ratio (allocated / reserved memory); a ratio below roughly 70 % indicates severe fragmentation.

import torch
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_summary(i)}")

Solutions:

Periodic service restart (e.g., daily CronJob).

Set PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" to allow expandable segments (see the sketch after this list).

Enable vLLM’s built‑in memory compaction (v0.5.0+).
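
The expandable_segments setting only takes effect if it is present before PyTorch initializes CUDA. A minimal sketch for a Python launcher (exporting it in the shell, as the startup script below does, works equally well):

import os

# Must be set before the first CUDA allocation (in practice, before importing
# torch in most entrypoints); otherwise the allocator config is ignored.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (deliberately imported after the env var is set)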

Prefix caching OOM

Prefix caching keeps KV cache blocks alive for repeated prompt prefixes. With many distinct prefixes, the cache fills with blocks that are rarely reused and crowds out space for active requests. Example metrics query:

import requests
response = requests.get("http://localhost:8000/metrics")
for line in response.text.splitlines():
    if "prefix_cache" in line:
        print(line)

Typical output:

vllm_prefix_cache_blocks_used 12345
vllm_prefix_cache_blocks_total 20000
vllm_prefix_cache_hit_rate 0.23

Solutions:

Disable prefix caching when the hit rate stays below 0.3 (omit --enable-prefix-caching); a sketch of this check follows the list.

Limit cache size indirectly by reducing --max-num-seqs.
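
A small sketch of that hit‑rate check, using the metric names from the example output above and the 0.3 threshold as an assumed cut‑off:

import requests

def prefix_cache_hit_rate(url="http://localhost:8000/metrics"):
    """Return the prefix cache hit rate parsed from the Prometheus text output."""
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith("vllm_prefix_cache_hit_rate"):
            return float(line.split()[-1])
    return None

rate = prefix_cache_hit_rate()
if rate is not None and rate < 0.3:
    print(f"Hit rate {rate:.2f} is below 0.3: consider dropping --enable-prefix-caching")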

Deep analysis of memory fragmentation

Block table structure

Request A: [0, 1, 5, 9]   # 4 blocks
Request B: [2, 3, 4, 6, 7, 8]   # 6 blocks
Request C: [10, 11]   # 2 blocks

When a request finishes, its blocks are returned to the free pool, but they may be scattered across physical memory; requests that later reuse them touch non‑contiguous regions, which hurts cache efficiency.
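
The toy allocator below illustrates how this happens: block IDs are handed out from a free list, and once a few requests finish, the free blocks are no longer contiguous. It is a simplified model of the block‑table idea, not vLLM's actual allocator.

class ToyBlockAllocator:
    """Simplified paged KV cache: blocks are just integer IDs in a free list."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # initially one contiguous run
        self.tables = {}                     # request id -> list of block IDs

    def allocate(self, req_id, num_blocks):
        self.tables[req_id], self.free = self.free[:num_blocks], self.free[num_blocks:]

    def release(self, req_id):
        self.free.extend(self.tables.pop(req_id))

alloc = ToyBlockAllocator(12)
alloc.allocate("A", 4)  # blocks 0-3
alloc.allocate("B", 6)  # blocks 4-9
alloc.allocate("C", 2)  # blocks 10-11
alloc.release("A")
alloc.release("C")
print(alloc.free)  # [0, 1, 2, 3, 10, 11] -- free, but no longer contiguous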

Impact of non‑contiguous blocks

Reduced CUDA memory copy efficiency.

Lower L2 cache hit rate.

Increased TLB misses.

Quantifying fragmentation

A small helper such as print_cuda_memory_stats can report allocated and reserved memory per GPU and derive a fragmentation ratio (allocated / reserved). Example snippet:

import torch
for i in range(torch.cuda.device_count()):
    a, r = torch.cuda.memory_allocated(i), torch.cuda.memory_reserved(i)
    print(f"GPU {i}: fragmentation ratio (allocated/reserved) = {a / max(r, 1):.1%}")

If this ratio falls below roughly 70 %, the allocator is heavily fragmented: much of the reserved memory sits in segments too small to be reused.

Choosing block size

8‑token blocks: minimal internal fragmentation, but a larger block table and higher management overhead.

16‑token blocks (default): a balanced trade‑off.

32‑token blocks: lower management overhead, but more internal fragmentation.

Recommendation: use 8‑token blocks for workloads with highly variable context lengths; otherwise keep 16 or increase to 32 for uniform workloads.
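
One way to pick a block size for your own traffic is to replay observed sequence lengths and measure the internal fragmentation each choice would produce; the length distribution below is a made‑up placeholder.

import math, random

random.seed(0)
# Placeholder workload: substitute real sequence lengths from your access logs.
seq_lens = [random.randint(50, 4000) for _ in range(10_000)]

for block_size in (8, 16, 32):
    allocated = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    used = sum(seq_lens)
    waste = (allocated - used) / allocated
    print(f"block_size={block_size:>2}: internal fragmentation {waste:.2%}")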

Production‑ready configuration

Startup script

#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
MODEL_PATH="/data/models/Qwen2-72B-Instruct"
TP_SIZE=4
PORT=8000
GPU_MEMORY_UTILIZATION=0.88
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=64
MAX_NUM_BATCHED_TOKENS=24576
python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --tensor-parallel-size $TP_SIZE \
    --port $PORT \
    --host 0.0.0.0 \
    --gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
    --max-model-len $MAX_MODEL_LEN \
    --max-num-seqs $MAX_NUM_SEQS \
    --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
    --block-size 16 \
    --swap-space 8 \
    --disable-log-requests \
    --enable-chunked-prefill \
    --dtype bfloat16 \
    --trust-remote-code \
    2>&1 | tee -a /var/log/vllm/vllm_$(date +%Y%m%d).log

Key parameters explained

--swap-space 8: allocates 8 GiB of CPU swap space for the KV cache, allowing low‑priority requests to be swapped out instead of failing.

--enable-chunked-prefill: splits long prompts into chunks, preventing a single prefill from exhausting GPU memory (introduced in v0.4.0).

--max-num-seqs 64 and --max-num-batched-tokens 24576: bound how many sequences run concurrently and how many tokens are batched per scheduler step; raising either increases memory pressure.

Dynamic configuration sidecar

A Python sidecar monitors vLLM metrics and triggers a graceful restart when GPU cache usage stays above 95 % or the request backlog exceeds twice the running count for more than five minutes.

import requests, time, subprocess

class VLLMTuner:
    def __init__(self, url="http://localhost:8000"):
        self.url = url
        self.history = []

    def get_metrics(self):
        try:
            r = requests.get(f"{self.url}/metrics", timeout=5)
            return self.parse_metrics(r.text)
        except Exception:
            return None

    def parse_metrics(self, text):
        metrics = {}
        for line in text.splitlines():
            if line and not line.startswith("#"):
                parts = line.split()
                if len(parts) >= 2:
                    metrics[parts[0]] = float(parts[1])
        return metrics

    def should_restart(self, m):
        if not m:
            return False
        gpu = m.get("vllm_gpu_cache_usage_perc", 0)
        if gpu > 0.95:
            self.history.append(("high_memory", time.time()))
        waiting = m.get("vllm_num_requests_waiting", 0)
        running = m.get("vllm_num_requests_running", 0)
        if waiting > running * 2:
            self.history.append(("queue_backlog", time.time()))
        recent = [h for h in self.history if time.time() - h[1] < 300]
        return len(recent) >= 3

    def graceful_restart(self):
        print("Initiating graceful restart...")
        # Drain traffic via load balancer (implementation‑specific)
        for _ in range(60):
            m = self.get_metrics()
            if m and m.get("vllm_num_requests_running", 0) == 0:
                break
            time.sleep(1)
        subprocess.run(["systemctl", "restart", "vllm"])
        print("Restart completed")

    def run(self):
        while True:
            m = self.get_metrics()
            if m:
                self.log_status(m)
            if self.should_restart(m):
                self.graceful_restart()
                self.history = []
            time.sleep(30)

    def log_status(self, m):
        print(f"[{{time.strftime('%Y-%m-%d %H:%M:%S')}}] Running:{m.get('vllm_num_requests_running',0)} "
              f"Waiting:{m.get('vllm_num_requests_waiting',0)} "
              f"GPU Cache:{m.get('vllm_gpu_cache_usage_perc',0)*100:.1f}%")

if __name__ == "__main__":
    VLLMTuner().run()

Monitoring and alerting

Prometheus alerts

# High GPU cache usage
groups:
- name: vllm
  rules:
  - alert: VLLMHighMemoryUsage
    expr: vllm_gpu_cache_usage_perc > 0.90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM GPU cache usage is high"
      description: "GPU cache usage is {{ $value | humanizePercentage }}"

  - alert: VLLMCriticalMemoryUsage
    expr: vllm_gpu_cache_usage_perc > 0.95
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "vLLM GPU cache usage is critical"

  - alert: VLLMQueueBacklog
    expr: vllm_num_requests_waiting > vllm_num_requests_running * 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM request queue is backing up"

  - alert: VLLMNoRequests
    expr: rate(vllm_request_success_total[5m]) == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "vLLM is not processing any requests"

Grafana dashboard (JSON snippet)

{
  "title": "vLLM Performance",
  "panels": [
    {
      "title": "GPU Memory Usage",
      "type": "gauge",
      "targets": [{"expr": "vllm_gpu_cache_usage_perc * 100"}],
      "thresholds": {"steps": [{"value": 0, "color": "green"}, {"value": 80, "color": "yellow"}, {"value": 90, "color": "red"}]}
    },
    {
      "title": "Request Queue",
      "type": "timeseries",
      "targets": [
        {"expr": "vllm_num_requests_running", "legendFormat": "Running"},
        {"expr": "vllm_num_requests_waiting", "legendFormat": "Waiting"}
      ]
    }
  ]
}

Conclusion

PagedAttention is a powerful design, yet production deployments must address both internal and external memory fragmentation. Core mitigation strategies are:

Set conservative --gpu-memory-utilization and --max-num-batched-tokens values.

Continuously monitor GPU cache usage, request queue length, and fragmentation metrics.

Perform graceful restarts when thresholds are breached.

Adjust block size and swap space based on workload characteristics.

Ongoing tuning based on actual request patterns is essential for stable, high‑throughput LLM serving.

References

vLLM official documentation: https://docs.vllm.ai/

PagedAttention paper: https://arxiv.org/abs/2309.06180

PyTorch CUDA memory management: https://pytorch.org/docs/stable/notes/cuda.html

vLLM GitHub issues on OOM: https://github.com/vllm-project/vllm/issues
