Taming vLLM OOM: Real‑World Causes and Proven Fixes for Production
This article examines why vLLM experiences out‑of‑memory errors in production, explains memory fragmentation caused by PagedAttention, outlines four typical OOM scenarios with concrete command‑line solutions, and provides deep analysis, configuration scripts, dynamic tuning, troubleshooting flowcharts, monitoring alerts, and best‑practice recommendations.
Overview
vLLM introduced PagedAttention in 2023, which partitions KV cache into fixed‑size blocks to reduce memory waste. In production, however, fragmentation and OOM become serious problems, especially under high concurrency and variable‑length requests.
Why fragmentation occurs
Traditional KV cache management pre‑allocates
max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size bytes per request (the factor of 2 covers the K and V tensors). If the actual token count is far smaller, that memory is wasted, and heterogeneous request sizes make batching difficult.
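To make the waste concrete, here is a minimal sketch of the formula above; the model dimensions are illustrative (roughly a 72B‑class model using grouped‑query attention, where num_heads means the KV‑head count), not taken from any specific config file:

```python
def kv_cache_bytes(max_seq_len, num_layers, num_heads, head_dim, dtype_size=2):
    # 2x for the K and V tensors; dtype_size=2 assumes fp16/bf16
    return max_seq_len * num_layers * num_heads * head_dim * 2 * dtype_size

# Illustrative dimensions: 80 layers, 8 KV heads, head_dim 128, 32k context
per_request = kv_cache_bytes(32768, 80, 8, 128)
print(f"{per_request / 1024**3:.1f} GB pre-allocated per request")  # 10.0 GB
```

If the request actually produces only a few hundred tokens, nearly all of that reservation sits idle, which is exactly the waste PagedAttention targets.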
PagedAttention splits KV cache into 16‑token blocks and allocates them dynamically, but two fragmentation patterns appear:
Internal fragmentation: a request that uses only part of a block leaves most of that block unused (e.g., 17 tokens need two blocks, and the second block is 93.75 % empty).
External fragmentation: mixing short and long requests can leave gaps that prevent a new long request from finding a contiguous set of blocks.
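The 93.75 % figure can be verified in a couple of lines (block size 16 is vLLM's default):

```python
import math

def block_usage(num_tokens, block_size=16):
    # Blocks needed, and the unused fraction of the final block
    blocks = math.ceil(num_tokens / block_size)
    tokens_in_last = num_tokens - (blocks - 1) * block_size
    waste_in_last = 1 - tokens_in_last / block_size
    return blocks, waste_in_last

blocks, waste = block_usage(17)
print(blocks, f"{waste:.2%}")  # 2 blocks, second block 93.75% empty
```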
Environment
Component | Version
----------|--------
vLLM      | 0.5.4 / 0.6.x
CUDA      | 12.1
PyTorch   | 2.3.0
GPU       | A100 80GB / A800 80GB
Model     | Qwen2-72B-Instruct

Four Typical OOM Scenarios
1. Startup OOM
vLLM loads model weights and pre‑allocates the KV‑cache block pool. The default --gpu-memory-utilization of 0.90 can fill the GPU, leaving insufficient space for the cache.
Solution:
# Option 1: limit GPU memory utilization
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.85 # safe value
# Option 2: CPU offload (reduces performance)
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--cpu-offload-gb 20

2. Runtime burst OOM
Sudden influx of long prompts, many prefill requests, or a few extremely long contexts can exceed the KV‑cache pool.
Example log:
2024-03-15 14:32:45 ERROR vllm.worker: CUDA out of memory during forward pass
2024-03-15 14:32:45 ERROR vllm.engine: Request req_12345 failed: OOM

Solution: limit the number of tokens processed per batch.
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--max-num-seqs 32 \
--max-model-len 32768 \
--max-num-batched-tokens 16384

The flag --max-num-batched-tokens caps the token count per forward pass, preventing OOM at the cost of reduced throughput. Recommended values based on GPU memory:
GPU Memory | Recommended max-num-batched-tokens
-----------|-----------------------------------
24 GB      | 4096
40 GB      | 8192
80 GB      | 16384-24576

3. Chronic OOM in long-running services
After days of continuous operation, the PyTorch CUDA allocator itself can fragment, even though vLLM manages the KV cache at the block level.
Diagnostic script (prints fragmentation metrics):
import torch

def print_cuda_memory_stats():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:")
        print(f"  Allocated: {torch.cuda.memory_allocated(i)/1024**3:.2f} GB")
        print(f"  Reserved:  {torch.cuda.memory_reserved(i)/1024**3:.2f} GB")
        print(f"  Free:      {(torch.cuda.memory_reserved(i)-torch.cuda.memory_allocated(i))/1024**3:.2f} GB")
        stats = torch.cuda.memory_stats(i)
        # Fragmentation = share of reserved memory that is not actively used
        active = stats.get('active_bytes.all.peak', 0)
        reserved = max(stats.get('reserved_bytes.all.peak', 1), 1)
        frag = 1 - active / reserved
        print(f"  Fragmentation: {frag:.2%}")

If fragmentation exceeds roughly 30 % (i.e., less than ~70 % of reserved memory is actively used), restart the service or adjust the allocator:
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--enable-prefix-caching \
...

4. Prefix-caching OOM
Prefix caching keeps KV blocks for identical prompts. Low hit‑rate with many distinct prefixes can exhaust memory.
Inspect cache usage:
import requests

metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if "prefix_cache" in line:
        print(line)

Typical output:
vllm_prefix_cache_blocks_used 12345
vllm_prefix_cache_blocks_total 20000
vllm_prefix_cache_hit_rate 0.23

Solutions:
Disable prefix caching when hit‑rate < 0.3.
Limit cache size via --max-num-seqs or avoid enabling the feature.
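The hit‑rate check can be automated with a small helper (the function names are hypothetical; the metric name follows the /metrics output shown above):

```python
def prefix_cache_hit_rate(metrics_text):
    """Extract vllm_prefix_cache_hit_rate from Prometheus-style metrics text."""
    for line in metrics_text.splitlines():
        if line.startswith("vllm_prefix_cache_hit_rate"):
            return float(line.split()[-1])
    return None  # metric absent: prefix caching likely not enabled

def should_disable_prefix_cache(metrics_text, min_hit_rate=0.3):
    rate = prefix_cache_hit_rate(metrics_text)
    return rate is not None and rate < min_hit_rate

sample = "vllm_prefix_cache_blocks_used 12345\nvllm_prefix_cache_hit_rate 0.23"
print(should_disable_prefix_cache(sample))  # True -- 0.23 < 0.3
```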
Deep Dive into Fragmentation
Block table structure
Request A: [0, 1, 5, 9] # uses 4 blocks
Request B: [2, 3, 4, 6, 7, 8] # uses 6 blocks
Request C: [10, 11]          # uses 2 blocks

When Request A finishes, its blocks become free but are non-contiguous, which can affect CUDA copy efficiency, L2 cache hit rate, and TLB misses.
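A toy free-list model (illustrative only, not vLLM's actual allocator) shows how completed requests punch non-contiguous holes into the block pool:

```python
free = set(range(12))          # a 12-block KV-cache pool
tables = {}                    # request -> list of block ids

def alloc(req, n):
    blocks = sorted(free)[:n]  # take the lowest-numbered free blocks
    free.difference_update(blocks)
    tables[req] = blocks

def release(req):
    free.update(tables.pop(req))

alloc("A", 4)   # A -> [0, 1, 2, 3]
alloc("B", 6)   # B -> [4, 5, 6, 7, 8, 9]
alloc("C", 2)   # C -> [10, 11]
release("A")
release("C")
print(sorted(free))  # [0, 1, 2, 3, 10, 11] -- free, but split around B's live blocks
```

Because the block table adds a level of indirection, a new request can still use these scattered blocks; the non-locality is what degrades copy efficiency and cache behavior.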
Quantifying fragmentation
import requests

def parse_metrics(text):
    # Parse Prometheus-style "name value" lines into a dict
    d = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            parts = line.split()
            if len(parts) >= 2:
                try:
                    d[parts[0]] = float(parts[1])
                except ValueError:
                    pass
    return d

def analyze_fragmentation(vllm_url="http://localhost:8000"):
    metrics_dict = parse_metrics(requests.get(f"{vllm_url}/metrics").text)
    total = metrics_dict.get("vllm_num_gpu_blocks", 0)
    used = metrics_dict.get("vllm_num_gpu_blocks_used", 0)
    free = total - used
    running = metrics_dict.get("vllm_num_requests_running", 0)
    avg_blocks = used / max(running, 1)
    potential = free / max(avg_blocks, 1)
    print(f"Total blocks: {total}")
    print(f"Used blocks: {used}")
    print(f"Free blocks: {free}")
    print(f"Running requests: {running}")
    print(f"Avg blocks/request: {avg_blocks:.1f}")
    print(f"Potential new requests: {potential:.1f}")
    print(f"Memory utilization: {used/max(total, 1)*100:.1f}%")
    return {"total_blocks": total, "used_blocks": used,
            "free_blocks": free, "utilization": used/max(total, 1)}

Choosing block size
Default block size = 16 tokens. Trade‑offs:
Block Size | Advantages | Disadvantages
-----------|---------------------|-----------------
8 | Less internal waste| Larger block table, higher overhead
16 | Balanced | —
32         | Lower management cost| More internal waste

Guidelines:
Use small block size (8) when request lengths vary widely (e.g., 100 → 10 000 tokens).
Use default (16) or larger (32) when lengths are relatively uniform.
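The trade-off can be quantified for a hypothetical mixed workload (the request lengths below are assumptions for illustration):

```python
import math

# Internal fragmentation: unused tokens inside partially filled final blocks
def internal_waste(lengths, block_size):
    return sum(math.ceil(n / block_size) * block_size - n for n in lengths)

request_lengths = [100, 350, 1200, 4000, 10000]  # assumed token counts
for bs in (8, 16, 32):
    print(f"block_size={bs:>2}: {internal_waste(request_lengths, bs)} wasted tokens")
```

Smaller blocks waste fewer tokens but multiply the number of block-table entries the scheduler must track, which is the management overhead the table above refers to.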
Example to reduce internal fragmentation:
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--block-size 8 # less internal waste
...

Remember to adjust --gpu-memory-utilization accordingly, because block size changes the total KV-cache capacity.
Production‑Ready Configuration
Startup script
#!/bin/bash
set -e
export CUDA_VISIBLE_DEVICES=0,1,2,3
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8"
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
MODEL_PATH="/data/models/Qwen2-72B-Instruct"
TP_SIZE=4
PORT=8000
GPU_MEMORY_UTILIZATION=0.88 # A100 80GB
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=64
MAX_NUM_BATCHED_TOKENS=24576
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--tensor-parallel-size $TP_SIZE \
--port $PORT \
--host 0.0.0.0 \
--gpu-memory-utilization $GPU_MEMORY_UTILIZATION \
--max-model-len $MAX_MODEL_LEN \
--max-num-seqs $MAX_NUM_SEQS \
--max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
--block-size 16 \
--swap-space 8 \
--disable-log-requests \
--enable-chunked-prefill \
--max-num-on-the-fly-batches 4 \
--dtype bfloat16 \
--trust-remote-code \
2>&1 | tee -a /var/log/vllm/vllm_$(date +%Y%m%d).log

Key parameters explained
--swap-space 8 : Enables swapping unused KV blocks to CPU (8 GB).
--enable-chunked-prefill : Splits long prompts into chunks to avoid a single huge prefill.
--max-num-on-the-fly-batches 4 : Controls concurrent GPU batches; higher values increase memory pressure.
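The idea behind chunked prefill can be sketched in a few lines; the chunk size here is an arbitrary illustration (vLLM bounds each chunk by its batched-token budget rather than a fixed constant):

```python
def chunk_prefill(prompt_tokens, chunk_size):
    # Process a long prompt as several bounded forward passes instead of one
    for start in range(0, len(prompt_tokens), chunk_size):
        yield prompt_tokens[start:start + chunk_size]

prompt = list(range(30000))            # a 30k-token prompt
sizes = [len(c) for c in chunk_prefill(prompt, 8192)]
print(sizes)  # [8192, 8192, 8192, 5424]
```

Each chunk's activation and KV memory stays bounded, so a single 30k-token prompt no longer produces one huge allocation spike.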
Dynamic tuning sidecar
import requests, subprocess, time

class VLLMTuner:
    def __init__(self, url="http://localhost:8000"):
        self.url = url
        self.history = []

    def get_metrics(self):
        try:
            r = requests.get(f"{self.url}/metrics", timeout=5)
            return self.parse_metrics(r.text)
        except requests.RequestException:
            return None

    def parse_metrics(self, txt):
        d = {}
        for line in txt.splitlines():
            if line and not line.startswith("#"):
                parts = line.split()
                if len(parts) >= 2:
                    d[parts[0]] = float(parts[1])
        return d

    def should_restart(self, m):
        # Restart when 3+ pressure events accumulate within 5 minutes
        if not m:
            return False
        if m.get("vllm_gpu_cache_usage_perc", 0) > 0.95:
            self.history.append(("high_memory", time.time()))
        if m.get("vllm_num_requests_waiting", 0) > m.get("vllm_num_requests_running", 0) * 2:
            self.history.append(("queue_backlog", time.time()))
        recent = [h for h in self.history if time.time() - h[1] < 300]
        return len(recent) >= 3

    def graceful_restart(self):
        print("Initiating graceful restart...")
        for _ in range(60):  # wait up to 60 s for in-flight requests to drain
            m = self.get_metrics()
            if m and m.get("vllm_num_requests_running", 0) == 0:
                break
            time.sleep(1)
        subprocess.run(["systemctl", "restart", "vllm"])
        print("Restart completed")

    def log_status(self, m):
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] "
              f"Running:{m.get('vllm_num_requests_running', 0)} "
              f"Waiting:{m.get('vllm_num_requests_waiting', 0)} "
              f"GPU Cache:{m.get('vllm_gpu_cache_usage_perc', 0)*100:.1f}%")

    def run(self):
        while True:
            m = self.get_metrics()
            if m:
                self.log_status(m)
                if self.should_restart(m):
                    self.graceful_restart()
                    self.history = []
            time.sleep(30)

if __name__ == "__main__":
    VLLMTuner().run()

Troubleshooting Flowchart
OOM occurs
│
├─→ Startup OOM?
│ ├─ Yes → lower --gpu-memory-utilization
│ └─ No ↓
│
├─→ Burst OOM?
│ ├─ Yes → check for ultra‑long requests, lower --max-num-batched-tokens
│ └─ No ↓
│
├─→ Chronic OOM?
│ ├─ Yes → inspect CUDA fragmentation, consider periodic restart
│ └─ No ↓
│
└─→ Prefix-caching OOM? → evaluate hit-rate, disable or cap the cache

Common diagnostic commands
nvidia-smi pmon -s um -d 1 – per-process GPU memory usage.
python -c "import torch; print(torch.cuda.memory_summary())" – detailed CUDA allocator stats.
curl -s http://localhost:8000/metrics | grep -E "(vllm_num|vllm_gpu)" – vLLM metrics.
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv' – real-time GPU monitor.
grep -i "oom\|memory\|cuda" /var/log/vllm/vllm_*.log | tail -50 – recent OOM-related log entries.
Emergency recovery
# Kill stray vLLM processes
pkill -9 -f "vllm.entrypoints"
# Reset GPU memory (use with caution)
nvidia-smi --gpu-reset -i 0,1,2,3
# Clear CUDA cache
python -c "import torch; torch.cuda.empty_cache()"
# Restart with conservative config
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2-72B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.75 \
--max-num-seqs 16 \
--max-num-batched-tokens 8192

Monitoring & Alerting
Prometheus alerts (example)
groups:
- name: vllm
rules:
- alert: VLLMHighMemoryUsage
expr: vllm_gpu_cache_usage_perc > 0.90
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM GPU cache usage is high"
description: "GPU cache usage is {{ $value | humanizePercentage }}"
- alert: VLLMCriticalMemoryUsage
expr: vllm_gpu_cache_usage_perc > 0.95
for: 2m
labels:
severity: critical
annotations:
summary: "vLLM GPU cache usage is critical"
- alert: VLLMQueueBacklog
expr: vllm_num_requests_waiting > vllm_num_requests_running * 3
for: 5m
labels:
severity: warning
annotations:
summary: "vLLM request queue is backing up"
- alert: VLLMNoRequests
expr: rate(vllm_request_success_total[5m]) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "vLLM is not processing any requests"

Grafana dashboard (core panels)
{
"title": "vLLM Performance",
"panels": [
{"title":"GPU Memory Usage","type":"gauge","targets":[{"expr":"vllm_gpu_cache_usage_perc * 100"}]},
{"title":"Request Queue","type":"timeseries","targets":[{"expr":"vllm_num_requests_running","legendFormat":"Running"},{"expr":"vllm_num_requests_waiting","legendFormat":"Waiting"}]},
{"title":"Token Throughput","type":"timeseries","targets":[{"expr":"rate(vllm_generation_tokens_total[1m])","legendFormat":"tokens/s"}]},
{"title":"Request Latency P99","type":"timeseries","targets":[{"expr":"histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m]))","legendFormat":"P99"}]}
]
}

Conclusion
PagedAttention reduces KV-cache waste but introduces internal and external fragmentation that can trigger OOM in production. Preventive measures include tuning --gpu-memory-utilization and --max-num-batched-tokens, monitoring GPU cache usage and fragmentation, and applying the concrete configuration and restart strategies described above. Continuous profiling and adjustment are required because optimal settings depend on workload characteristics and hardware.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.