How to Quadruple LLM Throughput with vLLM’s PagedAttention and Continuous Batching

This guide details how to replace native Transformers inference with the high‑performance vLLM engine, leveraging PagedAttention, continuous batching, tensor parallelism, and OpenAI‑compatible APIs to achieve 3‑4× higher throughput, lower latency, and scalable multi‑GPU deployments for production‑grade large language models.


Overview

The vLLM engine from UC Berkeley provides a high‑performance inference service for large language models (LLMs). Its core innovation, PagedAttention, replaces the traditional monolithic KV‑Cache with a virtual‑memory‑style paging system, reducing memory waste from 30‑50% to near zero. In production tests, the same hardware that delivered only 15‑20 tokens/s with the native transformers library now achieves 3‑4× higher throughput and sub‑second latency for models such as LLaMA‑2‑13B.
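
To make the idea concrete, the toy sketch below (purely illustrative, not vLLM's actual block manager) shows how a paged KV‑Cache hands out fixed‑size blocks from a shared pool only as sequences grow, instead of reserving one contiguous region per request:

# examples/paged_kv_toy.py — conceptual illustration only
BLOCK_SIZE = 16  # tokens stored per KV-cache block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared pool for all sequences
        self.block_tables = {}                      # request id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Return the physical block that will hold the KV entry for this token."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):    # allocate a new block only when the last one is full
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool (no fragmentation left behind)."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
for pos in range(40):                               # a 40-token request uses ceil(40/16) = 3 blocks
    cache.append_token("req-1", pos)
print(len(cache.block_tables["req-1"]), "blocks in use")
cache.release("req-1")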

Key Features

PagedAttention: block‑wise KV‑Cache management that eliminates fragmentation.

Continuous Batching: new requests are added to a running batch without waiting for the whole batch to finish, dramatically reducing queue latency (see the scheduler sketch after this list).

Tensor Parallelism: native multi‑GPU support; a model can be split across any number of GPUs with minimal code changes.

OpenAI‑compatible API: /v1/completions and /v1/chat/completions endpoints allow seamless migration from existing OpenAI‑based services.

Prefix Caching: re‑uses the KV‑Cache for identical system prompts, cutting first‑token latency by ~40% and increasing throughput by ~25%.

Multi‑model serving: load several models in a single process and route requests by model name.
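
The idea behind continuous batching can be sketched in a few lines (a simplified, single‑threaded model of what the scheduler does each step, not vLLM's real implementation):

# examples/continuous_batching_toy.py — simplified scheduling loop
from collections import deque

def serve_loop(waiting, max_num_seqs=4):
    running = []  # (request_id, tokens_still_to_generate)
    step = 0
    while waiting or running:
        # Admit queued requests as soon as a slot frees up, not when the whole batch ends.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        # One decode step for every running sequence.
        running = [(rid, left - 1) for rid, left in running]
        finished = [rid for rid, left in running if left == 0]
        running = [(rid, left) for rid, left in running if left > 0]
        step += 1
        for rid in finished:
            print(f"step {step}: {rid} finished, slot reused immediately")

serve_loop(deque([("req-1", 3), ("req-2", 8), ("req-3", 2), ("req-4", 5), ("req-5", 1)]))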

Typical Use Cases

High‑concurrency inference (hundreds to thousands of simultaneous requests).

Long‑context generation (8K‑token outputs or more) where KV‑Cache efficiency is critical.

Multi‑tenant SaaS platforms that need to serve many users on limited hardware.

Real‑time interactive applications (chatbots, live translation) that require sub‑second response times.

Cost‑optimized deployments where GPU budget is constrained.

Environment Requirements

Python 3.8‑3.11 (3.10 recommended; Python 3.12 not yet supported).

CUDA ≥ 11.8 (12.1+ recommended) with NVIDIA driver 525+.

PyTorch 2.0+ (2.1+ for vLLM 0.3.x).

vLLM ≥ 0.3.2.

Optional: Ray ≥ 2.9 for distributed serving.

Hardware Recommendations

Development & testing: 1× T4 (16 GB) – suitable for 7B‑13B models.

Small‑scale production: 1× A10 (24 GB) – 13B model or 7B high‑concurrency.

Medium‑scale production: 2× A100 (40 GB) – 70B model or multi‑model serving.

Large‑scale production: 4× A100 (80 GB) – ultra‑large models or extreme concurrency.

Performance Benchmarks

LLaMA‑2‑7B on 1× A10 – 80 tokens/s, P50 150 ms, P99 500 ms, max 200 concurrent requests.

LLaMA‑2‑13B on 1× A100 (40 GB) – 60 tokens/s, P50 200 ms, P99 800 ms, max 150 concurrent requests.

LLaMA‑2‑70B on 4× A100 (40 GB) – 25 tokens/s, P50 500 ms, P99 1500 ms, max 50 concurrent requests.

Qwen‑14B on 1× A100 (40 GB) – 65 tokens/s, P50 180 ms, P99 700 ms, max 180 concurrent requests.

Deployment Guide

1. System Check

# Verify OS, Python, CUDA, GPU and memory
cat /etc/os-release
python3 --version
nvcc --version
nvidia-smi
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv
free -h
df -h

2. Install CUDA (if missing)

# Example: CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
export PATH=/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
nvcc --version

3. Create a Python Virtual Environment

# Install venv and create the environment
sudo apt update && sudo apt install -y python3-pip python3-venv
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip setuptools wheel
which python
python --version

4. Install vLLM and Optional Dependencies

# Install vLLM (recommended version 0.3.2)
pip install vllm==0.3.2
# Or install from source for custom patches
# git clone https://github.com/vllm-project/vllm.git
# cd vllm && pip install -e .
# Optional: Ray for distributed serving, FastAPI & Uvicorn for HTTP gateway
pip install ray[default]==2.9.0 fastapi uvicorn
# Verify installation
python -c "import vllm, torch; print('vLLM', vllm.__version__); print('PyTorch', torch.__version__, 'CUDA', torch.cuda.is_available())"

5. Core Configuration (Single‑Node)

# Minimal launch (development)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 0.0.0.0 \
  --port 8000

# Production‑grade launch (example for LLaMA‑2‑13B)
python -m vllm.entrypoints.openai.api_server \
  --model /models/llama-2-13b-chat \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256 \
  --max-model-len 4096 \
  --dtype float16 \
  --trust-remote-code

Parameter explanation:

--model: local path or HuggingFace model ID.

--tensor-parallel-size: number of GPUs used for tensor parallelism (1 = single GPU).

--gpu-memory-utilization: fraction of GPU memory vLLM is allowed to use for weights plus KV‑Cache (0.85‑0.95 recommended for single‑model workloads).

--max-num-seqs: maximum number of concurrent sequences; tune according to model size and GPU memory.

--max-model-len: maximum total token length (input + output).

--dtype: model precision (float16, bfloat16, or float32).

--enable-prefix-caching: pass this flag when many requests share the same system prompt.
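
The same knobs are available when embedding vLLM directly in a Python process instead of running the API server. A minimal offline‑inference sketch (argument names assumed to match vLLM 0.3.x; check vllm.LLM for your installed version):

# examples/offline_inference.py
from vllm import LLM, SamplingParams

# The constructor arguments mirror the server flags described above.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # local path or HuggingFace model ID
    tensor_parallel_size=1,                  # number of GPUs
    gpu_memory_utilization=0.9,              # fraction of GPU memory vLLM may use
    max_model_len=4096,                      # input + output tokens
    dtype="float16",
)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(["Once upon a time"], params)
print(outputs[0].outputs[0].text)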

6. Systemd Service (Production)

# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
User=vllm
Group=vllm
WorkingDirectory=/opt/vllm
Environment="PATH=/home/vllm/vllm-env/bin:/usr/local/cuda-12.1/bin:/usr/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64"
Environment="CUDA_VISIBLE_DEVICES=0"
ExecStart=/home/vllm/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
  --model /models/llama-2-13b-chat \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 256 \
  --max-model-len 4096 \
  --dtype float16
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo systemctl status vllm

7. Verify the Service

# Health check
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Simple completion test
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-2-7b-chat", "prompt": "San Francisco is a", "max_tokens": 50, "temperature": 0.7}'

Example Code

Python API (OpenAI client)

# examples/basic_inference.py
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# Text completion
completion = client.completions.create(
    model="llama-2-7b-chat",
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)
print(completion.choices[0].text)
# Chat completion
chat = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=100,
    temperature=0.7,
)
print(chat.choices[0].message.content)

Streaming Output

# examples/streaming_inference.py
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
stream = client.chat.completions.create(
    model="llama-2-7b-chat",
    messages=[{"role": "user", "content": "Write a short story about AI."}],
    max_tokens=500,
    temperature=0.7,
    stream=True,
)
print("AI Response: ", end="")
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Batch Inference (async)

# examples/batch_inference.py
import asyncio
from openai import AsyncOpenAI

async def generate_text(client, prompt, idx):
    resp = await client.completions.create(
        model="llama-2-7b-chat",
        prompt=prompt,
        max_tokens=50,
    )
    return idx, resp.choices[0].text

async def batch_inference():
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    prompts = [
        "The future of AI is",
        "Machine learning can",
        "Deep learning models",
        "Natural language processing",
        "Computer vision technology",
    ]
    tasks = [generate_text(client, p, i) for i, p in enumerate(prompts)]
    results = await asyncio.gather(*tasks)
    for idx, text in sorted(results):
        print(f"Prompt {idx}: {prompts[idx]}")
        print(f"Response: {text}
")

asyncio.run(batch_inference())

Real‑World Deployments

1. High‑Concurrency Chatbot Service

Deploy a 13B model on a 2× A100 (40 GB) node with tensor parallelism and prefix caching. The configuration sustains >500 QPS while keeping P99 latency under 1 s.

# Launch command for the chatbot
python -m vllm.entrypoints.openai.api_server \
  --model /models/llama-2-13b-chat \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 \
  --max-model-len 2048 \
  --dtype float16 \
  --enable-prefix-caching

Sample FastAPI gateway (optional):

# chatbot/app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

class ChatRequest(BaseModel):
    user_id: str
    message: str
    history: list = []

class ChatResponse(BaseModel):
    response: str
    tokens_used: int

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    messages = [{"role": "system", "content": "You are a helpful AI assistant."}]
    messages.extend(req.history[-5:])
    messages.append({"role": "user", "content": req.message})
    try:
        completion = await client.chat.completions.create(
            model="llama-2-13b-chat",
            messages=messages,
            max_tokens=512,
            temperature=0.7,
            top_p=0.9,
        )
        return ChatResponse(
            response=completion.choices[0].message.content,
            tokens_used=completion.usage.total_tokens,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
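
Before exposing the gateway to real traffic, a quick concurrency smoke test against the vLLM endpoint helps validate the latency targets. A rough sketch (the concurrency level and model name are placeholders for your deployment):

# examples/load_test.py
import asyncio
import time
from openai import AsyncOpenAI

async def one_request(client, i):
    t0 = time.perf_counter()
    await client.chat.completions.create(
        model="llama-2-13b-chat",
        messages=[{"role": "user", "content": f"Say hello #{i}"}],
        max_tokens=32,
    )
    return time.perf_counter() - t0

async def main(concurrency=50):
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
    t0 = time.perf_counter()
    latencies = await asyncio.gather(*(one_request(client, i) for i in range(concurrency)))
    total = time.perf_counter() - t0
    print(f"{concurrency} requests in {total:.1f}s "
          f"({concurrency / total:.1f} req/s, slowest {max(latencies):.2f}s)")

asyncio.run(main())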

2. Distributed Multi‑Node Cluster for a 70B Model

Use Ray Serve to run three replicas, each with 4 A100 GPUs (tensor parallel size = 4). Nginx provides load balancing and health checks.

# distributed/deploy_ha_vllm.py
import ray
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

ray.init(address="auto")
serve.start(detached=True)

@serve.deployment(name="vllm-70b", num_replicas=3,
                 ray_actor_options={"num_cpus": 16, "num_gpus": 4, "resources": {"node_type": "worker"}},
                 max_concurrent_queries=100,
                 health_check_period_s=10,
                 health_check_timeout_s=30)
class VLLMDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="/models/llama-2-70b-chat",
            tensor_parallel_size=4,
            gpu_memory_utilization=0.95,
            max_num_seqs=256,
            max_model_len=4096,
            dtype="float16",
            trust_remote_code=True,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def generate(self, prompt: str, **kwargs):
        from vllm import SamplingParams
        sampling_params = SamplingParams(
            temperature=kwargs.get("temperature", 0.7),
            top_p=kwargs.get("top_p", 0.9),
            max_tokens=kwargs.get("max_tokens", 512),
        )
        request_id = f"req-{hash(prompt)}"
        results = self.engine.generate(prompt, sampling_params, request_id)
        final_output = None
        async for out in results:
            final_output = out
        return {"text": final_output.outputs[0].text, "tokens": len(final_output.outputs[0].token_ids)}

deployment = VLLMDeployment.bind()
serve.run(deployment, name="vllm-ha-service", route_prefix="/v1/generate")
print("Distributed vLLM service deployed successfully!")

Corresponding Nginx configuration (load‑balancing across the three workers):

# docker/nginx.conf
upstream vllm_backend {
    least_conn;
    server worker1:8000 max_fails=3 fail_timeout=30s;
    server worker2:8000 max_fails=3 fail_timeout=30s;
    server worker3:8000 max_fails=3 fail_timeout=30s;
    keepalive 64;
}
server {
    listen 80;
    location / { proxy_pass http://vllm_backend; }
    location /health { proxy_pass http://vllm_backend/health; }
}

Best Practices & Safety

Performance Optimization

GPU memory utilization: start at 0.85 and increase to 0.95 for single‑model workloads; keep it below 0.80 for multi‑model serving.

max‑num‑seqs: tune per model – 7B (256‑512), 13B (128‑256), 70B (64‑128). Larger values increase throughput but raise the OOM risk.

Prefix caching: enable it when many requests share the same system prompt; expect a ~40% reduction in first‑token latency.

Model pre‑loading: place model files on a fast SSD/NVMe or mount a tmpfs for the model directory to reduce load time.

Tensor parallelism: ensure GPUs are connected via NVLink/NVSwitch for optimal bandwidth; do not set the size higher than the number of GPUs available.

Security Hardening

Wrap vLLM with a FastAPI gateway that validates a JWT token (see security/api_auth.py).

Apply rate limiting with slowapi – e.g., 100 requests/min for completions, 50/min for chat (a minimal sketch follows this list).

Filter input/output for sensitive words and enforce a maximum input length (e.g., 4096 tokens).

Run the service behind an Nginx reverse proxy that terminates TLS; restrict inbound ports to 8000 (API) and 22 (SSH).
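
A minimal sketch of the gateway hardening referenced above, combining a stubbed JWT check with slowapi rate limiting (paths, limits, and the file name are illustrative; the real token validation belongs in security/api_auth.py):

# security/rate_limited_gateway.py — illustrative sketch
from fastapi import FastAPI, HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)   # rate-limit per client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

def verify_jwt(request: Request):
    # Placeholder: validate the Authorization header with your JWT library here.
    if not request.headers.get("Authorization", "").startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing bearer token")

@app.post("/chat")
@limiter.limit("50/minute")                      # 50 chat requests per minute per IP
async def chat(request: Request):
    verify_jwt(request)
    body = await request.json()
    if len(body.get("message", "")) > 4096:      # enforce a maximum input length
        raise HTTPException(status_code=413, detail="Input too long")
    # Forward the validated request to the vLLM endpoint here (see chatbot/app.py).
    return {"accepted": True}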

High Availability

Active‑Passive Failover: use Keepalived with a health‑check script that calls curl http://localhost:8000/health; the virtual IP floats to the healthy node (a sample check script follows this list).

Automatic Restart: a systemd service with Restart=always and RestartSec=10 ensures rapid recovery from crashes.

Ray Serve Replicas: configure num_replicas and let Ray automatically route traffic and restart failed replicas.
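
The health‑check script that Keepalived invokes can be as small as the sketch below; exit code 0 means healthy, anything else triggers failover (the script path is whatever your keepalived.conf vrrp_script references):

# ha/check_vllm.py — Keepalived health-check target
import sys
import requests

try:
    r = requests.get("http://localhost:8000/health", timeout=3)
    sys.exit(0 if r.status_code == 200 else 1)   # 0 = healthy, non-zero = trigger failover
except requests.RequestException:
    sys.exit(1)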

Common Pitfalls & Troubleshooting

CUDA out of memory: reduce --gpu-memory-utilization to ≤ 0.90, lower --max-num-seqs, or move to a GPU with more memory (see the memory check sketch after this list).

Model loading timeout: use SSD/NVMe storage, increase --load-timeout (e.g., --load-timeout 600), or pre‑load the model into tmpfs.

Connection refused: verify the service is running, ensure the chosen port is free (netstat -tulnp | grep 8000), and check firewall rules (sudo ufw allow 8000/tcp).

Invalid model path: confirm the path exists and is readable by the vllm user (sudo chown -R vllm:vllm /models).

Tensor parallel size mismatch: run nvidia-smi -L to see the available GPUs and set --tensor-parallel-size accordingly.

API request timeout: increase the client timeout or reduce max_tokens per request.

Worker process died: inspect logs with journalctl -u vllm -n 100 and monitor GPU memory usage.

Ray cluster connection failed: run ray status, ensure ports 6379 (head) and 8265 (dashboard) are open, and verify network connectivity between nodes.
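
For the memory‑related failures above, it helps to check how much headroom each GPU actually has before restarting the service. A small sketch using PyTorch, which is already installed as a vLLM dependency:

# troubleshooting/gpu_memory.py
import torch

# Print used/total memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)     # both values are in bytes
    name = torch.cuda.get_device_name(i)
    print(f"GPU {i} ({name}): {(total - free) / 1e9:.1f} GB used / {total / 1e9:.1f} GB total")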

Monitoring & Alerting

Prometheus Metrics (exposed at /metrics)

vllm:num_requests_running – active requests.

vllm:num_requests_waiting – queued requests.

vllm:gpu_cache_usage_perc – KV‑Cache memory usage (% of GPU memory).

vllm:time_to_first_token_seconds – first‑token latency.

vllm:time_per_output_token_seconds – per‑token generation time.

vllm:generation_tokens_total – total generated tokens (throughput).
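
A quick way to confirm the exporter works (and to spot-check the gauges the alerts below rely on) is to scrape /metrics directly. A small sketch using prometheus_client's text parser (install the prometheus_client and requests packages separately if needed):

# monitoring/check_metrics.py
import requests
from prometheus_client.parser import text_string_to_metric_families

WATCHED = {
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
}

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WATCHED:
            print(f"{sample.name} = {sample.value}")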

Grafana Dashboard (sample JSON)

{
  "dashboard": {
    "title": "vLLM Performance Dashboard",
    "panels": [
      {"title": "Requests Per Second", "targets": [{"expr": "rate(vllm:num_requests_running[1m])"}]},
      {"title": "GPU Utilization", "targets": [{"expr": "nvidia_gpu_utilization"}]},
      {"title": "P95 TTFT", "targets": [{"expr": "histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))"}]}
    ]
  }
}

Alert Rules (Prometheus)

# alert-rules.yml
groups:
- name: vllm_alerts
  interval: 30s
  rules:
  - alert: VLLMHighLatency
    expr: histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "vLLM first‑token latency high"
      description: "P95 latency > 500 ms (value={{ $value }}s)"
  - alert: VLLMHighQueueLength
    expr: vllm:num_requests_waiting > 50
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "vLLM request queue growing"
      description: "Waiting requests = {{ $value }}"
  - alert: VLLMGPUMemoryHigh
    expr: vllm:gpu_cache_usage_perc > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "vLLM GPU memory usage high"
      description: "GPU cache usage = {{ $value }}%"
  - alert: VLLMServiceDown
    expr: up{job="vllm"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "vLLM service unavailable"
      description: "vLLM has stopped responding"

Backup & Restore

Backup Script (bash)

# backup-vllm.sh
set -e
TIMESTAMP="$(date +%Y%m%d-%H%M%S)"
BACKUP_DIR="/backup/vllm/${TIMESTAMP}"
mkdir -p "${BACKUP_DIR}"/{config,models,logs}
# Config files
cp /etc/systemd/system/vllm.service "${BACKUP_DIR}/config/"
cp /opt/vllm/*.sh "${BACKUP_DIR}/config/" 2>/dev/null || true
cp /etc/nginx/nginx.conf "${BACKUP_DIR}/config/" 2>/dev/null || true
# Model metadata (no heavy weights)
ls -lR /models/ > "${BACKUP_DIR}/models/model-list.txt"
find /models/ -name "config.json" -exec cp --parents {} "${BACKUP_DIR}/models/" \;
# Recent logs (last 7 days)
find /var/log/ -name "*vllm*" -mtime -7 -exec cp {} "${BACKUP_DIR}/logs/" \;
journalctl -u vllm --since "7 days ago" > "${BACKUP_DIR}/logs/vllm-journal.log"
# Monitoring configs
cp /etc/prometheus/prometheus.yml "${BACKUP_DIR}/config/" 2>/dev/null || true
cp /etc/grafana/grafana.ini "${BACKUP_DIR}/config/" 2>/dev/null || true
# Compress (reuse the timestamp so the archive name matches the backup directory)
cd /backup/vllm
tar -czf "vllm-backup-${TIMESTAMP}.tar.gz" "$(basename "${BACKUP_DIR}")"
# Optional remote upload (e.g., AWS S3)
# aws s3 cp "vllm-backup-${TIMESTAMP}.tar.gz" s3://my-backup-bucket/vllm/
# Cleanup old backups (>30 days)
find /backup/vllm -name "vllm-backup-*.tar.gz" -mtime +30 -delete

echo "Backup completed: ${BACKUP_DIR}"

Restore Procedure

Stop services: sudo systemctl stop vllm nginx.

Extract the backup archive to a temporary location: tar -xzf vllm-backup-YYYYMMDD-HHMMSS.tar.gz -C /tmp.

Restore the systemd unit from the extracted backup: sudo cp /tmp/<backup-dir>/config/vllm.service /etc/systemd/system/ && sudo systemctl daemon-reload.

Restore the Nginx config if used: sudo cp /tmp/<backup-dir>/config/nginx.conf /etc/nginx/ and verify with sudo nginx -t.

Ensure model files are present and owned by the vllm user: sudo chown -R vllm:vllm /models.

Start services: sudo systemctl start vllm && sudo systemctl start nginx.

Verify health endpoint and run a test completion request.

Restart Prometheus/Grafana if they were stopped.

Conclusion

Switching from the native transformers inference pipeline to vLLM yields a 3‑4× increase in throughput and a substantial reduction in latency for large language models. The combination of PagedAttention, continuous batching, tensor parallelism, and optional prefix caching makes vLLM suitable for a wide range of production workloads, from high‑concurrency chatbots to long‑context generation and multi‑tenant SaaS platforms. Proper tuning of GPU memory utilization and max‑num‑seqs, together with security hardening (authentication, rate limiting, content filtering), ensures a robust, scalable, and safe deployment. Integrated monitoring with Prometheus/Grafana and a well‑defined HA architecture (Ray Serve, Nginx, Keepalived) provide the observability and resilience required for mission‑critical AI services.
