Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks
Only a handful of vLLM settings have a real impact on performance. This guide shows how adjusting gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill raised Qwen2.5‑72B‑Instruct throughput from ~1800 to over 2500 tokens/s and improved latency, and it also covers deployment, monitoring, and troubleshooting.
Overview
vLLM is a high‑performance LLM inference engine that uses PagedAttention to manage KV‑Cache fragmentation. In a production setup running Qwen2.5‑72B‑Instruct on eight A100‑80GB GPUs with tensor parallelism, baseline throughput was ~1800 tokens/s; after tuning it reached ~2500 tokens/s (≈40 % increase) and P99 latency decreased.
Environment
OS: Ubuntu 22.04 LTS
GPUs: 8 × NVIDIA A100‑80GB (NVLink interconnect)
CUDA: 12.4
Python: 3.11
PyTorch: 2.4.0+cu124
vLLM: 0.6.4.post1
Installation
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.6.4.post1
pip install -e ".[all]"
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
pip install -e . --no-build-isolation
Model download
pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download Qwen/Qwen2.5-72B-Instruct \
  --local-dir /data/models/Qwen2.5-72B-Instruct \
  --local-dir-use-symlinks False
Benchmark script
#!/usr/bin/env python3
"""
Benchmark script for vLLM throughput testing.
"""
import time, argparse

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


def generate_prompts(num_prompts: int, prompt_len: int, tokenizer):
    base = "Explain the concept of machine learning in simple terms. " * 50
    tokens = tokenizer.encode(base)[:prompt_len]
    prompt = tokenizer.decode(tokens)
    return [prompt] * num_prompts


def run_benchmark(model_path: str, num_prompts: int = 100,
                  prompt_len: int = 512, output_len: int = 256,
                  tensor_parallel: int = 8, **kwargs):
    llm = LLM(model=model_path,
              tensor_parallel_size=tensor_parallel,
              trust_remote_code=True,
              **kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    prompts = generate_prompts(num_prompts, prompt_len, tokenizer)
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=output_len,
                                     ignore_eos=True)
    # warm-up
    _ = llm.generate(prompts[:10], sampling_params)
    # benchmark
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    total_output = sum(len(o.outputs[0].token_ids) for o in outputs)
    print("=" * 50)
    print(f"Throughput: {total_output / elapsed:.2f} tokens/s")
    print(f"Latency per request: {elapsed / num_prompts * 1000:.2f} ms")
    print("=" * 50)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--num-prompts", type=int, default=100)
    parser.add_argument("--prompt-len", type=int, default=512)
    parser.add_argument("--output-len", type=int, default=256)
    parser.add_argument("--tp", type=int, default=8)
    args = parser.parse_args()
    run_benchmark(args.model, args.num_prompts,
                  args.prompt_len, args.output_len, args.tp)
Key configuration changes ("magic three")
gpu_memory_utilization
# default
llm = LLM(model=model_path, gpu_memory_utilization=0.9)
# optimized
llm = LLM(model=model_path, gpu_memory_utilization=0.95)
Raising the value from 0.90 to 0.95 frees roughly 4 GB per GPU for the KV-cache and was worth about a 15% throughput gain in this setup. For very long contexts, keep it at 0.9 or below to avoid OOM.
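As a quick sanity check of the "~4 GB per GPU" figure, here is a back-of-the-envelope sketch assuming 80 GB cards and the eight-GPU node described above:

# Extra memory handed to the KV-cache when raising gpu_memory_utilization from 0.90 to 0.95
gpu_mem_gb = 80                                  # A100-80GB
extra_per_gpu = gpu_mem_gb * (0.95 - 0.90)
print(f"extra per GPU: {extra_per_gpu:.1f} GB, whole node: {extra_per_gpu * 8:.0f} GB")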
max_num_batched_tokens
# default (auto‑calculated)
# ...
# optimized
llm = LLM(model=model_path,
          max_num_batched_tokens=32768,
          max_num_seqs=256)
Raising the per-step token budget to 32768 allows larger batches; on 8 × A100 this yielded roughly an 18% throughput increase.
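For a rough feel of what the 32768-token budget means at this benchmark's 512-token prompts (illustrative arithmetic only, not vLLM scheduler internals):

# Illustrative only: how much prefill work fits under the 32768-token budget.
max_num_batched_tokens, prompt_len = 32768, 512
print(max_num_batched_tokens // prompt_len)   # 64 prompts' worth of prefill tokens per batch
# Decode adds only ~1 token per running sequence, so max_num_seqs=256 uses a small share of the budget.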
enable_chunked_prefill
# default
llm = LLM(model=model_path, enable_chunked_prefill=False)
# optimized
llm = LLM(model=model_path, enable_chunked_prefill=True)
Chunked prefill splits long prompts into smaller pieces that are interleaved with decode steps, so long requests no longer block short ones. In mixed-length workloads this added ~12% throughput and cut P99 latency by ~20%.
Full optimized configuration
from vllm import LLM, SamplingParams
llm = LLM(
model="/data/models/Qwen2.5-72B-Instruct",
tensor_parallel_size=8,
gpu_memory_utilization=0.95,
max_num_batched_tokens=32768,
enable_chunked_prefill=True,
max_num_seqs=256,
max_model_len=8192, # adjust if 32k context not needed
trust_remote_code=True,
dtype="bfloat16", # A100/H100 benefit from bf16
kv_cache_dtype="fp8_e5m2", # optional experimental cache quantization
)Deployment options
OpenAI‑compatible API server
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2.5-72B-Instruct \
--served-model-name qwen2.5-72b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 32768 \
--enable-chunked-prefill \
--max-num-seqs 256 \
--max-model-len 8192 \
--dtype bfloat16 \
--trust-remote-code \
--disable-log-requests \
  --uvicorn-log-level warning
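Once the server is up, a quick smoke test is a short request through the OpenAI Python client. This is a minimal sketch: it assumes the openai package is installed and that no --api-key is configured, hence the placeholder key.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="qwen2.5-72b",   # must match --served-model-name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
Systemd service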
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI API Server
After=network.target
[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/vllm
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7"
Environment="VLLM_LOGGING_LEVEL=WARNING"
ExecStart=/opt/vllm/venv/bin/python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2.5-72B-Instruct \
--served-model-name qwen2.5-72b \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 32768 \
--enable-chunked-prefill \
--max-num-seqs 256 \
--max-model-len 8192 \
--dtype bfloat16 \
--trust-remote-code
Restart=always
RestartSec=10
LimitNOFILE=65536
LimitNPROC=65536
[Install]
WantedBy=multi-user.target
Docker deployment
# Dockerfile.vllm
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
RUN apt-get update && apt-get install -y \
python3.11 python3.11-venv python3-pip git curl wget && \
rm -rf /var/lib/apt/lists/*
RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir \
vllm==0.6.4.post1 \
torch==2.4.0 \
"transformers>=4.45.0"
RUN mkdir -p /models
WORKDIR /app
COPY healthcheck.py /app/
EXPOSE 8000
ENTRYPOINT ["python","-m","vllm.entrypoints.openai.api_server"]Kubernetes deployment
# vllm-deployment.yaml (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
      - name: vllm
        image: your-registry/vllm:0.6.4
        ports:
        - containerPort: 8000
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3,4,5,6,7"
        - name: VLLM_LOGGING_LEVEL
          value: "WARNING"
        args:
        - --model
        - /models/Qwen2.5-72B-Instruct
        - --served-model-name
        - qwen2.5-72b
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --tensor-parallel-size
        - "8"
        - --gpu-memory-utilization
        - "0.95"
        - --max-num-batched-tokens
        - "32768"
        - --enable-chunked-prefill
        - --max-num-seqs
        - "256"
        - --max-model-len
        - "8192"
        - --dtype
        - bfloat16
        - --trust-remote-code
        resources:
          limits:
            nvidia.com/gpu: 8
            memory: "320Gi"
            cpu: "32"
          requests:
            nvidia.com/gpu: 8
            memory: "256Gi"
            cpu: "16"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 64Gi
Best practices
Model‑size specific tuning
7B model (single A100): gpu_memory_utilization=0.92, max_num_batched_tokens=16384, max_num_seqs=512, enable_chunked_prefill=True
14B model (2 × A100): gpu_memory_utilization=0.93, max_num_batched_tokens=24576, max_num_seqs=384, enable_chunked_prefill=True
72B model (8 × A100): values shown in the full configuration above.
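One way to keep these presets together is a small lookup table. The sketch below simply restates the values above; the helper dict itself is hypothetical, and tensor_parallel_size follows the GPU counts listed for each size.

from vllm import LLM

TUNING_PRESETS = {
    "7b":  dict(tensor_parallel_size=1, gpu_memory_utilization=0.92,
                max_num_batched_tokens=16384, max_num_seqs=512, enable_chunked_prefill=True),
    "14b": dict(tensor_parallel_size=2, gpu_memory_utilization=0.93,
                max_num_batched_tokens=24576, max_num_seqs=384, enable_chunked_prefill=True),
    "72b": dict(tensor_parallel_size=8, gpu_memory_utilization=0.95,
                max_num_batched_tokens=32768, max_num_seqs=256, enable_chunked_prefill=True),
}

llm = LLM(model="/data/models/Qwen2.5-72B-Instruct",
          trust_remote_code=True, **TUNING_PRESETS["72b"])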
KV‑Cache quantization
llm = LLM(model="Qwen2.5-72B-Instruct",
          tensor_parallel_size=8,
          kv_cache_dtype="fp8_e5m2")  # or "fp8_e4m3"
FP8 cache reduces memory usage by ~50% and can increase throughput by 20-30% with minor quality loss.
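To see where the ~50% saving comes from, here is a rough per-token estimate. The layer and head counts are assumed from Qwen2.5-72B's published config and should be verified against the model's config.json.

# Rough KV-cache size per token (assumed Qwen2.5-72B values: 80 layers, 8 KV heads, head_dim 128).
layers, kv_heads, head_dim = 80, 8, 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V for every layer

print(f"bf16: {kv_bytes_per_token(2) / 2**20:.2f} MiB/token")   # ~0.31 MiB
print(f"fp8:  {kv_bytes_per_token(1) / 2**20:.2f} MiB/token")   # half of that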
Prefix caching
llm = LLM(model="Qwen2.5-72B-Instruct",
          enable_prefix_caching=True)
When many requests share a long system prompt (e.g., a 2000-token knowledge base), first-token latency drops from ~180 ms to ~45 ms.
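A minimal sketch of the request pattern that benefits (the knowledge-base text and questions are placeholders):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen2.5-72B-Instruct", tensor_parallel_size=8,
          enable_prefix_caching=True)
shared_system = "You are a support assistant.\n" + "<~2000-token knowledge base goes here>\n"
questions = ["How do I reset my password?", "What is the refund policy?"]
prompts = [shared_system + "User: " + q + "\nAssistant:" for q in questions]
# The shared prefix is prefilled once and its KV blocks are reused for every request.
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))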
Security hardening
Enable API‑key authentication via --api-key YOUR_KEY or the VLLM_API_KEY environment variable.
Place an external rate‑limiter (NGINX, API Gateway) in front of the server.
Validate prompt length and filter prohibited patterns before forwarding requests to the model (a minimal sketch follows below).
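A minimal validation sketch; the token limit and blocked patterns are placeholders, and the check would run in whatever gateway sits in front of vLLM:

import re
from transformers import AutoTokenizer

MAX_PROMPT_TOKENS = 6000                      # placeholder limit
BLOCKED = [re.compile(r"ignore previous instructions", re.IGNORECASE)]
tokenizer = AutoTokenizer.from_pretrained("/data/models/Qwen2.5-72B-Instruct")

def validate_prompt(prompt: str) -> str:
    if any(p.search(prompt) for p in BLOCKED):
        raise ValueError("prompt rejected by content filter")
    if len(tokenizer.encode(prompt)) > MAX_PROMPT_TOKENS:
        raise ValueError("prompt exceeds the configured token limit")
    return prompt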
High availability
Deploy multiple replicas behind a load balancer; use Kubernetes HorizontalPodAutoscaler based on vllm_num_requests_running or custom metrics.
Handle SIGTERM gracefully so in-flight requests can finish before the process exits (a generic sketch is shown below).
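The original article's wrapper is not reproduced here; as a generic stand-in, one pattern is to launch the server as a child process, forward SIGTERM, and allow a bounded drain window. How fully vLLM drains in-flight requests on SIGTERM depends on the version.

import signal, subprocess, sys

cmd = [sys.executable, "-m", "vllm.entrypoints.openai.api_server",
       "--model", "/data/models/Qwen2.5-72B-Instruct", "--port", "8000"]
proc = subprocess.Popen(cmd)

def forward_sigterm(signum, frame):
    proc.send_signal(signal.SIGTERM)   # ask the server to shut down
    try:
        proc.wait(timeout=300)         # drain window for in-flight requests
    except subprocess.TimeoutExpired:
        proc.kill()
    sys.exit(0)

signal.signal(signal.SIGTERM, forward_sigterm)
proc.wait()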
Common error handling
CUDA OOM: lower gpu_memory_utilization, reduce max_num_batched_tokens, or enable KV-cache quantization.
Tensor-parallel NCCL errors: verify the GPU interconnect, match NCCL versions, set NCCL_DEBUG=INFO, and set NCCL_IB_DISABLE=1 if InfiniBand is unavailable.
Request timeouts: enable chunked prefill, increase the client timeout, and monitor vllm_num_requests_waiting.
Model loading failures: ensure the model directory contains complete weight files (model*.safetensors or pytorch_model*.bin plus the accompanying index and config files), pass --trust-remote-code, and keep the transformers version compatible with the model.
Monitoring and observability
vLLM exposes Prometheus metrics at /metrics. Key metrics include:
vllm_num_requests_running (Gauge) – active requests.
vllm_num_requests_waiting (Gauge) – queued requests.
vllm_gpu_cache_usage_perc (Gauge) – KV-cache utilization.
vllm_request_success_total (Counter) – total successful requests.
vllm_time_to_first_token_seconds (Histogram) – time to first token (TTFT).
vllm_time_per_output_token_seconds (Histogram) – per-output-token latency.
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-server:8000']
    metrics_path: /metrics
    scrape_interval: 15s
Backup and recovery
Model files: rsync -avz /data/models/ backup:/backup/models/ or rclone sync /data/models/ s3:bucket/models/.
Service files: archive /etc/systemd/system/vllm.service, startup scripts, and any reverse‑proxy configs.
Monitoring configs: export Prometheus and Grafana configmaps.
Result
Adjusting the three parameters – gpu_memory_utilization (0.9 → 0.95), max_num_batched_tokens (auto → 32768), and enable_chunked_prefill (false → true) – yields roughly a 40 % throughput increase for a 72 B model on eight A100 GPUs while also reducing latency. The guide provides end‑to‑end installation, deployment (bare metal, Docker, Kubernetes), security hardening, high‑availability patterns, troubleshooting, and observability instructions.