Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops
This guide walks through the complete process of deploying a high‑throughput large language model inference service using vLLM, covering environment preparation, installation, configuration tuning, performance testing, monitoring, and backup strategies for production‑grade deployments.
Overview
Our original hand‑rolled inference service achieved ~50 QPS on an A100; switching to vLLM with PagedAttention and Continuous Batching raised throughput to over 200 QPS with latency under 2 s.
Key Features
PagedAttention pages the KV cache dynamically, improving GPU memory utilization 2‑3×.
Continuous Batching lets requests join/leave a batch at any time, reducing latency for mixed‑length workloads.
OpenAI‑compatible API provides /v1/completions and /v1/chat/completions endpoints.
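As a quick illustration of the engine (independent of the server deployment below), vLLM's offline API submits a list of prompts in one call and lets the engine batch them continuously; a minimal sketch, assuming the model is already downloaded:
# offline_batch.py: minimal sketch of vLLM's offline batched inference;
# the engine schedules all prompts in one continuously batched run
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")  # a local path works too
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

prompts = [
    "What is the capital of France?",
    "Explain PagedAttention in one sentence.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)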
Environment Requirements
OS: Ubuntu 22.04 LTS
NVIDIA driver ≥ 535, CUDA 12.1+
Python 3.9‑3.11, PyTorch 2.1.2 + cu121
GPU: A100/H100/A10/L40; ≥40 GB VRAM recommended (24 GB minimum)
System RAM ≥ 64 GB, NVMe ≥ 1 TB
Installation Steps
System preparation
# Verify OS and GPU
cat /etc/os-release
nvidia-smi
nvcc --version
# Update system
sudo apt update && sudo apt upgrade -y
# Install build tools
sudo apt install -y build-essential cmake git curl wget python3.10 python3.10-venv python3.10-dev python3-pip ccache ninja-build libopenmpi-dev
# Install CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-1
# Set environment
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Python environment
# Create directory
sudo mkdir -p /opt/vllm
sudo chown $USER:$USER /opt/vllm
# Virtualenv
python3.10 -m venv /opt/vllm/venv
source /opt/vllm/venv/bin/activate
pip install --upgrade pip setuptools wheel
# Install PyTorch (CUDA 12.1)
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
# Install vLLM
pip install vllm==0.2.7
# Optional: build from source
# git clone https://github.com/vllm-project/vllm.git
# cd vllm && pip install -e .
# Verify
python -c "import vllm; print(vllm.__version__)"
# Additional runtime deps
pip install fastapi==0.104.1 uvicorn[standard]==0.24.0 pydantic==2.5.0 prometheus-client==0.19.0 aiohttp requests
Model preparation
# Create model directory
sudo mkdir -p /data/models && sudo chown $USER:$USER /data/models
# Download Llama‑2‑13B‑Chat (example)
pip install huggingface-hub
# Note: this repo is gated; accept the license on Hugging Face and run huggingface-cli login first
huggingface-cli download meta-llama/Llama-2-13b-chat-hf --local-dir /data/models/llama2-13b-chat
# Verify files: config.json, tokenizer.json, pytorch_model-*.bin
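A quick way to confirm the download is usable before serving it; a small sketch using transformers (installed as a vLLM dependency), with the path matching the directory above:
# verify_model.py: sanity-check the downloaded model directory
from pathlib import Path

from transformers import AutoConfig, AutoTokenizer

model_dir = Path("/data/models/llama2-13b-chat")

# config and tokenizer must load without errors
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print(f"model_type={config.model_type}, vocab_size={tokenizer.vocab_size}")

# at least one weight shard should be present
shards = list(model_dir.glob("pytorch_model-*.bin")) + list(model_dir.glob("*.safetensors"))
print(f"{len(shards)} weight shard(s) found")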
Optional 4‑bit AWQ quantization
# quantize_awq.py (requires: pip install autoawq)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "/data/models/llama2-13b-chat"
quant_path = "/data/models/llama2-13b-chat-awq"
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Quantized model saved to {quant_path}")Run with python quantize_awq.py. AWQ reduces a 13B model from ~26 GB to ~7 GB and roughly doubles throughput.
vLLM Configuration
# /opt/vllm/config.yaml
model:
  name: meta-llama/Llama-2-13b-chat-hf
  path: /data/models/llama2-13b-chat-awq
  tokenizer: /data/models/llama2-13b-chat-awq
  quantization: awq
  dtype: auto
  trust_remote_code: false
server:
  host: 0.0.0.0
  port: 8000
  api_key: "your-secret-api-key"
  log_level: info
  timeout: 600
engine:
  tensor_parallel_size: 1
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_num_batched_tokens: 8192
  max_num_seqs: 256
  max_model_len: 4096
  block_size: 16
  swap_space: 4
  enable_prefix_caching: true
generation:
  max_tokens: 2048
  temperature: 0.8
  top_p: 0.95
  top_k: 50
  stop: ["</s>", "[INST]", "[/INST]"]
Important parameter notes
gpu_memory_utilization should stay below 0.95; 0.88‑0.92 is safe for 24 GB GPUs.
max_num_seqs and max_num_batched_tokens must be balanced; a rule of thumb is max_num_batched_tokens ≈ max_num_seqs × avg_seq_len × 2.
block_size = 16 works for most workloads; use 32 only for sequences > 4k tokens.
Enable enable_prefix_caching for multi‑turn conversations.
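The rule of thumb above can be sanity-checked in a few lines of Python; a back-of-the-envelope sketch (the example numbers are illustrative, not a tuning recommendation):
# sizing.py: back-of-the-envelope check of the batching rule of thumb
def suggest_max_num_batched_tokens(max_num_seqs: int, avg_seq_len: int) -> int:
    """max_num_batched_tokens ≈ max_num_seqs × avg_seq_len × 2."""
    return max_num_seqs * avg_seq_len * 2

if __name__ == "__main__":
    # e.g. 64 concurrent sequences averaging 64 in-flight tokens each
    print(suggest_max_num_batched_tokens(64, 64))  # -> 8192, the value used above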
Service Startup
# /opt/vllm/start_server.sh
#!/bin/bash
source /opt/vllm/venv/bin/activate
mkdir -p /var/log/vllm  # ensure the log directory exists before tee writes to it
python -m vllm.entrypoints.openai.api_server \
--model /data/models/llama2-13b-chat-awq \
--quantization awq \
--dtype auto \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--max-model-len 4096 \
--block-size 16 \
--swap-space 4 \
--enable-prefix-caching \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-api-key \
2>&1 | tee /var/log/vllm/server.log
systemd unit (/etc/systemd/system/vllm.service)
[Unit]
Description=vLLM Inference Service
After=network.target
[Service]
Type=simple
# systemd does not expand $USER; set this to the account that owns /opt/vllm
User=vllm
WorkingDirectory=/opt/vllm
Environment="PATH=/opt/vllm/venv/bin:/usr/local/cuda-12.1/bin:/usr/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64"
ExecStart=/opt/vllm/start_server.sh
Restart=always
RestartSec=10
StandardOutput=append:/var/log/vllm/stdout.log
StandardError=append:/var/log/vllm/stderr.log
LimitNOFILE=65536
LimitNPROC=4096
[Install]
WantedBy=multi-user.target
Verification
Health check: curl http://localhost:8000/health (returns HTTP 200 when the server is ready)
Model list: curl http://localhost:8000/v1/models
Completion example:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-api-key" \
-d '{"model":"/data/models/llama2-13b-chat-awq","prompt":"What is the capital of France?","max_tokens":100,"temperature":0.7}'Chat example (streaming):
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-api-key" \
-d '{"model":"/data/models/llama2-13b-chat-awq","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"写一首诗"}],"stream":true,"max_tokens":200}'Load testing
Load testing
Install Locust and run a 5‑minute test with 50 concurrent users (a minimal locustfile.py sketch follows the target metrics):
pip install locust
locust -f locustfile.py --host=http://localhost:8000 \
--users=50 --spawn-rate=5 --run-time=5m --headless
Target metrics: RPS > 50, p50 latency < 2 s, GPU utilization 70‑95 %.
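The locustfile.py referenced above might look like the following sketch; the endpoint, key, and payload mirror the completion example, and the wait time is an assumption:
# locustfile.py: minimal load-test user hitting /v1/completions
from locust import HttpUser, between, task

class VLLMUser(HttpUser):
    wait_time = between(0.5, 2.0)  # think time between requests per user

    @task
    def completion(self):
        self.client.post(
            "/v1/completions",
            headers={"Authorization": "Bearer your-secret-api-key"},
            json={
                "model": "/data/models/llama2-13b-chat-awq",
                "prompt": "What is the capital of France?",
                "max_tokens": 100,
                "temperature": 0.7,
            },
        )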
Best‑practice checklist
Quantization (AWQ or GPTQ) is essential for serving 13B models on 24 GB GPUs and frees KV‑cache headroom on 40 GB GPUs.
Never set gpu_memory_utilization to 1.0; keep a 10‑15 % safety margin.
Enable enable_prefix_caching for multi‑turn dialogs.
Adjust block_size according to typical sequence length.
Use tensor parallelism (--tensor-parallel-size) when multiple GPUs are available; the size must evenly divide the model's attention-head count (in practice 2, 4, or 8).
Configure API gateway timeouts (e.g., proxy_read_timeout 120s) and rate limiting to avoid overload.
Prefer graceful shutdown via systemctl stop vllm to let in‑flight requests finish.
Enable systemd Restart=always to recover from CUDA crashes.
Monitoring & Alerting
Expose vLLM stats at /stats and combine with Prometheus + Grafana. Example Prometheus scrape:
# prometheus.yml snippet
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:9090']  # vllm‑monitor exporter
  - job_name: 'nvidia_gpu'
    static_configs:
      - targets: ['localhost:9835']  # nvidia‑gpu‑exporter
Key metrics to monitor:
GPU utilization (70‑95 %)
GPU memory usage (≈ gpu_memory_utilization × total)
vLLM active requests and queue size
Prompt and generation throughput (tokens/s)
API p95 latency (< 3 s)
Error rate (< 0.1 %)
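The vllm‑monitor exporter in the scrape config (and the monitor.py backed up later) could be a small prometheus_client script along these lines; a sketch only, since the field names in the /stats payload are assumptions:
# monitor.py: sketch of a minimal vllm-monitor Prometheus exporter
import time

import requests
from prometheus_client import Gauge, start_http_server

VLLM_STATS_URL = "http://localhost:8000/stats"

gpu_mem_used = Gauge("vllm_gpu_memory_used_mb", "GPU memory used by vLLM (MB)")
active_requests = Gauge("vllm_active_requests", "Requests currently in flight")

def collect() -> None:
    stats = requests.get(VLLM_STATS_URL, timeout=5).json()
    # field names are placeholders; adapt to the actual /stats payload
    gpu_mem_used.set(stats.get("gpu_memory_used_mb", 0))
    active_requests.set(stats.get("num_running_requests", 0))

if __name__ == "__main__":
    start_http_server(9090)  # port matches the prometheus scrape config above
    while True:
        collect()
        time.sleep(15)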
Sample alert rule for high memory usage:
- alert: VLLMHighMemoryUsage
  expr: vllm_gpu_memory_used_mb / nvidia_gpu_memory_total_mb > 0.95
  for: 5m
  annotations:
    summary: "vLLM GPU memory usage > 95%"
Backup & Restore
Backup script saves configuration, startup script, systemd unit, monitoring scripts and recent logs, then archives them:
#!/bin/bash
BACKUP_DIR="/data/backups/vllm"
TS=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR/backup_$TS"
cp /opt/vllm/config.yaml "$BACKUP_DIR/backup_$TS/"
cp /opt/vllm/start_server.sh "$BACKUP_DIR/backup_$TS/"
cp /etc/systemd/system/vllm.service "$BACKUP_DIR/backup_$TS/"
cp /opt/vllm/monitor.py "$BACKUP_DIR/backup_$TS/"
find /var/log/vllm -name "*.log" -mtime -3 -exec cp {} "$BACKUP_DIR/backup_$TS/" \;
cd "$BACKUP_DIR"
tar -czf backup_${TS}.tar.gz backup_$TS
rm -rf "backup_$TS"
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -deleteRestore steps:
sudo systemctl stop vllm vllm-monitor
tar -xzf backup_20250115_020000.tar.gz -C /opt/vllm
cp /opt/vllm/backup_20250115_020000/config.yaml /opt/vllm/backup_20250115_020000/start_server.sh /opt/vllm/
sudo cp /opt/vllm/backup_20250115_020000/vllm.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start vllm
sudo systemctl start vllm-monitor
# Wait ~2 min for model load, then run a health check
curl http://localhost:8000/health
Conclusion
vLLM’s PagedAttention and Continuous Batching provide 2‑3× higher memory efficiency and 4‑5× higher concurrency compared with traditional inference pipelines. Proper quantization, careful tuning of gpu_memory_utilization, max_num_seqs, and max_num_batched_tokens, together with robust monitoring and graceful restart, enable production‑grade large‑model services on a single GPU.