Boost LLM Inference Speed: Build a High‑Concurrency vLLM Service with Best‑Practice Ops

This guide walks through the complete process of deploying a high‑throughput large language model inference service using vLLM, covering environment preparation, installation, configuration tuning, performance testing, real‑world case studies, monitoring, troubleshooting, and backup strategies for production‑grade deployments.

Ops Community

Overview

The original hand-rolled inference stack achieved ~50 QPS on a single A100; switching to vLLM with PagedAttention and Continuous Batching raised throughput above 200 QPS while keeping latency under 2 s.

Key Features

PagedAttention dynamically pages KV‑Cache, improving GPU memory utilization 2‑3×.

Continuous Batching lets requests join/leave a batch at any time, reducing latency for mixed‑length workloads.

OpenAI‑compatible API provides /v1/completions and /v1/chat/completions endpoints.
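Because the API is OpenAI-compatible, any OpenAI-style client works against it. A minimal stdlib sketch (the URL, model path, and API key below are assumptions matching the deployment described later; adjust to your setup):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = "your-secret-api-key"

def build_chat_request(model: str, user_msg: str, max_tokens: int = 200) -> dict:
    """Assemble an OpenAI-style chat payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": max_tokens,
    }

def chat(payload: dict) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Live call (requires the server to be running):
# print(chat(build_chat_request("/data/models/llama2-13b-chat-awq",
#                               "What is the capital of France?")))
```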

Environment Requirements

OS: Ubuntu 22.04 LTS

NVIDIA driver ≥ 535, CUDA 12.1+

Python 3.9‑3.11, PyTorch 2.1.2 + cu121

GPU: A100/H100/A10/L40; ≥40 GB VRAM recommended (24 GB minimum with quantization)

System RAM ≥ 64 GB, NVMe ≥ 1 TB

Installation Steps

System preparation

# Verify OS and GPU
cat /etc/os-release
nvidia-smi
nvcc --version
# Update system
sudo apt update && sudo apt upgrade -y
# Install build tools
sudo apt install -y build-essential cmake git curl wget python3.10 python3.10-venv python3.10-dev python3-pip ccache ninja-build libopenmpi-dev
# Install CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-1
# Set environment
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Python environment

# Create directory
sudo mkdir -p /opt/vllm
sudo chown $USER:$USER /opt/vllm
# Virtualenv
python3.10 -m venv /opt/vllm/venv
source /opt/vllm/venv/bin/activate
pip install --upgrade pip setuptools wheel
# Install PyTorch (CUDA 12.1)
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
# Install vLLM
pip install vllm==0.2.7
# Optional: build from source
# git clone https://github.com/vllm-project/vllm.git
# cd vllm && pip install -e .
# Verify
python -c "import vllm; print(vllm.__version__)"
# Additional runtime deps
pip install fastapi==0.104.1 uvicorn[standard]==0.24.0 pydantic==2.5.0 prometheus-client==0.19.0 aiohttp requests

Model preparation

# Create model directory
sudo mkdir -p /data/models && sudo chown $USER:$USER /data/models
# Download Llama‑2‑13B‑Chat (example)
pip install huggingface-hub
huggingface-cli login   # the meta-llama repo is gated; log in with an approved account
huggingface-cli download meta-llama/Llama-2-13b-chat-hf --local-dir /data/models/llama2-13b-chat
# Verify files: config.json, tokenizer files, and weight shards (*.safetensors or pytorch_model-*.bin)

Optional 4‑bit AWQ quantization

# quantize_awq.py — requires: pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "/data/models/llama2-13b-chat"
quant_path = "/data/models/llama2-13b-chat-awq"
model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Quantized model saved to {quant_path}")

Run with python quantize_awq.py. AWQ reduces a 13B model from ~26 GB to ~7 GB and roughly doubles throughput.
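The size reduction follows directly from the bit width. A back-of-envelope estimate (a sketch; real on-disk sizes also include layers kept in fp16 plus AWQ scales and zero points, approximated here as ~0.5 extra bits per weight):

```python
def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB for a model of the given size."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

fp16 = weight_gib(13, 16)   # ~24 GiB for Llama-2-13B in fp16
awq4 = weight_gib(13, 4.5)  # ~6.8 GiB at 4-bit plus per-group overhead
print(f"fp16: {fp16:.1f} GiB, AWQ 4-bit: {awq4:.1f} GiB")
```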

vLLM Configuration

model:
  name: meta-llama/Llama-2-13b-chat-hf
  path: /data/models/llama2-13b-chat-awq
  tokenizer: /data/models/llama2-13b-chat-awq
  quantization: awq
  dtype: auto
  trust_remote_code: false
server:
  host: 0.0.0.0
  port: 8000
  api_key: "your-secret-api-key"
  log_level: info
  timeout: 600
engine:
  tensor_parallel_size: 1
  pipeline_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_num_batched_tokens: 8192
  max_num_seqs: 256
  max_model_len: 4096
  block_size: 16
  swap_space: 4
  enable_prefix_caching: true
generation:
  max_tokens: 2048
  temperature: 0.8
  top_p: 0.95
  top_k: 50
  stop: ["</s>", "[INST]", "[/INST]"]

Important parameter notes

gpu_memory_utilization should stay below 0.95; 0.88‑0.92 is safe for 24 GB GPUs.

max_num_seqs and max_num_batched_tokens must be balanced; a rule of thumb is max_num_batched_tokens ≈ max_num_seqs × avg_seq_len × 2.

block_size = 16 works for most workloads; use 32 only for sequences > 4 k tokens.

Enable enable_prefix_caching for multi‑turn conversations.
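What gpu_memory_utilization buys you is KV-cache headroom, which is easy to estimate. A sketch, with Llama-2-13B shapes assumed (40 layers, 40 KV heads, head_dim 128, fp16 cache at 2 bytes per element):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: K and V for every layer and head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def cache_capacity_tokens(free_vram_gib: float, per_token: int) -> int:
    """How many tokens fit in the VRAM left over after model weights."""
    return int(free_vram_gib * 2**30 // per_token)

per_tok = kv_bytes_per_token(40, 40, 128)           # 819_200 bytes ≈ 0.78 MiB
print(per_tok, cache_capacity_tokens(12, per_tok))  # ~15.7k tokens in 12 GiB
```

Dividing that token budget by your average sequence length gives a realistic upper bound for max_num_seqs.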

Service Startup

# /opt/vllm/start_server.sh
#!/bin/bash
# Log directory must exist: sudo mkdir -p /var/log/vllm && sudo chown $USER /var/log/vllm
source /opt/vllm/venv/bin/activate
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/llama2-13b-chat-awq \
    --quantization awq \
    --dtype auto \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --max-model-len 4096 \
    --block-size 16 \
    --swap-space 4 \
    --enable-prefix-caching \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key your-secret-api-key \
    2>&1 | tee /var/log/vllm/server.log

systemd unit

[Unit]
Description=vLLM Inference Service
After=network.target

[Service]
Type=simple
# systemd does not expand $USER; set the real service account here
User=vllm
WorkingDirectory=/opt/vllm
Environment="PATH=/opt/vllm/venv/bin:/usr/local/cuda-12.1/bin:/usr/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64"
ExecStart=/opt/vllm/start_server.sh
Restart=always
RestartSec=10
StandardOutput=append:/var/log/vllm/stdout.log
StandardError=append:/var/log/vllm/stderr.log
LimitNOFILE=65536
LimitNPROC=4096

[Install]
WantedBy=multi-user.target

Verification

Health check: curl http://localhost:8000/health returns {"status":"ok"}.

Model list: curl http://localhost:8000/v1/models

Completion example:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-api-key" \
  -d '{"model":"/data/models/llama2-13b-chat-awq","prompt":"What is the capital of France?","max_tokens":100,"temperature":0.7}'

Chat example (streaming):

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-api-key" \
  -d '{"model":"/data/models/llama2-13b-chat-awq","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a poem"}],"stream":true,"max_tokens":200}'
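With stream=true the server replies with Server-Sent Events: one "data: {json}" line per chunk, terminated by "data: [DONE]". A sketch of the line-level parsing (the transport layer is omitted; helper names are illustrative):

```python
import json
from typing import Optional

def parse_sse_line(line: str) -> Optional[dict]:
    """Return the decoded chunk, or None for blanks and the [DONE] marker."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

def delta_text(chunk: dict) -> str:
    """Extract the incremental text from a chat.completion.chunk."""
    return chunk["choices"][0]["delta"].get("content", "")

chunk = parse_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}')
print(delta_text(chunk))  # Hi
```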

Load testing

Install Locust and run a 5‑minute test with 50 concurrent users:

pip install locust
locust -f locustfile.py --host=http://localhost:8000 \
    --users=50 --spawn-rate=5 --run-time=5m --headless
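The locustfile.py itself is not shown here; its tasks would POST to /v1/completions. As a dependency-free sketch of the same measurement, a stdlib load generator can report RPS and latency percentiles (the request function is passed in so any HTTP client can be plugged in):

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def run_load(send_one: Callable[[], None], users: int, total: int) -> Dict[str, float]:
    """Fire `total` requests across `users` worker threads and report stats."""
    def timed(_: int) -> float:
        t0 = time.perf_counter()
        send_one()                      # one request, e.g. POST /v1/completions
        return time.perf_counter() - t0

    t_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies: List[float] = list(pool.map(timed, range(total)))
    wall = time.perf_counter() - t_start
    latencies.sort()
    return {
        "rps": total / wall,
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Stand-in for a real request function; swap in an HTTP POST to the server.
stats = run_load(lambda: time.sleep(0.01), users=10, total=50)
print(stats)
```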

Target metrics: RPS > 50, p50 latency < 2 s, GPU utilization 70‑95 %.

Best‑practice checklist

Quantization (AWQ or GPTQ) is essential for 13B models on 40 GB GPUs.

Never set gpu_memory_utilization to 1.0; keep a 10‑15 % safety margin.

Enable enable_prefix_caching for multi‑turn dialogs.

Adjust block_size according to typical sequence length.

Use tensor parallelism (--tensor-parallel-size) when multiple GPUs are available; the size must evenly divide the model's attention head count (in practice, a power of two).

Configure API gateway timeouts (e.g., proxy_read_timeout 120s) and rate limiting to avoid overload.

Prefer graceful shutdown via systemctl stop vllm to let in‑flight requests finish.

Enable systemd Restart=always to recover from CUDA crashes.

Monitoring & Alerting

Expose vLLM stats at /stats and combine with Prometheus + Grafana. Example Prometheus scrape:

# prometheus.yml snippet
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:9090']   # vllm‑monitor exporter
  - job_name: 'nvidia_gpu'
    static_configs:
      - targets: ['localhost:9835']   # nvidia‑gpu‑exporter

Key metrics to monitor:

GPU utilization (70‑95 %)

GPU memory usage (≈ gpu_memory_utilization × total)

vLLM active requests and queue size

Prompt and generation throughput (tokens/s)

API p95 latency (< 3 s)

Error rate (< 0.1 %)

Sample alert rule for high memory usage:

- alert: VLLMHighMemoryUsage
  expr: vllm_gpu_memory_used_mb / nvidia_gpu_memory_total_mb > 0.95
  for: 5m
  annotations:
    summary: "vLLM GPU memory usage > 95%"

Backup & Restore

Backup script saves configuration, startup script, systemd unit, monitoring scripts and recent logs, then archives them:

#!/bin/bash
BACKUP_DIR="/data/backups/vllm"
TS=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR/backup_$TS"
cp /opt/vllm/config.yaml "$BACKUP_DIR/backup_$TS/"
cp /opt/vllm/start_server.sh "$BACKUP_DIR/backup_$TS/"
cp /etc/systemd/system/vllm.service "$BACKUP_DIR/backup_$TS/"
cp /opt/vllm/monitor.py "$BACKUP_DIR/backup_$TS/"
find /var/log/vllm -name "*.log" -mtime -3 -exec cp {} "$BACKUP_DIR/backup_$TS/" \;
cd "$BACKUP_DIR"
tar -czf backup_${TS}.tar.gz backup_$TS
rm -rf "backup_$TS"
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete
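To run the backup nightly, a cron entry can be added (the script path /opt/vllm/backup.sh is an assumption; adjust to where you saved the script above):

```shell
# /etc/cron.d/vllm-backup — nightly at 02:00
0 2 * * * root /opt/vllm/backup.sh >> /var/log/vllm/backup.log 2>&1
```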

Restore steps:

sudo systemctl stop vllm vllm-monitor
tar -xzf backup_20250115_020000.tar.gz
cp backup_20250115_020000/config.yaml backup_20250115_020000/start_server.sh /opt/vllm/
sudo cp backup_20250115_020000/vllm.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start vllm
sudo systemctl start vllm-monitor
# Wait ~2 min for model load, then health check
curl http://localhost:8000/health

Conclusion

vLLM’s PagedAttention and Continuous Batching provide 2‑3× higher memory efficiency and 4‑5× higher concurrency compared with traditional inference pipelines. Proper quantization, careful tuning of gpu_memory_utilization, max_num_seqs, and max_num_batched_tokens, together with robust monitoring and graceful restart, enable production‑grade large‑model services on a single GPU.
