Boost vLLM Inference Throughput by 40% with Three Simple Config Tweaks

Only a handful of vLLM settings have a real impact on performance. This guide shows how adjusting gpu_memory_utilization and max_num_batched_tokens and enabling chunked prefill raised Qwen2.5-72B-Instruct throughput from ~1800 to over 2500 tokens/s and improved latency, and it also provides deployment, monitoring, and troubleshooting instructions.

MaGe Linux Operations

Overview

vLLM is a high‑performance LLM inference engine that uses PagedAttention to manage KV‑Cache fragmentation. In a production setup running Qwen2.5‑72B‑Instruct on eight A100‑80GB GPUs with tensor parallelism, baseline throughput was ~1800 tokens/s; after tuning it reached ~2500 tokens/s (≈40 % increase) and P99 latency decreased.

Environment

OS: Ubuntu 22.04 LTS

GPUs: 8 × NVIDIA A100‑80GB (NVLink interconnect)

CUDA: 12.4

Python: 3.11

PyTorch: 2.4.0+cu124

vLLM: 0.6.4.post1

Installation

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.6.4.post1
# Build only the CUDA architectures in use (A100 = 8.0, H100 = 9.0) to shorten compile time
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
pip install -e ".[all]" --no-build-isolation

Model download

pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download Qwen/Qwen2.5-72B-Instruct \
    --local-dir /data/models/Qwen2.5-72B-Instruct \
    --local-dir-use-symlinks False

Benchmark script

#!/usr/bin/env python3
"""
Benchmark script for vLLM throughput testing.
"""
import time, argparse
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

def generate_prompts(num_prompts: int, prompt_len: int, tokenizer):
    base = "Explain the concept of machine learning in simple terms. " * 50
    tokens = tokenizer.encode(base)[:prompt_len]
    prompt = tokenizer.decode(tokens)
    return [prompt] * num_prompts

def run_benchmark(model_path: str, num_prompts: int = 100,
                  prompt_len: int = 512, output_len: int = 256,
                  tensor_parallel: int = 8, **kwargs):
    llm = LLM(model=model_path,
              tensor_parallel_size=tensor_parallel,
              trust_remote_code=True,
              **kwargs)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    prompts = generate_prompts(num_prompts, prompt_len, tokenizer)
    sampling_params = SamplingParams(temperature=0.8,
                                     top_p=0.95,
                                     max_tokens=output_len,
                                     ignore_eos=True)
    # warm‑up
    _ = llm.generate(prompts[:10], sampling_params)
    # benchmark
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    total_output = sum(len(o.outputs[0].token_ids) for o in outputs)
    print("="*50)
    print(f"Throughput: {total_output/elapsed:.2f} tokens/s")
    print(f"Latency per request: {elapsed/num_prompts*1000:.2f} ms")
    print("="*50)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--num-prompts", type=int, default=100)
    parser.add_argument("--prompt-len", type=int, default=512)
    parser.add_argument("--output-len", type=int, default=256)
    parser.add_argument("--tp", type=int, default=8)
    args = parser.parse_args()
    run_benchmark(args.model, args.num_prompts,
                  args.prompt_len, args.output_len, args.tp)

Key configuration changes ("magic three")

gpu_memory_utilization

# default
llm = LLM(model=model_path, gpu_memory_utilization=0.9)

# optimized
llm = LLM(model=model_path, gpu_memory_utilization=0.95)

Raising the value to 0.95 makes roughly 4 GB more per GPU available to the KV‑Cache, worth about a 15 % throughput gain here. Keep it at ≤0.9 for very long contexts to avoid OOM. A rough capacity estimate is sketched below.
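
A back-of-envelope sketch of what that extra memory buys. The architecture numbers (80 layers, GQA with 8 KV heads, head dim 128) are taken from Qwen2.5-72B's published config and the bf16 cache size is assumed; treat the result as an estimate, not a measurement.

# Rough estimate of the extra KV-Cache capacity gained by raising
# gpu_memory_utilization from 0.90 to 0.95 on 8 x A100-80GB.
# Assumed architecture: 80 layers, 8 KV heads (GQA), head dim 128, bf16 cache.
NUM_GPUS = 8
GPU_MEM_GB = 80
EXTRA_FRACTION = 0.95 - 0.90          # ~4 GB more per GPU

num_layers, num_kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2

# K and V per token, summed over all layers (whole model, before TP sharding)
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

extra_bytes_total = NUM_GPUS * GPU_MEM_GB * EXTRA_FRACTION * 1024**3
extra_tokens = extra_bytes_total / kv_bytes_per_token

print(f"KV-Cache cost: {kv_bytes_per_token / 1024:.0f} KiB per token")
print(f"Extra cacheable tokens: {extra_tokens:,.0f}")

That works out to roughly 100k additional cacheable tokens, i.e. dozens of extra 1-2k-token requests can stay resident at once, which is where the batching headroom comes from.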

max_num_batched_tokens

# default (auto‑calculated)
# ...

# optimized
llm = LLM(model=model_path,
          max_num_batched_tokens=32768,
          max_num_seqs=256)

Setting the per-step batch token limit to 32768 allows larger batches; on 8 × A100 this yields roughly an 18 % throughput increase.
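
A quick A/B of this setting can reuse the benchmark script above, since run_benchmark forwards extra keyword arguments to the LLM constructor. The module name benchmark is an assumption (save the script as benchmark.py), and each configuration should run in a separate process so GPU memory is fully released between runs.

# A/B the batch token limit with the benchmark script above,
# assumed to be saved as benchmark.py. Run the default configuration
# in a separate invocation; two LLM instances in one process would OOM.
from benchmark import run_benchmark

run_benchmark(
    "/data/models/Qwen2.5-72B-Instruct",
    num_prompts=200,
    max_num_batched_tokens=32768,   # forwarded to LLM via **kwargs
    max_num_seqs=256,
)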

enable_chunked_prefill

# default
llm = LLM(model=model_path, enable_chunked_prefill=False)

# optimized
llm = LLM(model=model_path, enable_chunked_prefill=True)

Chunked prefill splits long prompts into smaller pieces that interleave with decode, preventing long requests from blocking short ones. In mixed‑length workloads this adds ~12 % throughput and reduces P99 latency by ~20 %.

Full optimized configuration

from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/models/Qwen2.5-72B-Instruct",
    tensor_parallel_size=8,
    gpu_memory_utilization=0.95,
    max_num_batched_tokens=32768,
    enable_chunked_prefill=True,
    max_num_seqs=256,
    max_model_len=8192,          # adjust if 32k context not needed
    trust_remote_code=True,
    dtype="bfloat16",            # A100/H100 benefit from bf16
    kv_cache_dtype="fp8_e5m2",   # optional experimental cache quantization
)
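
A quick smoke test against the engine configured above (the prompt is illustrative):

# Verify the optimized configuration loads and generates before wiring it
# into the serving stack.
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(
    ["Summarize the benefits of chunked prefill in two sentences."],
    sampling,
)
print(outputs[0].outputs[0].text)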

Deployment options

OpenAI‑compatible API server

#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-72B-Instruct \
    --served-model-name qwen2.5-72b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 32768 \
    --enable-chunked-prefill \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --dtype bfloat16 \
    --trust-remote-code \
    --disable-log-requests \
    --uvicorn-log-level warning
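
A minimal client check against the endpoint started above, assuming the openai Python package (v1.x) is installed and the server is reachable on localhost:8000 without an API key:

# Smoke-test the OpenAI-compatible endpoint. "qwen2.5-72b" matches
# the --served-model-name passed to the server above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen2.5-72b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)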

Systemd service

# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI API Server
After=network.target

[Service]
Type=simple
User=deploy
Group=deploy
WorkingDirectory=/opt/vllm
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7"
Environment="VLLM_LOGGING_LEVEL=WARNING"
ExecStart=/opt/vllm/venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-72B-Instruct \
    --served-model-name qwen2.5-72b \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.95 \
    --max-num-batched-tokens 32768 \
    --enable-chunked-prefill \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --dtype bfloat16 \
    --trust-remote-code
Restart=always
RestartSec=10
LimitNOFILE=65536
LimitNPROC=65536

[Install]
WantedBy=multi-user.target

Docker deployment

# Dockerfile.vllm
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
    python3.11 python3.11-venv python3-pip git curl wget && \
    rm -rf /var/lib/apt/lists/*

RUN python3.11 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir \
    vllm==0.6.4.post1 \
    torch==2.4.0 \
    "transformers>=4.45.0"

RUN mkdir -p /models
WORKDIR /app
COPY healthcheck.py /app/
EXPOSE 8000
ENTRYPOINT ["python","-m","vllm.entrypoints.openai.api_server"]
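
The Dockerfile copies a healthcheck.py whose contents are not shown above; a minimal sketch is below, assuming the container runs the OpenAI API server on port 8000 and relying on its /health endpoint:

#!/usr/bin/env python3
"""Minimal healthcheck.py sketch for the image above (contents assumed)."""
import sys
import urllib.request

# Exit 0 only if the in-container API server answers its /health endpoint.
try:
    with urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=5) as resp:
        sys.exit(0 if resp.status == 200 else 1)
except Exception:
    sys.exit(1)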

Kubernetes deployment

# vllm-deployment.yaml (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
      - name: vllm
        image: your-registry/vllm:0.6.4
        ports:
        - containerPort: 8000
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3,4,5,6,7"
        - name: VLLM_LOGGING_LEVEL
          value: "WARNING"
        args:
        - --model
        - /models/Qwen2.5-72B-Instruct
        - --served-model-name
        - qwen2.5-72b
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --tensor-parallel-size
        - "8"
        - --gpu-memory-utilization
        - "0.95"
        - --max-num-batched-tokens
        - "32768"
        - --enable-chunked-prefill
        - --max-num-seqs
        - "256"
        - --max-model-len
        - "8192"
        - --dtype
        - bfloat16
        - --trust-remote-code
        resources:
          limits:
            nvidia.com/gpu: 8
            memory: "320Gi"
            cpu: "32"
          requests:
            nvidia.com/gpu: 8
            memory: "256Gi"
            cpu: "16"
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 64Gi

Best practices

Model‑size specific tuning

7 B model (single A100): gpu_memory_utilization=0.92, max_num_batched_tokens=16384, max_num_seqs=512, enable_chunked_prefill=True

14 B model (2 × A100): gpu_memory_utilization=0.93, max_num_batched_tokens=24576, max_num_seqs=384, enable_chunked_prefill=True

72 B model (8 × A100): values shown in the full configuration above.
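
These presets can live in a small lookup table; the helper below is an illustrative sketch (the function and table names are ours, the values come from the list above):

# Illustrative preset table for the per-model-size settings listed above.
from vllm import LLM

TUNING_PRESETS = {
    "7b":  dict(tensor_parallel_size=1, gpu_memory_utilization=0.92,
                max_num_batched_tokens=16384, max_num_seqs=512,
                enable_chunked_prefill=True),
    "14b": dict(tensor_parallel_size=2, gpu_memory_utilization=0.93,
                max_num_batched_tokens=24576, max_num_seqs=384,
                enable_chunked_prefill=True),
    "72b": dict(tensor_parallel_size=8, gpu_memory_utilization=0.95,
                max_num_batched_tokens=32768, max_num_seqs=256,
                enable_chunked_prefill=True),
}

def build_llm(model_path: str, size: str) -> LLM:
    """Construct an LLM with the preset that matches the model size."""
    return LLM(model=model_path, trust_remote_code=True, **TUNING_PRESETS[size])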

KV‑Cache quantization

llm = LLM(model="Qwen2.5-72B-Instruct",
          tensor_parallel_size=8,
          kv_cache_dtype="fp8_e5m2")  # or "fp8_e4m3"

FP8 cache reduces memory usage by ~50 % and can increase throughput 20‑30 % with minor quality loss.

Prefix caching

llm = LLM(model="Qwen2.5-72B-Instruct",
          enable_prefix_caching=True)

When many requests share a long system prompt (e.g., 2000‑token knowledge base), first‑token latency drops from ~180 ms to ~45 ms.
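
A minimal usage sketch, assuming a shared knowledge-base prefix is prepended to every request (the prompt contents are placeholders):

# With enable_prefix_caching=True, KV-Cache blocks for the shared prefix
# are computed once and reused by later requests, which is where the
# first-token latency drop comes from.
from vllm import LLM, SamplingParams

llm = LLM(model="/data/models/Qwen2.5-72B-Instruct",
          tensor_parallel_size=8,
          enable_prefix_caching=True)

knowledge_base = "..."  # the shared ~2000-token system prompt
questions = ["What is our refund policy?", "How do I reset my password?"]
prompts = [f"{knowledge_base}\n\nQuestion: {q}\nAnswer:" for q in questions]

outputs = llm.generate(prompts, SamplingParams(max_tokens=128))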

Security hardening

Enable API‑key authentication via --api-key YOUR_KEY or the VLLM_API_KEY environment variable.

Place an external rate‑limiter (NGINX, API Gateway) in front of the server.

Validate prompt length and filter prohibited patterns before forwarding to the model.

High availability

Deploy multiple replicas behind a load balancer; use Kubernetes HorizontalPodAutoscaler based on vllm_num_requests_running or custom metrics.

Handle SIGTERM gracefully so in-flight requests can finish before the process exits; a wrapper sketch follows.
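
A hedged sketch of such a wrapper: it runs the API server as a child process, forwards SIGTERM, and allows a drain window before force-killing. The drain window and paths are assumptions, and whether in-flight requests actually complete still depends on the server's own shutdown behaviour.

#!/usr/bin/env python3
"""Graceful-shutdown wrapper sketch for the vLLM API server."""
import signal
import subprocess
import sys

DRAIN_SECONDS = 120  # assumed drain window for in-flight requests

cmd = [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", "/data/models/Qwen2.5-72B-Instruct",
    "--tensor-parallel-size", "8",
]
proc = subprocess.Popen(cmd)

def handle_sigterm(signum, frame):
    # Forward the signal, then give the server time to drain before killing it.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=DRAIN_SECONDS)
    except subprocess.TimeoutExpired:
        proc.kill()

signal.signal(signal.SIGTERM, handle_sigterm)
sys.exit(proc.wait())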

Common error handling

CUDA OOM: lower gpu_memory_utilization, reduce max_num_batched_tokens, or enable KV-Cache quantization.

Tensor-parallel NCCL errors: verify the GPU interconnect, match NCCL versions, set NCCL_DEBUG=INFO, and set NCCL_IB_DISABLE=1 if InfiniBand is unavailable.

Request timeouts: enable enable_chunked_prefill, increase the client timeout, and monitor vllm_num_requests_waiting.

Model loading failures: ensure the model directory contains pytorch_model.bin or model.safetensors, pass --trust-remote-code, and keep the transformers version compatible.

Monitoring and observability

vLLM exposes Prometheus metrics at /metrics. Key metrics include:

vllm_num_requests_running (Gauge) – active requests.

vllm_num_requests_waiting (Gauge) – queued requests.

vllm_gpu_cache_usage_perc (Gauge) – KV-Cache utilization.

vllm_request_success_total (Counter) – total successful requests.

vllm_time_to_first_token_seconds (Histogram) – time to first token (TTFT).

vllm_time_per_output_token_seconds (Histogram) – per-token latency.

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm-server:8000']
    metrics_path: /metrics
    scrape_interval: 15s

Backup and recovery

Model files: rsync -avz /data/models/ backup:/backup/models/ or rclone sync /data/models/ s3:bucket/models/.

Service files: archive /etc/systemd/system/vllm.service, startup scripts, and any reverse‑proxy configs.

Monitoring configs: export Prometheus and Grafana configmaps.

Result

Adjusting the three parameters – gpu_memory_utilization (0.9 → 0.95), max_num_batched_tokens (auto → 32768), and enable_chunked_prefill (false → true) – yields roughly a 40 % throughput increase for a 72 B model on eight A100 GPUs while also reducing latency. The guide provides end‑to‑end installation, deployment (bare metal, Docker, Kubernetes), security hardening, high‑availability patterns, troubleshooting, and observability instructions.
