How to Maximize vLLM Throughput: Batch Size, Quantization, and Monitoring Tips
This guide explains how to unleash vLLM’s full potential by optimizing batch size, leveraging 4‑bit quantization, tuning concurrency parameters, planning capacity with token‑per‑second metrics, and implementing robust monitoring to balance latency, cost, and scalability in production deployments.
Why vLLM Matters
vLLM provides an OpenAI‑compatible API powered by continuous batching and PagedAttention, which efficiently reuses KV cache pages, maximizes GPU utilization, and enables streaming token output.
Batch Size Trade‑offs
Larger batches increase throughput but raise tail latency. Choose batch size based on workload: chat-style interactions need a low time to first token (TTFT) and therefore smaller batches, while batch jobs or RAG pipelines can trade a higher TTFT for more throughput.
Limit the maximum tokens per request at the gateway to prevent a single large request from blocking the queue. Prefer many medium‑sized prompts over a few gigantic ones; continuous batching performs best with regular shapes. Classify prompts by expected output length (short/medium/long) and run separate vLLM workers for each class to stabilize latency.
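The gateway-side cap and length classification described above can be sketched as follows. This is a minimal illustration, not part of vLLM: the thresholds, the pool names, and the `route` helper are all hypothetical and should be tuned to the output lengths you actually observe.

```python
from dataclasses import dataclass

# Hypothetical thresholds; adjust to your own traffic.
SHORT_MAX = 256
MEDIUM_MAX = 1024
HARD_CAP = 2048  # gateway-wide ceiling on max_tokens


@dataclass
class Routed:
    pool: str        # which worker pool should serve the request
    max_tokens: int  # max_tokens after the gateway cap is applied


def route(requested_max_tokens: int) -> Routed:
    """Clamp max_tokens to the gateway cap, then pick a pool by length class."""
    capped = max(1, min(requested_max_tokens, HARD_CAP))
    if capped <= SHORT_MAX:
        pool = "short"
    elif capped <= MEDIUM_MAX:
        pool = "medium"
    else:
        pool = "long"
    return Routed(pool=pool, max_tokens=capped)
```

Clamping before classifying means an oversized request still lands in the long pool with a bounded budget instead of being able to block a short-task queue.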
Prefix Cache Benefits
When multiple requests share the same prefix (system prompts, few-shot examples, or retrieval-generated instructions), vLLM can reuse the KV cache for that prefix and skip recomputing it during prefill, which makes the shared portion close to free.
To capture this benefit, standardize system prompts across tenants, keep few-shot examples byte-identical, and place variable user input after the shared prefix rather than inside it. In RAG scenarios, put the template and instructions first so they can be cached, and append only the retrieved facts per request.
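A small sketch of the prompt layout this implies: static parts first, per-request parts last. The prompt strings and both helper functions are hypothetical, and the character-level comparison is only a rough proxy, since the real cache works on tokens rather than characters.

```python
SYSTEM_PROMPT = "You are a concise support assistant."  # hypothetical shared prompt
FEW_SHOT = "Q: How do I reset my password?\nA: Use the account settings page.\n"


def build_prompt(retrieved_facts: list[str], user_input: str) -> str:
    """Static parts first, per-request parts last, so the prefix stays identical."""
    static_prefix = f"{SYSTEM_PROMPT}\n\n{FEW_SHOT}\n"
    dynamic_suffix = "Facts:\n" + "\n".join(retrieved_facts) + f"\n\nUser: {user_input}\n"
    return static_prefix + dynamic_suffix


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring, a rough proxy for the reusable prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Comparing two prompts built for different users with `shared_prefix_chars` is a quick way to check that nothing tenant-specific (a timestamp, a user name) has crept into the front of the template.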
Quantization as a Performance Booster
Applying 4-bit weight quantization with AWQ or RTN (round-to-nearest) cuts weight memory to roughly a quarter of FP16, usually with only a small perplexity increase, which makes it a common default for server-side deployments. The KV cache can also be quantized, allowing more concurrent sequences at the cost of slight quality loss on very long generations.
Quantize when GPU memory is tight or the scheduler cannot fit enough sequences; the parallelism gains usually outweigh the minor quality drop relative to full-precision weights.
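To make the memory argument concrete, here is a back-of-the-envelope estimate of weight footprint. It assumes a round 7 billion parameters and ignores quantization scales, zero points, and any layers kept in higher precision, so real 4-bit checkpoints will be somewhat larger.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 1024**3


fp16_gib = weight_memory_gib(7e9, 16)  # about 13.0 GiB
int4_gib = weight_memory_gib(7e9, 4)   # about 3.3 GiB
print(f"FP16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

The memory that quantization frees goes to the KV cache, which is what lets the scheduler fit more sequences.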
Key Concurrency Parameters
--max-num-seqs limits the number of concurrent sequences. On A100 GPUs, start around 64-128 and increase until TTFT degrades.
--max-model-len should not be set to the model's theoretical maximum unless required; a smaller limit caps the KV cache each sequence can claim and allows higher parallelism.
--tensor-parallel-size splits a large model across multiple GPUs; fast interconnects (NVLink) are essential, and batches must be large enough to hide communication overhead.
--gpu-memory-utilization sets the fraction of GPU memory vLLM may use; leave 10-15% headroom for traffic spikes to avoid OOM.
Never assume the scheduler will auto‑tune everything; empirical testing is mandatory.
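As a starting point for that testing, the interaction between context length and concurrency can be estimated with simple arithmetic. The sketch below assumes an FP16 KV cache and a 7B-class model with grouped-query attention (28 layers, 4 KV heads, head dimension 128); the 10 GiB KV budget is an illustrative figure, not a measured one.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Keys and values (factor 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_full_length_seqs(kv_budget_gib: float, max_model_len: int, per_token: int) -> int:
    """How many sequences fit if every one of them grows to max_model_len."""
    return int(kv_budget_gib * 1024**3 // (max_model_len * per_token))


# Assumed model shape: 28 layers, 4 KV heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128)
for max_len in (4096, 32768):
    seqs = max_full_length_seqs(kv_budget_gib=10, max_model_len=max_len, per_token=per_token)
    print(f"max_model_len={max_len}: {seqs} full-length sequences")
```

This is a worst-case bound. PagedAttention allocates cache on demand, so real concurrency is usually higher because most sequences never reach the limit, but the ratio between the two settings shows why an oversized limit is costly.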
Capacity Planning by Token Rate
Plan capacity using input + output token‑per‑second rather than QPS. Compute the sustained token rate C for a chosen batch shape on a single GPU; total capacity ≈ GPU count × C × utilization. Keep utilization between 70‑85% to absorb spikes; beyond that, horizontal scaling is needed.
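The formula above translates directly into a small calculator. The per-GPU rate of 2,500 tokens per second and the 20,000 tokens per second target are placeholder numbers; substitute the sustained rate you measure for your own batch shape.

```python
import math


def cluster_capacity(gpu_count: int, tokens_per_sec_per_gpu: float, utilization: float) -> float:
    """Total capacity = GPU count x C x utilization."""
    return gpu_count * tokens_per_sec_per_gpu * utilization


def gpus_needed(target_tokens_per_sec: float, tokens_per_sec_per_gpu: float, utilization: float) -> int:
    """Smallest GPU count whose planned capacity covers the target rate."""
    return math.ceil(target_tokens_per_sec / (tokens_per_sec_per_gpu * utilization))


# Placeholder measurements: C = 2500 tokens/s per GPU, planned utilization 75%.
print(cluster_capacity(4, 2500, 0.75))   # capacity of a 4-GPU pool
print(gpus_needed(20000, 2500, 0.75))    # GPUs required for 20k tokens/s
```

Planning at 75% rather than 100% is what leaves room for the spikes mentioned above.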
Production Configuration Example
docker run --gpus all --rm -p 8000:8000 \
-v /models:/models \
vllm/vllm-openai:latest \
--model /models/Qwen2.5-7B-Instruct-AWQ \
--served-model-name Qwen2.5-7B-Instruct-AWQ \
--dtype auto \
--tensor-parallel-size 1 \
--max-num-seqs 128 \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--enforce-eager
The AWQ model uses 4-bit weight quantization for high deployment density. --served-model-name exposes the model under the short name that API clients send, instead of the full path. The --enforce-eager flag skips CUDA-graph capture, which avoids a long warm-up under mixed traffic; remove it when you want the decode speedups that CUDA graphs provide. --trust-remote-code is a switch rather than a key-value option and is off by default, so it is simply left out, which maintains security in multi-tenant environments.
OpenAI‑Compatible Request Example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-7B-Instruct-AWQ",
"messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
"stream": true,
"max_tokens": 128,
"temperature": 0.3
}'
Scheduling and Fairness
Separate short‑generation and long‑generation workloads into two pools: a short‑task priority pool and a long‑task pool. Keep TTFT reasonable while preserving batch throughput. Perform admission control at the gateway, rate‑limit tokens per tenant, and let vLLM focus on batching without handling flow control. Implement back‑pressure: set server‑side timeouts and maximum queue lengths to prevent slow consumers from stalling streaming output.
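Per-tenant rate limiting in tokens (rather than requests) can be sketched with a classic token bucket at the gateway. This is an illustrative class, not a vLLM feature; the `now` parameter exists only so the behavior can be tested deterministically.

```python
import time
from typing import Optional


class TokenBucket:
    """Per-tenant budget measured in LLM tokens, refilled at a steady rate."""

    def __init__(self, rate_per_sec: float, capacity: float, now: Optional[float] = None):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.level = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def try_consume(self, tokens: float, now: Optional[float] = None) -> bool:
        """Admit the request if the tenant has enough budget, otherwise reject it."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        if tokens <= self.level:
            self.level -= tokens
            return True
        return False
```

A gateway would keep one bucket per tenant and charge each request its prompt length plus its capped `max_tokens`, rejecting or queueing the request when `try_consume` returns False.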
RAG Token Length Considerations
For 7B models, keep the context around 2-3k tokens; attention cost grows quadratically with context length, while the quality gains from longer contexts are limited. After retrieval, prune near-duplicate chunks and retain high-scoring sentences. A static prefix combined with dynamic facts yields the highest prefix-cache hit rate.
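One simple way to prune near-duplicate chunks is word-set Jaccard similarity with a token budget, sketched below. The 0.8 threshold, the 2,000-token budget, and the whitespace-based token estimate are assumptions for illustration; a production pipeline would more likely use the model's tokenizer and possibly embedding similarity.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two chunks, from 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def prune_chunks(scored_chunks: list[tuple[float, str]],
                 threshold: float = 0.8,
                 token_budget: int = 2000) -> list[str]:
    """Keep the highest-scoring chunks, skipping near-duplicates and budget overruns."""
    kept: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        if any(jaccard(text, k) >= threshold for k in kept):
            continue  # too similar to a chunk we already kept
        cost = len(text.split())  # crude whitespace token estimate
        if used + cost > token_budget:
            continue
        kept.append(text)
        used += cost
    return kept
```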
Monitoring Essentials
Dashboards should display at least: p50 and p95 TTFT, input/output/total token‑per‑second, active sequence count, KV cache utilization, batch‑size distribution over time, scheduler queue length, admission‑reject rate, OOM and eviction events.
When active sequences saturate or KV cache utilization approaches 100%, p95 TTFT spikes. That indicates a capacity limit: scale out or reduce --max-model-len.
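For teams computing these numbers from their own request logs, a minimal sketch of the percentile and saturation check might look like this. The nearest-rank percentile definition, the 90% KV threshold, and the 1-second TTFT objective are all assumptions; the helpers are hypothetical and not part of vLLM.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_saturated(kv_cache_usage: float, p95_ttft_s: float,
                 kv_limit: float = 0.9, ttft_slo_s: float = 1.0) -> bool:
    """Flag the capacity-limit pattern: a nearly full KV cache together with slow first tokens."""
    return kv_cache_usage >= kv_limit and p95_ttft_s > ttft_slo_s


# Hypothetical TTFT samples in seconds: mostly fast, with a slow tail.
ttft = [0.2] * 18 + [1.5, 3.0]
p50, p95 = percentile(ttft, 50), percentile(ttft, 95)
print(p50, p95, is_saturated(kv_cache_usage=0.97, p95_ttft_s=p95))
```

Requiring both conditions avoids alerting on a full cache that is still serving first tokens quickly.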
Common Pitfalls and Solutions
Setting --max-model-len too high lets each sequence claim a very large share of the KV cache and hurts parallelism; set a reasonable value and enable long context only when necessary.
Randomized prompts per tenant prevent prefix reuse; standardize templates to enable caching.
Unrestricted max tokens allow a single request to monopolize the scheduler; enforce limits at the endpoint.
If a single worker leaves the GPU idle, add more workers behind an intelligent gateway and shard traffic by token-length bucket.
Conclusion
The core value of vLLM lies not in prompt engineering but in keeping the GPU continuously busy. Treat tokens as a budget, design reusable prefixes, avoid oversized context windows, set realistic concurrency limits, and throughput will rise without sudden failures.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
