Hands‑On LLM Local Deployment: vLLM Inference Optimizations Explained
The article explains why LLM inference is memory‑bound, introduces vLLM’s three core optimizations—Continuous Batching, PagedAttention, and Prefix Caching—shows how to launch a vLLM server, run Python code to benchmark performance, and examines KV‑Cache memory usage with concrete numbers.
Why inference needs optimization
LLM inference requires a full forward pass for every generated token, moving the entire model weight from GPU memory to compute units each time. When serving a single request, tensor cores spend most of the time waiting for data transfer, analogous to a truck delivering only one parcel per trip.
Batching solves this by grouping multiple requests so the model weights are read once and applied to many users, increasing work per memory load.
Three core techniques (the "vLLM three‑blade")
Continuous Batching
Static batching stalls the whole batch until the longest request finishes. Continuous Batching schedules at token granularity: as soon as a request finishes a token, its slot is immediately filled by a new request, keeping the GPU fully occupied.
PagedAttention
Traditional KV‑Cache allocation reserves a large contiguous block per request (e.g., 2048 slots), causing internal fragmentation (unused slots), external fragmentation (gaps between blocks), and overallocation. The vLLM paper reports only 20‑40 % of KV‑Cache memory holds useful data.
PagedAttention splits the KV‑Cache into fixed‑size blocks scattered across memory, tracked by a Block Table, similar to OS virtual memory paging.
Allocate on demand, using exactly what is needed.
Blocks need not be contiguous, eliminating fragmentation.
Blocks are released immediately when a request ends, allowing reuse.
More concurrent requests fit into the same memory budget.
Prefix Caching
Many applications share the same system prompt (e.g., a 500‑token instruction for an AI assistant). Without Prefix Caching each request recomputes the KV‑Cache for the shared prefix, wasting compute. With Prefix Caching the shared prefix is computed once and reused.
Typical scenarios:
Multiple users share a system prompt – compute once, use everywhere.
Multi‑turn conversations – later turns reuse the cached prefix from earlier turns.
The course reports a 75 % cache‑hit rate yields roughly a 4× throughput increase.
Hands‑on: launching a vLLM service
Running a vLLM server requires a single command:
vllm serve Qwen/Qwen3-0.6B --dtype=bfloat16 --max-model-len 4096Parameters: vllm serve: starts the built‑in inference server with PagedAttention, Continuous Batching, and Prefix Caching enabled, listening on port 8000. Qwen/Qwen3-0.6B: model identifier on Hugging Face; the model is downloaded automatically on first run. --dtype=bfloat16: loads weights in BF16 precision. --max-model-len 4096: sets the context window, which vLLM uses to size the KV‑Cache block pool.
The server exposes an OpenAI‑compatible HTTP API. Using the official OpenAI Python SDK requires only a dummy API key:
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")Inspecting token confidence (logprobs)
vLLM can return per‑token log probabilities. Example request with logprobs=True and top_logprobs=5 shows the model assigns a 92.5 % confidence to "Paris" when asked "The capital of France is".
Experiment: Continuous Batching performance
Five concurrent prompts are sent in parallel. The script measures total elapsed time and prints per‑request token usage. Observation: total time is far less than five times the serial execution time because Continuous Batching packs the requests into a single batch.
Experiment: Prefix Caching impact
Five requests share the same system prompt. The /metrics endpoint shows the prefix_cache_queries counter rising from 235 to 550, confirming that the system prompt KV‑Cache is reused, saving recomputation.
KV‑Cache memory accounting (example with Qwen3‑0.6B)
Per‑token KV‑Cache size is calculated as:
num_layers = 28
num_kv_heads = 8 # GQA: 16 Q heads, 8 KV heads
head_dim = 128
dtype_bytes = 2 # BF16
per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes # ≈ 112 KB/tokenResulting memory usage for different context lengths:
64 tokens → 7 MB
256 tokens → 28 MB
1024 tokens → 112 MB
4096 tokens → 448 MB
Ten concurrent requests with a 4096‑token context consume about 4.4 GB of GPU memory, highlighting why efficient KV‑Cache management is critical.
vLLM ecosystem
As of January 2025, vLLM has been installed over 100 k times, a ten‑fold increase in 2024, and is among the AI/ML projects with the most GitHub contributors. It supports models such as Llama, Qwen, DeepSeek, Gemma, Mistral, Granite; hardware including NVIDIA GPUs, AMD Instinct, Intel Gaudi, Google TPU, AWS Neuron, IBM Spyre; and deployment environments ranging from edge to private and public clouds.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
