11 min read

Hands‑On LLM Local Deployment: vLLM Inference Optimizations Explained

The article explains why LLM inference is memory‑bound, introduces vLLM’s three core optimizations—Continuous Batching, PagedAttention, and Prefix Caching—shows how to launch a vLLM server, run Python code to benchmark performance, and examines KV‑Cache memory usage with concrete numbers.

Old Zhang's AI Learning

Jun 7, 2026

Hands‑On LLM Local Deployment: vLLM Inference Optimizations Explained

Why inference needs optimization

LLM inference requires a full forward pass for every generated token, moving the entire model weight from GPU memory to compute units each time. When serving a single request, tensor cores spend most of the time waiting for data transfer, analogous to a truck delivering only one parcel per trip.

Batching solves this by grouping multiple requests so the model weights are read once and applied to many users, increasing work per memory load.

Three core techniques (the "vLLM three‑blade")

Continuous Batching

Static batching stalls the whole batch until the longest request finishes. Continuous Batching schedules at token granularity: as soon as a request finishes a token, its slot is immediately filled by a new request, keeping the GPU fully occupied.

PagedAttention

Traditional KV‑Cache allocation reserves a large contiguous block per request (e.g., 2048 slots), causing internal fragmentation (unused slots), external fragmentation (gaps between blocks), and overallocation. The vLLM paper reports only 20‑40 % of KV‑Cache memory holds useful data.

PagedAttention splits the KV‑Cache into fixed‑size blocks scattered across memory, tracked by a Block Table, similar to OS virtual memory paging.

Allocate on demand, using exactly what is needed.

Blocks need not be contiguous, eliminating fragmentation.

Blocks are released immediately when a request ends, allowing reuse.

More concurrent requests fit into the same memory budget.

Prefix Caching

Many applications share the same system prompt (e.g., a 500‑token instruction for an AI assistant). Without Prefix Caching each request recomputes the KV‑Cache for the shared prefix, wasting compute. With Prefix Caching the shared prefix is computed once and reused.

Typical scenarios:

Multiple users share a system prompt – compute once, use everywhere.

Multi‑turn conversations – later turns reuse the cached prefix from earlier turns.

The course reports a 75 % cache‑hit rate yields roughly a 4× throughput increase.

Hands‑on: launching a vLLM service

Running a vLLM server requires a single command:

vllm serve Qwen/Qwen3-0.6B --dtype=bfloat16 --max-model-len 4096

Parameters: vllm serve: starts the built‑in inference server with PagedAttention, Continuous Batching, and Prefix Caching enabled, listening on port 8000. Qwen/Qwen3-0.6B: model identifier on Hugging Face; the model is downloaded automatically on first run. --dtype=bfloat16: loads weights in BF16 precision. --max-model-len 4096: sets the context window, which vLLM uses to size the KV‑Cache block pool.

The server exposes an OpenAI‑compatible HTTP API. Using the official OpenAI Python SDK requires only a dummy API key:

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

Inspecting token confidence (logprobs)

vLLM can return per‑token log probabilities. Example request with logprobs=True and top_logprobs=5 shows the model assigns a 92.5 % confidence to "Paris" when asked "The capital of France is".

Experiment: Continuous Batching performance

Five concurrent prompts are sent in parallel. The script measures total elapsed time and prints per‑request token usage. Observation: total time is far less than five times the serial execution time because Continuous Batching packs the requests into a single batch.

Experiment: Prefix Caching impact

Five requests share the same system prompt. The /metrics endpoint shows the prefix_cache_queries counter rising from 235 to 550, confirming that the system prompt KV‑Cache is reused, saving recomputation.

KV‑Cache memory accounting (example with Qwen3‑0.6B)

Per‑token KV‑Cache size is calculated as:

num_layers = 28
num_kv_heads = 8  # GQA: 16 Q heads, 8 KV heads
head_dim = 128
dtype_bytes = 2  # BF16
per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # ≈ 112 KB/token

Resulting memory usage for different context lengths:

64 tokens → 7 MB

256 tokens → 28 MB

1024 tokens → 112 MB

4096 tokens → 448 MB

Ten concurrent requests with a 4096‑token context consume about 4.4 GB of GPU memory, highlighting why efficient KV‑Cache management is critical.

vLLM ecosystem

As of January 2025, vLLM has been installed over 100 k times, a ten‑fold increase in 2024, and is among the AI/ML projects with the most GitHub contributors. It supports models such as Llama, Qwen, DeepSeek, Gemma, Mistral, Granite; hardware including NVIDIA GPUs, AMD Instinct, Intel Gaudi, Google TPU, AWS Neuron, IBM Spyre; and deployment environments ranging from edge to private and public clouds.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Python vLLM llm-inference Continuous Batching KV cache PagedAttention prefix caching

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.