Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize
This guide explains what vLLM is and how its PagedAttention architecture boosts LLM throughput, gives step‑by‑step installation commands, walks through core examples for text generation, chat, embedding, and classification, and covers advanced performance features such as quantization, LoRA support, and distributed parallelism.
1. What is vLLM
vLLM is a high‑throughput, memory‑efficient inference engine for large language models (LLMs) that makes model serving fast, simple, and accessible. Developed originally by the Sky Computing Lab at UC Berkeley, it is now a community‑driven project under the PyTorch Foundation.
2. Why vLLM Improves Efficiency
vLLM acts like a highly efficient restaurant service system for AI models. Before vLLM, serving stacks typically reserved a large contiguous chunk of GPU memory for each request's KV cache and processed requests in static batches, so much of the GPU sat idle or was wasted on padding. vLLM introduces PagedAttention, which splits the KV cache into fixed‑size pages, and pairs it with continuous batching so the GPU stays busy (a toy sketch of the idea follows the list below).
Pages store each request's data flexibly, regardless of its length.
Memory utilization is maximized with minimal waste.
The same hardware can now serve dozens to hundreds of requests simultaneously.
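As a rough sketch only (toy code, not vLLM's actual data structures), PagedAttention can be pictured as a per-request page table that maps tokens onto fixed-size pages drawn from one shared pool:

BLOCK_SIZE = 16                    # tokens per page (illustrative value)
free_pages = list(range(1024))     # shared pool of fixed-size KV-cache pages

class Request:
    def __init__(self, rid):
        self.rid = rid
        self.num_tokens = 0
        self.page_table = []       # logical block index -> physical page id

    def append_token(self):
        # A new page is allocated only when the current one fills up,
        # so at most one partially used page per request is ever wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.page_table.append(free_pages.pop())
        self.num_tokens += 1

# Requests of very different lengths share the same pool without fragmentation.
requests = [Request(i) for i in range(3)]
for req, length in zip(requests, [5, 40, 17]):
    for _ in range(length):
        req.append_token()
for req in requests:
    print(req.rid, req.num_tokens, req.page_table)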
vLLM integrates seamlessly with mainstream open‑source models on Hugging Face and other platforms, including Transformer‑style LLMs (Llama, Mistral), mixture‑of‑experts models (Mixtral, DeepSeek‑V2/V3), embedding models (E5‑Mistral), and multimodal models (LLaVA).
3. Installation & Deployment
Before starting, ensure you have:
Linux
Python 3.10‑3.13
NVIDIA GPU (recommended) or another supported platform
Use uv for fast environment management:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
To pin a specific CUDA version instead, run:
# Install vLLM built against a specific CUDA version (11.8, 12.6, or 12.8)
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=118  # or 126, 128
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
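A quick sanity check that the install succeeded (simply prints the installed version):

import vllm
print(vllm.__version__)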
4. Core Examples
4.1 Basic Text Generation
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated_text!r}")
4.2 Chat Completion
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
sampling_params = llm.get_default_sampling_params()
conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {"role": "user", "content": "Write an essay about higher education."},
]
outputs = llm.chat(conversation, sampling_params)
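llm.chat returns the same RequestOutput objects as llm.generate, so the reply text from the conversation above can be read the same way:

for output in outputs:
    print(output.outputs[0].text)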
4.3 Text Embedding
from vllm import LLM
llm = LLM(model="intfloat/e5-small", runner="pooling", enforce_eager=True)
prompts = ["Hello, my name is", "The president of the United States is"]
outputs = llm.embed(prompts)
for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    print(f"Prompt: {prompt!r}")
    print(f"Embeddings size: {len(embeds)}")
4.4 Text Classification
from vllm import LLM
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling", enforce_eager=True)
prompts = ["Sample text for classification"]
outputs = llm.classify(prompts)
for output in outputs:
    probs = output.outputs.probs
    print(f"Class probabilities: {probs}")
5. Performance Capabilities
Quantization Support
GPTQ, AWQ, and AutoRound quantization (loading sketch after this list)
INT4, INT8, FP8 precision
Hardware‑specific optimizations
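As a sketch, loading a pre-quantized checkpoint just means pointing vLLM at the quantized weights (the model name below is illustrative; many checkpoints are detected automatically, but the quantization argument makes the intent explicit):

from vllm import LLM

# Load an AWQ-quantized checkpoint (illustrative model name).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
outputs = llm.generate(["The capital of France is"])
print(outputs[0].outputs[0].text)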
Advanced Features
Prefix caching for reuse across similar requests
Multi‑LoRA support for efficient fine‑tuning
Streaming output for real‑time token generation
OpenAI‑compatible API (streaming client sketch after this list)
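As a sketch of the last two bullets, a server started separately (for example with "vllm serve meta-llama/Llama-3.2-1B-Instruct") can be queried with the standard openai client, including token streaming; the model name and port below are illustrative defaults:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)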
Distributed Computing
Tensor parallelism: model sharding across devices (see the sketch after this list)
Pipeline parallelism: sequential processing across devices
Data parallelism: replicate processing to boost throughput
Expert parallelism: dedicated routing for MoE models
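As a sketch, tensor parallelism is enabled with a single constructor argument (assumes two visible GPUs; the model name is illustrative):

from vllm import LLM

# Shard the model's weights across 2 GPUs (tensor parallelism).
# pipeline_parallel_size would additionally split layers into sequential stages.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", tensor_parallel_size=2)
print(llm.generate(["Hello"])[0].outputs[0].text)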