Unlock High‑Throughput LLM Inference with vLLM: Install, Run, and Optimize

This guide explains what vLLM is and how its PagedAttention architecture boosts LLM throughput, walks through installation step by step, showcases core examples for text generation, chat, embedding, and classification, and details advanced performance features such as quantization, LoRA support, and distributed parallelism.


1. What is vLLM

vLLM is a high‑throughput, memory‑efficient inference engine for large language models (LLMs) that makes model serving fast, simple, and accessible. Developed originally by the Sky Computing Lab at UC Berkeley, it is now a community‑driven project under the PyTorch Foundation.

vLLM architecture illustration

2. Why vLLM Improves Efficiency

vLLM acts like a highly efficient restaurant service system for AI models. Before vLLM, serving stacks typically reserved a large, contiguous chunk of GPU memory for each request's KV cache (e.g., for “write a poem about the moon”) and processed requests in small, static batches, leaving much of the GPU idle or holding wasted memory. vLLM introduces PagedAttention, which splits that memory into fixed-size pages, and pairs it with continuous batching so the GPU stays busy (a toy sketch of the paging idea follows the list below):

Regardless of request length, pages flexibly store the data.

Memory utilization is maximized with minimal waste.

The same hardware can now prepare dozens to hundreds of requests simultaneously.
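
To make the paging idea concrete, here is a toy Python sketch of handing out fixed-size KV-cache pages to requests of different lengths. It illustrates the concept only and is not vLLM's actual implementation:

BLOCK_SIZE = 16  # tokens per KV-cache page (illustrative)

class BlockPool:
    """Toy pool of fixed-size KV-cache pages shared by all requests."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self, num_tokens):
        # A request takes only as many pages as its tokens need.
        needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("no free pages; the request must wait")
        blocks = self.free_blocks[:needed]
        self.free_blocks = self.free_blocks[needed:]
        return blocks

    def release(self, blocks):
        # Finished requests return their pages for immediate reuse.
        self.free_blocks.extend(blocks)

pool = BlockPool(num_blocks=64)
short_req = pool.allocate(10)   # 1 page
long_req = pool.allocate(500)   # 32 pages
pool.release(short_req)         # freed pages serve the next request right away
print(f"{len(pool.free_blocks)} pages free")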

vLLM integrates seamlessly with mainstream open‑source models from Hugging Face and other platforms, including Transformer‑style LLMs (Llama, Mistral), mixture‑of‑experts models (Mixtral, DeepSeek‑V2/V3), embedding models (E5‑Mistral), and multimodal models (LLaVA).

3. Installation & Deployment

Before starting, ensure you have:

Linux

Python 3.10‑3.13

NVIDIA GPU (recommended) or another supported platform

Use uv for fast environment management:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

For specific CUDA versions, run:

# Install a build for a specific CUDA version (11.8, 12.6, or 12.8)
export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/releases/latest | jq -r .tag_name | sed 's/^v//')
export CUDA_VERSION=118  # or 126, 128
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
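
A quick way to confirm the install succeeded is to import vLLM and print its version:

import vllm
print(vllm.__version__)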

4. Core Examples

4.1 Basic Text Generation

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Output: {generated_text!r}")

4.2 Chat Completion

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")
sampling_params = llm.get_default_sampling_params()

conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {"role": "user", "content": "Write an essay about higher education."}
]
outputs = llm.chat(conversation, sampling_params)
# Print the assistant's reply for each conversation in the batch.
for output in outputs:
    print(output.outputs[0].text)

4.3 Text Embedding

from vllm import LLM

llm = LLM(model="intfloat/e5-small", runner="pooling", enforce_eager=True)
prompts = ["Hello, my name is", "The president of the United States is"]
outputs = llm.embed(prompts)
for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    print(f"Prompt: {prompt!r}")
    print(f"Embeddings size: {len(embeds)}")

4.4 Text Classification

from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling", enforce_eager=True)
prompts = ["Sample text for classification"]
outputs = llm.classify(prompts)
for output in outputs:
    probs = output.outputs.probs
    print(f"Class probabilities: {probs}")

5. Performance Capabilities

Quantization Support

GPTQ, AWQ, and AutoRound quantization (a loading sketch follows this list)

INT4, INT8, FP8 precision

Hardware‑specific optimizations
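
As an example, loading a pre-quantized checkpoint usually needs only the `quantization` argument; the model name below is illustrative, not taken from this article:

from vllm import LLM

# Load an AWQ-quantized checkpoint (model name is illustrative).
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
outputs = llm.generate(["Explain PagedAttention in one sentence."])
print(outputs[0].outputs[0].text)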

Advanced Features

Prefix caching for KV‑cache reuse across requests that share a prompt prefix

Multi‑LoRA support for serving many fine‑tuned adapters on one base model

Streaming output for real‑time token generation

OpenAI‑compatible API (a client sketch follows this list)
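
As one example, the server can be started with `vllm serve <model>` and then queried with the standard OpenAI Python client; the URL, port, and model name below are assumptions for illustration:

from openai import OpenAI

# Query a local vLLM server started with, e.g.,
#   vllm serve meta-llama/Llama-3.2-1B-Instruct
# The base URL and model name are illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)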

Distributed Computing

Tensor parallelism: shard a model's weights across devices (a one‑line sketch follows this list)

Pipeline parallelism: split a model's layers into stages processed sequentially across devices

Data parallelism: replicate the model to serve more requests in parallel

Expert parallelism: distribute the experts of mixture‑of‑experts (MoE) models across devices
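
For instance, tensor parallelism is enabled with a single argument; the model name and GPU count below are illustrative:

from vllm import LLM

# Shard a large model across 4 GPUs with tensor parallelism
# (model name and tensor_parallel_size are illustrative).
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)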

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
