Deploy High‑Performance Local LLMs with vLLM: A Step‑by‑Step Guide
This article walks through installing and configuring vLLM for local large language model inference, compares it with Ollama and LM Studio, details environment setup, model download, testing scripts, and shows how to expose an OpenAI‑compatible API for production use.
vLLM Overview
vLLM is an open‑source high‑performance inference engine for large language models (LLMs). It implements PagedAttention with continuous batching, supports multiple quantization formats (GPTQ, AWQ, INT4/8, FP8), and provides multi‑GPU parallelism (tensor, pipeline, data, expert). The library integrates with HuggingFace, offers an OpenAI‑compatible API, and supports LoRA adapters.
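These capabilities map to a handful of CLI flags. As a hedged illustration (the AWQ checkpoint name and GPU count are assumptions, not from this article), quantization and tensor parallelism combine in a single serve command:
# Illustrative only: serve an AWQ-quantized checkpoint across 2 GPUs
vllm serve Qwen/Qwen3-8B-AWQ --quantization awq --tensor-parallel-size 2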
Tool Comparison
vLLM: Production‑grade inference with high throughput; requires an NVIDIA GPU and a more complex setup.
Ollama: Lightweight local tool with one‑click installation; lower performance and fewer features.
LM Studio: Desktop UI with a built‑in model marketplace; resource‑heavy and closed source.
Environment Preparation
Hardware: NVIDIA GPU with ≥20 GB VRAM, ≥16 GB RAM, ≥50 GB SSD.
Software: Linux/macOS/Windows, Python 3.8–3.12, CUDA 11.8+ (NVIDIA GPUs only; not applicable on macOS), package manager uv or pip.
Test platform: macOS 15.6, Python 3.12, uv 0.7.3, PyTorch 2.0+, ModelScope (recommended in China) or HuggingFace.
Project Initialization
01 Create project and install dependencies
mkdir vllm-rag
cd vllm-rag
uv init --python 3.12
source .venv/bin/activate
uv add torch modelscope vllm
Dependencies:
torch: PyTorch deep‑learning framework.
vllm: high‑performance LLM inference engine.
modelscope: Alibaba Cloud model download tool.
02 Verify PyTorch installation
uv run test_pytorch.py
The script should confirm that PyTorch is installed and detects an accelerator (CUDA on NVIDIA systems; the MPS backend on the macOS test platform).
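The article references test_pytorch.py without listing it; a minimal sketch could look like the following (the MPS branch is an addition for the macOS test platform):
# test_pytorch.py — sanity-check the PyTorch install (minimal sketch)
import torch

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
elif torch.backends.mps.is_available():
    # No CUDA on macOS; PyTorch falls back to the MPS backend there.
    print('Apple MPS backend available')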
03 Model download (ModelScope example)
# model_download.py
from modelscope import snapshot_download
model_dir = snapshot_download(
    'Qwen/Qwen3-8B',
    cache_dir='/path/to/models',
    revision='master'
)
Replace /path/to/models with a local directory and run:
uv run model_download.py
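For downloads from HuggingFace instead of ModelScope (the other option named under Environment Preparation), an equivalent sketch with huggingface_hub might look like this; the file name hf_download.py is assumed, not from the article:
# hf_download.py (illustrative) — HuggingFace alternative to the ModelScope script
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id='Qwen/Qwen3-8B',
    local_dir='/path/to/models/Qwen/Qwen3-8B',
)
print(model_dir)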
vLLM Inference Test
Using the Qwen/Qwen3-8B model (≈16–20 GB VRAM), two inference modes are demonstrated.
Thinking mode (default)
Prompt: '<|im_start|>user
Give me a short introduction to large language models.<|im_end|>
<|im_start|>assistant
'
Response: '<think>
Okay, the user wants a short introduction to large language models. ...(thinking process omitted)</think>
Large language models (LLMs) are deep‑learning models with enormous parameter counts, typically tens of billions or even trillions of parameters. Trained on massive text corpora, they gain strong language understanding, generation, and reasoning abilities, handling tasks such as text generation, question answering, code writing, and multilingual translation. Representative models include GPT, BERT, and PaLM. Their core strength is generalization: through pre‑training and fine‑tuning they adapt to many application scenarios, though they also face challenges such as high compute consumption and training cost.'
Non‑thinking mode
Disable thinking by passing enable_thinking=False to the chat template and adjust the sampling parameters accordingly.
Prompt: '<|im_start|>user
Give me a short introduction to large language models.<|im_end|>
<|im_start|>assistant
'
Response: 'A large model is a deep‑learning model, usually based on the Transformer architecture, with an enormous parameter count and strong language understanding and generation abilities. Such models handle complex natural‑language tasks including text generation, translation, question answering, and code writing. Trained on massive data, they offer strong generalization and context understanding and are widely used in intelligent customer service, content creation, data analysis, and other fields. Representative models include GPT, BERT, and Ernie Bot.'
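The test script itself is not reproduced in the article; the sketch below shows offline inference with vLLM's Python API covering both modes. The file name, model path, and sampling values are illustrative assumptions; enable_thinking is the Qwen3 chat‑template switch referenced above.
# qwen3_infer.py (illustrative) — offline inference in thinking / non-thinking mode
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL_DIR = '/path/to/models/Qwen/Qwen3-8B'  # directory from the download step

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
llm = LLM(model=MODEL_DIR, max_model_len=2048)

messages = [{'role': 'user', 'content': 'Give me a short introduction to large language models.'}]

# Thinking mode (default): the chat template adds a <think> block before the answer.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    # enable_thinking=False,  # uncomment for non-thinking mode
)

outputs = llm.generate([prompt], SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024))
print(outputs[0].outputs[0].text)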
Deploying an OpenAI‑compatible Service
Start the server with ModelScope enabled:
VLLM_USE_MODELSCOPE=true vllm serve \
/Volumes/Data1/LLMs/vllm/models/Qwen/Qwen3-8B \
--served-model-name Qwen3-8B \
--max-model-len 2048 \
--reasoning-parser deepseek_r1
Verify the model list:
curl http://localhost:8000/v1/models
The JSON response confirms the model ID, name, and permissions (sampling, view, etc.).
Simple completion request:
curl --location 'http://localhost:8000/v1/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "Qwen3-8B",
    "prompt": "/no_think What is the factorial of 3?",
    "max_tokens": 2000,
    "temperature": 0
}'
Result:
{
    "id": "cmpl-fb6ecf0c554d4ad984cabc9e8a7fc53a",
    "object": "text_completion",
    "created": 1757429422,
    "model": "Qwen3-8B",
    "choices": [{
        "index": 0,
        "text": " The factorial of 3 is 3×2×1=6. So, the factorial of 3 is 6.",
        "logprobs": null,
        "finish_reason": "stop"
    }],
    "usage": {
        "prompt_tokens": 10,
        "total_tokens": 34,
        "completion_tokens": 24
    }
}
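Because the endpoint is OpenAI‑compatible, the same request also works through the official openai Python SDK; a sketch (the api_key placeholder is ignored by vLLM unless --api-key is set):
# openai_client.py (illustrative) — calling the local server via the openai SDK
from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

resp = client.completions.create(
    model='Qwen3-8B',
    prompt='/no_think What is the factorial of 3?',
    max_tokens=2000,
    temperature=0,
)
print(resp.choices[0].text)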
Key Takeaways
vLLM achieves production‑grade throughput (up to 23× higher than a standard HuggingFace Transformers baseline) via PagedAttention and continuous batching.
Supports diverse quantization formats and multi‑GPU parallelism, making it suitable for large‑scale deployments.
OpenAI‑compatible API enables seamless integration with existing client code.
Data source: https://docs.vllm.ai/en/latest/index.html
Eric Tech Circle
Backend team lead & architect with 10+ years experience, full‑stack engineer, sharing insights and solo development practice.