How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide
This article walks through deploying the high‑performance vLLM LLM inference framework, covering GPU and CPU backend installation, environment setup, offline and online serving, API usage, and a performance comparison that highlights the ten‑fold speed advantage of GPU over CPU.
1 What is vLLM?
vLLM is an efficient, easy‑to‑use large language model (LLM) inference and serving framework, optimized for speed and throughput, especially in high‑concurrency production environments. Developed by the UC Berkeley research team, it is one of the most popular LLM inference engines.
vLLM supports both GPU and CPU backends; this article describes installation and operation for each.
2 Prerequisites
2.1 Purchase a virtual machine
If a local GPU is unavailable, you can rent a GPU server from cloud providers such as Alibaba Cloud or Tencent Cloud.
Ubuntu 22.04 is recommended; choose a GPU model that meets your needs and increase disk capacity because LLMs consume large storage.
2.2 Virtual environment
It is recommended to use uv to manage the Python virtual environment. Install it with:
<code>curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env</code>
3 Installation
3.1 Using GPU as the vLLM backend
3.1.1 System requirements
Operating system : Linux
Python version : 3.9 ~ 3.12
GPU : Compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A10, A100, L4, H100)
Note: Compute capability defines the hardware features of each NVIDIA GPU architecture and determines which CUDA or Tensor‑Core features are available; it does not directly indicate raw performance.
3.1.2 Install GPU dependencies
Run the following one‑liner to install the NVIDIA driver, container toolkit, and container runtime (required for Docker later):
<code>curl -sS https://raw.githubusercontent.com/cr7258/hands-on-lab/refs/heads/main/ai/gpu/setup/docker-only-install.sh | bash</code>
3.1.3 Install vLLM
Create a Python virtual environment:
<code># (Recommended) Create a new uv environment. Use `--seed` to install pip and setuptools in the environment.
uv venv --python 3.12 --seed
source .venv/bin/activate</code>
Then install vLLM:
<code>uv pip install vllm</code>
3.2 Using CPU as the vLLM backend
3.2.1 System requirements
Operating system : Linux
Python version : 3.9 ~ 3.12
Compiler : gcc/g++ ≥ 12.3.0 (optional, recommended)
3.2.2 Install compilation dependencies
vLLM does not provide pre‑built CPU packages; you must compile from source.
<code>uv venv vllm-cpu --python 3.12 --seed
source vllm-cpu/bin/activate
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12</code>
Clone the repository and install build dependencies:
<code>git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch</code>
3.2.3 Build and install the CPU backend
<code>VLLM_TARGET_DEVICE=cpu python setup.py install</code>
3.3 Running vLLM with Docker
vLLM provides an official Docker image. Run the GPU backend with:
<code>docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-1.5B-Instruct</code>
Key arguments:
--runtime nvidia: use NVIDIA container runtime.
--gpus all: expose all host GPUs (or specify IDs).
-v …: mount Hugging Face cache to avoid re‑downloading.
-p 8000:8000: expose the OpenAI‑compatible API on port 8000.
--ipc=host (or --shm-size): give the container access to host shared memory, which PyTorch uses for high‑throughput inference.
For the CPU backend you can use the image public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5 with the same flags, omitting the GPU‑specific --runtime nvidia and --gpus all.
4 Offline vs. Online Inference
Offline inference processes a batch of inputs without real‑time latency requirements, suitable for tasks like nightly data processing. Online inference serves real‑time requests, requiring low latency and stable resource allocation, typical for chatbots or search assistants.
5 Offline inference example
After installing vLLM, you can run batch generation with the following script (basic.py):
<code># basic.py
# https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

def main():
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}")
        print(f"Output: {output.outputs[0].text!r}")
        print("-" * 60)

if __name__ == "__main__":
    main()
</code>
The script uses SamplingParams to set temperature 0.8 and top‑p 0.95. Temperature controls randomness; lower values make output more deterministic, higher values increase creativity. Top‑p (nucleus sampling) limits sampling to the smallest set of tokens whose cumulative probability exceeds the threshold.
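To make the top‑p rule concrete, here is a minimal standalone sketch of nucleus filtering. It is not vLLM's internal implementation; the function name and the toy probability distribution are made up for illustration.

```python
# Minimal sketch of nucleus (top-p) filtering, independent of vLLM.
# Keep the smallest set of highest-probability tokens whose cumulative
# probability reaches the threshold p; sampling then happens inside that set.

def top_p_candidates(probs: dict[str, float], p: float) -> list[str]:
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers probability p
            break
    return kept

# Toy distribution: with p=0.95 the unlikely tail token "Oslo" is excluded.
probs = {"Paris": 0.70, "Lyon": 0.20, "Nice": 0.06, "Oslo": 0.04}
print(top_p_candidates(probs, 0.95))  # ['Paris', 'Lyon', 'Nice']
```

With a lower threshold such as p=0.7, only "Paris" would survive, which is why small top‑p values make generation more conservative.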
6 Online inference (OpenAI‑compatible API)
Start the server with:
<code>vllm serve Qwen/Qwen2.5-1.5B-Instruct</code>
The server listens on http://localhost:8000. You can query the model via the OpenAI Completion API:
<code>curl -sS http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"San Francisco is a"}' | jq</code>
Or use the OpenAI Python SDK after setting api_key="EMPTY" and base_url="http://localhost:8000/v1". The same works for the Chat Completions endpoint, allowing multi‑turn conversations.
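If you prefer not to install the OpenAI SDK, the same Chat Completions request can be issued with the Python standard library. This is a sketch under the article's assumptions (a `vllm serve` instance already listening on port 8000); the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen2.5-1.5B-Instruct") -> urllib.request.Request:
    # Build an OpenAI-compatible chat completion request for the local vLLM server.
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str) -> str:
    # Sends the request; requires the server from `vllm serve` to be running.
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because vLLM exposes the standard OpenAI schema, the same payload shape works with any OpenAI‑compatible client.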
7 Performance comparison between GPU and CPU backends
Test hardware: 32 vCPUs, 188 GiB RAM, and an NVIDIA A10 GPU. Over a loop of 100 requests, the GPU backend generated roughly 130 tokens/s while the CPU backend managed about 12 tokens/s, roughly a ten‑fold advantage. Prefix caching was disabled on both sides for a fair comparison.
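The tokens/s figures above are simple averages over the whole run. A minimal sketch of how such a number can be derived from a request loop (the `generate` callable is a placeholder for a call into the serving backend that returns the token count for one prompt):

```python
import time

def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    # Throughput averaged over the entire run.
    return total_tokens / elapsed_seconds

def measure(generate, prompts) -> float:
    # Time a batch of requests and report aggregate token throughput.
    start = time.perf_counter()
    total = sum(generate(p) for p in prompts)
    return tokens_per_second(total, time.perf_counter() - start)
```

For example, 13,000 tokens generated in 100 seconds corresponds to the 130 tokens/s observed on the GPU backend.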
Conclusion
The article provides a comprehensive guide to deploying the high‑performance vLLM LLM inference framework, covering environment preparation, GPU/CPU backend configuration, offline and online serving, and a practical performance benchmark that demonstrates the substantial throughput gain when using a GPU.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.