
How to Deploy vLLM for Fast LLM Inference on GPU and CPU – A Step‑by‑Step Guide

This article walks through deploying the high‑performance vLLM LLM inference framework, covering GPU and CPU backend installation, environment setup, offline and online serving, API usage, and a performance comparison that highlights the ten‑fold speed advantage of GPU over CPU.


1 What is vLLM?

vLLM is an efficient, easy‑to‑use large language model (LLM) inference and serving framework, optimized for speed and throughput, especially in high‑concurrency production environments. Developed by the UC Berkeley research team, it is one of the most popular LLM inference engines.

vLLM supports both GPU and CPU backends; this article describes installation and operation for each.

2 Prerequisites

2.1 Purchase a virtual machine

If a local GPU is unavailable, you can rent a GPU server from cloud providers such as Alibaba Cloud or Tencent Cloud.

Ubuntu 22.04 is recommended; choose a GPU model that meets your needs and increase the disk capacity, since LLM weights consume substantial storage.

2.2 Virtual environment

It is recommended to use uv to manage the Python virtual environment. Install it with:

<code>curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env</code>

3 Installation

3.1 Using GPU as the vLLM backend

3.1.1 System requirements

Operating system: Linux

Python version: 3.9–3.12

GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX 20xx, A10, A100, L4, H100)

Note: Compute capability defines the hardware features of each NVIDIA GPU architecture and determines which CUDA or Tensor‑Core features are available; it does not directly indicate raw performance.

3.1.2 Install GPU dependencies

Run the following one‑liner to install the NVIDIA driver, container toolkit, and container runtime (required for Docker later):

<code>curl -sS https://raw.githubusercontent.com/cr7258/hands-on-lab/refs/heads/main/ai/gpu/setup/docker-only-install.sh | bash</code>

3.1.3 Install vLLM

Create a Python virtual environment:

<code># (Recommended) Create a new uv environment. Use `--seed` to install pip and setuptools in the environment.
uv venv --python 3.12 --seed
source .venv/bin/activate</code>

Then install vLLM:

<code>uv pip install vllm</code>

3.2 Using CPU as the vLLM backend

3.2.1 System requirements

Operating system: Linux

Python version: 3.9–3.12

Compiler: gcc/g++ ≥ 12.3.0 (optional, recommended)

3.2.2 Install compilation dependencies

vLLM does not provide pre‑built CPU packages; you must compile from source.

<code>uv venv vllm-cpu --python 3.12 --seed
source vllm-cpu/bin/activate
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12</code>

Clone the repository and install build dependencies:

<code>git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
pip install --upgrade pip
pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
pip install intel-extension-for-pytorch</code>

3.2.3 Build and install the CPU backend

<code>VLLM_TARGET_DEVICE=cpu python setup.py install</code>

3.3 Running vLLM with Docker

vLLM provides an official Docker image. Run the GPU backend with:

<code>docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct</code>

Key arguments:

--runtime nvidia: use the NVIDIA container runtime.

--gpus all: expose all host GPUs (or specify individual IDs).

-v …: mount the Hugging Face cache to avoid re‑downloading models.

-p 8000:8000: expose the OpenAI‑compatible API on port 8000.

--ipc=host (or --shm-size): give the container access to host shared memory, which PyTorch uses for high‑throughput inference.

For CPU you can use the image public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5 with the same flags, minus the GPU‑specific --runtime nvidia and --gpus options.

4 Offline vs. Online Inference

Offline inference processes a batch of inputs without real‑time latency requirements, suitable for tasks like nightly data processing. Online inference serves real‑time requests, requiring low latency and stable resource allocation, typical for chatbots or search assistants.

5 Offline inference example

After installing vLLM, you can run batch generation with the following script (basic.py):

<code># basic.py
# https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

def main():
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)
    print("\nGenerated Outputs:\n" + "-"*60)
    for output in outputs:
        print(f"Prompt:    {output.prompt!r}")
        print(f"Output:    {output.outputs[0].text!r}")
        print("-"*60)

if __name__ == "__main__":
    main()
</code>

The script uses SamplingParams to set temperature 0.8 and top‑p 0.95. Temperature controls randomness: lower values make output more deterministic, while higher values increase creativity. Top‑p (nucleus sampling) limits sampling to the smallest set of tokens whose cumulative probability exceeds the threshold.
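The top‑p rule can be sketched in plain Python. This is an illustrative toy, not vLLM's actual implementation, which operates on logit tensors:

```python
# Toy illustration of top-p (nucleus) sampling: keep the smallest set of
# tokens whose cumulative probability reaches the threshold, renormalize,
# and sample only from that reduced set.
def nucleus(probs: dict[str, float], top_p: float) -> dict[str, float]:
    kept = {}
    cumulative = 0.0
    # Walk tokens from most to least probable.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:  # stop once the nucleus covers top_p mass
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# With top_p=0.95 the least likely token "d" falls outside the nucleus.
print(nucleus({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, top_p=0.95))
```

With a high temperature the distribution flattens before this step, so more tokens make it into the nucleus; with a low temperature the head tokens dominate and the nucleus shrinks.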

6 Online inference (OpenAI‑compatible API)

Start the server with:

<code>vllm serve Qwen/Qwen2.5-1.5B-Instruct</code>

The server listens on http://localhost:8000. You can query the model via the OpenAI Completions API:

<code>curl -sS http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen2.5-1.5B-Instruct","prompt":"San Francisco is a"}' | jq</code>

Or use the OpenAI Python SDK after setting api_key="EMPTY" and base_url="http://localhost:8000/v1". The same works for the Chat Completions endpoint, allowing multi‑turn conversations.
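The same chat request can also be built with only the standard library; this sketch (the model name and prompt are just examples) constructs the request, and sending it requires the server above to be running:

```python
import json
import urllib.request

def build_chat_request(model: str, messages: list[dict]) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",  # vLLM accepts any key unless --api-key is set
        },
    )

req = build_chat_request(
    "Qwen/Qwen2.5-1.5B-Instruct",
    [{"role": "user", "content": "Say hello in one sentence."}],
)
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

For multi‑turn conversations, append each assistant reply and the next user message to the messages list before the next call.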

7 Performance comparison between GPU and CPU backends

Test hardware: 32 vCPU, 188 GiB RAM, NVIDIA A10 GPU. Over a loop of 100 requests, the GPU backend generates roughly 130 tokens/s while the CPU backend generates about 12 tokens/s, i.e., a ten‑fold speed advantage. Prefix caching was disabled on both sides for a fair comparison.
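The tokens/s figure is a simple aggregate: total generated tokens divided by elapsed wall‑clock time. A minimal sketch of the arithmetic (the timing loop around llm.generate() is shown as comments, since it needs a loaded model):

```python
def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    """Aggregate generation throughput over a batch of requests."""
    return total_tokens / elapsed_s

# In a real benchmark you would time the batch, e.g.:
#   import time
#   start = time.perf_counter()
#   outputs = llm.generate(prompts, sampling_params)
#   elapsed = time.perf_counter() - start
#   total = sum(len(o.outputs[0].token_ids) for o in outputs)

print(tokens_per_second(1300, 10.0))  # 1300 tokens in 10 s -> 130.0 tokens/s
```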

Conclusion

The article provides a comprehensive guide to deploying the high‑performance vLLM LLM inference framework, covering environment preparation, GPU/CPU backend configuration, offline and online serving, and a practical performance benchmark that demonstrates the substantial throughput gain when using a GPU.

Tags: Python, vLLM, LLM inference, GPU deployment, OpenAI API, CPU deployment
Written by Ops Development Stories

Maintained by a like‑minded team covering both operations and development. Topics span Linux ops, the DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
