Step‑by‑Step Guide to Deploying Large Language Models Locally with VLLM and Ollama
This article walks through two mainstream local deployment solutions—high‑performance VLLM for production Linux servers and lightweight Ollama for personal Windows machines—covering environment setup, model download, server launch, API testing, key configuration parameters, and the quantization technique that makes Ollama models compact.
Local deployment of large language models (LLMs) offers data privacy, full control, and predictable long‑term costs compared with cloud APIs. The guide compares two approaches: VLLM, a Python library from UC Berkeley optimized for high‑throughput inference on Linux GPUs, and Ollama, a one‑click tool built on llama.cpp that runs on Windows and low‑end hardware.
VLLM Deployment
VLLM provides memory‑efficient PagedAttention, continuous batching of incoming requests, and an OpenAI‑compatible API server. It requires Linux and a capable GPU (the tutorial uses an H100 with 80 GB of VRAM). The walkthrough runs on a Lab4AI cloud instance: create a VS Code cloud VM, select the provided Anaconda environment (named lf) that already contains llamafactory and VLLM, and verify the installation with pip show vllm.
Key steps:
Create a Lab4AI instance and open a VS Code terminal.
Select the appropriate Docker image and start the VM.
Check the environment: lf conda env with VLLM installed.
Inspect GPU memory with nvidia-smi; a 32B‑parameter model needs ~66 GB of VRAM just for its FP16 weights (about 2 bytes per parameter, before KV‑cache overhead), so the tutorial uses the lighter Qwen3-4B model instead.
Download the model from ModelScope:
modelscope download --model Qwen/Qwen3-4B --local_dir ./Qwen3-4B
Start the VLLM server:
vllm serve ./Qwen3-4B/ --served-model-name Qwen3-4B --max-model-len 32768 --gpu-memory-utilization 0.9 --port 6666
Test the service with a Python script using the OpenAI client:
from openai import OpenAI
# Point the client at the local VLLM server; the key can be any string when no --api-key was set.
client = OpenAI(base_url="http://localhost:6666/v1", api_key="EMPTY")
response = client.chat.completions.create(model="Qwen3-4B", messages=[{"role": "user", "content": "你好"}])
print(response.choices[0].message.content)
Common VLLM parameters (illustrated as a list; a combined launch example follows it):
--max-model-len 32768 : maximum context length (32 K tokens for Qwen3).
--gpu-memory-utilization 0.8‑0.95 : proportion of GPU memory used.
--tensor-parallel-size : number of GPUs (must be a power of two).
--max-num-seqs 256 : maximum concurrent requests.
--enforce-eager : required on Ascend NPU to avoid compilation errors.
--api-key <key> : optional API key for basic security.
--enable-auto-tool-choice (with a matching --tool-call-parser) : enable tool/function calling.
--pipeline-parallel-size : pipeline parallelism, used together with tensor parallelism.
--enable-expert-parallel True/False : MoE model optimization.
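For instance, a two‑GPU launch of a larger model such as Qwen3-32B might combine several of these flags; this is a sketch only, and the model path, key, and port are placeholders to adapt to your setup:
vllm serve ./Qwen3-32B/ --served-model-name Qwen3-32B --tensor-parallel-size 2 --max-model-len 32768 --gpu-memory-utilization 0.9 --max-num-seqs 256 --api-key my-secret-key --port 6666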
Optimization tips include multi‑instance load balancing with Nginx and combining tensor, pipeline, and expert parallelism for very large models.
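A minimal Nginx sketch of that load‑balancing idea, assuming two VLLM instances are already listening on ports 6666 and 6667 (ports and the listen address are illustrative):

upstream vllm_backends {
    least_conn;                 # send each request to the least-busy instance
    server 127.0.0.1:6666;
    server 127.0.0.1:6667;
}

server {
    listen 8000;
    location /v1/ {
        proxy_pass http://vllm_backends;   # forward OpenAI-style requests to the pool
    }
}

Clients then point their base_url at http://<host>:8000/v1 and Nginx spreads requests across the instances.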
Ollama Deployment
Ollama targets personal devices and Windows. It wraps llama.cpp, provides a one‑click installer, and uses model quantization (INT4) to shrink model size dramatically (e.g., Qwen3‑4B drops from 8.1 GB to ~2.5 GB). The workflow is:
Download the Windows installer from https://ollama.com/ and run it.
After installation, change the default model storage directory to a non‑system drive (e.g., by setting the OLLAMA_MODELS environment variable and restarting Ollama).
Search for a model (e.g., qwen3) on the Ollama website and pull it with ollama pull qwen3:4b (plain ollama run qwen3 fetches the default 8B build).
Test the model directly in the Ollama UI or via the OpenAI‑compatible API (default port 11434):
from openai import OpenAI
# Ollama serves an OpenAI-compatible API on port 11434; the key is required by the client but not checked.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="EMPTY")
response = client.chat.completions.create(model="qwen3:4b", messages=[{"role": "user", "content": "你好"}])
print(response.choices[0].message.content)
It also helps to understand Ollama's storage layout: a blobs folder holds the quantized binary weight files, and a manifests folder stores the metadata that maps a model tag to its blobs.
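On a default install, the models directory looks roughly like the following (the digests are placeholders and the exact tag depends on what was pulled):

models/
├── blobs/
│   ├── sha256-<digest of the GGUF weight file>
│   └── sha256-<digests of smaller config and template blobs>
└── manifests/
    └── registry.ollama.ai/
        └── library/
            └── qwen3/
                └── 4b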
The quantization step converts FP16 weights to INT4, cutting storage roughly fourfold and making CPU‑only inference practical. Ollama stores models in the GGUF format, a single‑file binary designed for fast loading and inference, unlike the multi‑file *.safetensors layout used in the VLLM workflow above.
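The size figures quoted earlier can be sanity‑checked with a back‑of‑envelope estimate; this rough sketch ignores quantization metadata and the tensors that usually stay at higher precision:

# Approximate on-disk size of a 4B-parameter model at two precisions.
params = 4e9                   # 4 billion weights
fp16_gb = params * 2 / 1e9     # FP16 stores 2 bytes per weight
int4_gb = params * 0.5 / 1e9   # INT4 stores 4 bits (0.5 bytes) per weight

print(f"FP16: ~{fp16_gb:.1f} GB")  # ~8 GB, close to the 8.1 GB original download
print(f"INT4: ~{int4_gb:.1f} GB")  # ~2 GB; real GGUF files land nearer 2.5 GB

The gap between the ~2 GB estimate and the ~2.5 GB file comes largely from layers kept at higher precision and the scale factors that quantization stores alongside the weights.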
Conclusion
The guide equips readers with end‑to‑end instructions for both production‑grade VLLM deployment on Linux GPUs and lightweight Ollama deployment on personal machines, explains the underlying quantization that powers Ollama’s small model size, and provides practical command‑line examples and configuration tips for building reliable local LLM services.
Fun with Large Models
Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!