Artificial Intelligence 15 min read

How to Deploy GPUStack with Docker for Scalable AI Model Serving

This guide walks you through installing NVIDIA drivers and Docker, configuring the NVIDIA Container Toolkit, and deploying GPUStack in Docker to manage heterogeneous GPU resources, run large language, multimodal, diffusion, and embedding models, and scale from a single node to a multi‑node GPU cluster.

Raymond Ops

Nov 4, 2025

Docker Run GPUStack Detailed Tutorial

GPUStack

GPUStack is an open‑source GPU cluster manager for running AI models. It supports a wide range of hardware (Apple Metal, NVIDIA CUDA, AMD ROCm, Huawei Ascend, MUSA, and more) and model types (LLMs, VLMs, diffusion, embedding, re‑ranking, and speech models). GPUStack can scale by adding GPUs or nodes, supports single‑node multi‑GPU and multi‑node inference, and offers multiple inference back‑ends such as llama-box, vox-box and vLLM. It is a lightweight Python package with minimal dependencies, provides an OpenAI‑compatible API, simplifies user and API‑key management, and offers real‑time GPU performance monitoring and token/rate‑limit tracking.

Key Features

Broad hardware compatibility : manage GPUs on Apple Mac, Windows PC, and Linux servers.

Extensive model support : LLMs, VLMs, diffusion, audio, embedding, and re‑ranking models.

Heterogeneous GPU & scaling : add heterogeneous GPU resources and expand compute capacity on demand.

Distributed inference : supports single‑node multi‑GPU and multi‑node parallel inference.

Multiple inference back‑ends : llama-box (based on llama.cpp), vox-box, and vLLM.

Lightweight Python package : minimal dependencies and overhead.

OpenAI‑compatible API : standard API service.

User & API key management : streamlined credential handling.

GPU metrics monitoring : real‑time performance and utilization.

Token usage & rate statistics : track token consumption and enforce rate limits.

Supported Hardware Platforms

Apple Metal (M‑series chips)

NVIDIA CUDA (compute capability 6.0+)

AMD ROCm

Huawei Ascend (CANN)

MUSA

Sea‑AI DTK

Supported Model Types

Large Language Models (e.g., Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi)

Multimodal Models (e.g., Llama3.2‑Vision, Pixtral, Qwen2‑VL, LLaVA, InternVL2.5)

Diffusion Models (e.g., Stable Diffusion, FLUX)

Embedding Models (e.g., BGE, BCE, Jina)

Re‑ranking Models (e.g., BGE, BCE, Jina)

Speech Models (e.g., Whisper, CosyVoice)

Use Cases

GPUStack is ideal for scenarios that require efficient GPU resource management and scheduling, especially when serving AI models. It supports both single‑node multi‑GPU and multi‑node inference/services, offering flexible back‑ends.

1. Environment Preparation

Hardware & System Requirements

Ensure an NVIDIA GPU is installed and drivers compatible (CUDA 11.0+).

Recommended OS: Ubuntu 22.04 LTS or CentOS 7+.

Verify GPU & Dependencies

# Check NVIDIA GPU detection
lspci | grep -i nvidia

# Verify GCC installation
gcc --version

2. Install NVIDIA Drivers & Docker

Install NVIDIA driver

# Install kernel headers
sudo apt-get install linux-headers-$(uname -r)
# Add CUDA repository and install driver
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install nvidia-driver-535 -y
sudo reboot
# Verify driver
nvidia-smi

Install Docker Engine

# Remove old Docker versions
sudo apt-get remove docker.io docker-doc containerd
# Add Docker official repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io -y
# Verify Docker
docker info

Configure NVIDIA Container Toolkit

# Add repository and install toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit -y
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify CUDA container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

3. Deploy GPUStack Container

Run GPUStack master node

docker run -d \
  --gpus all \
  -p 890:80 \
  --ipc=host \
  --name gpustack \
  -v gpustack-data:/var/lib/gpustack \
  gpustack/gpustack:latest

--gpus all

: expose all GPU resources. --ipc=host: share host IPC namespace for performance. -v gpustack-data: persist configuration and model data.

Get initial admin password

docker exec -it gpustack cat /var/lib/gpustack/initial_admin_password

Access http://<server‑IP> with user admin and the retrieved password (change on first login).

4. Expand GPU Cluster

Add Worker node

Obtain token from master:

docker exec -it gpustack cat /var/lib/gpustack/token

Run worker on a new machine:

docker run -d \
  --gpus all \
  --network=host \
  --ipc=host \
  gpustack/gpustack \
  --server-url http://<master‑IP> \
  --token <token>

5. Functional Usage Examples

Deploy a large model In the GPUStack console, go to the Models page and import a model from Hugging Face or a local path (e.g., Llama3.2). GPUStack automatically allocates GPU resources and creates an API endpoint.

Playground testing Use the Playground to test multimodal models (e.g., Stable Diffusion) or text embedding models (e.g., BERT), compare multiple models, and tune parameters.

6. Frequently Asked Questions

GPU not recognized : run nvidia-smi to verify driver installation and Docker runtime configuration.

Container fails to start : ensure --ipc=host is set and persistent volumes are mounted.

Network issues : open firewall ports 80 and the internal RPC port (default 6789) for cross‑node communication.

7. References

GPUStack official Docker deployment documentation.

NVIDIA Container Toolkit configuration guide.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Docker Nvidia AI Model Deployment GPU cluster OpenAI API GPUStack

Written by

Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.