Artificial Intelligence 15 min read

How to Deploy GPUStack with Docker for Scalable AI Model Serving

This guide walks you through installing NVIDIA drivers, Docker, and the NVIDIA Container Toolkit, then shows step‑by‑step how to run GPUStack in Docker, expand a GPU cluster, and serve large language, multimodal, diffusion, and embedding models with OpenAI‑compatible APIs.

MaGe Linux Operations

Jun 3, 2025

How to Deploy GPUStack with Docker for Scalable AI Model Serving

Docker Tutorial for Running GPUStack

GPUStack is an open‑source GPU cluster manager designed for AI workloads. It supports a wide range of hardware, multiple model families (LLMs, VLMs, diffusion, audio, embedding, reranking), and offers distributed inference with back‑ends such as llama-box, vox-box, and vLLM. The lightweight Python package provides an OpenAI‑compatible API, real‑time GPU monitoring, token usage tracking, and simple user/API‑key management.

Key Features

Broad hardware compatibility : manages GPUs on Apple Metal (M‑series), NVIDIA CUDA, AMD ROCm, Huawei Ascend (CANN), MooreThreads MUSA, and HaiGuang DTK.

Extensive model support : LLMs (Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi), multimodal VLMs (Llama‑3.2‑Vision, Pixtral, Qwen‑2‑VL, LLaVA, InternVL2.5), diffusion models (Stable Diffusion, FLUX), speech models (Whisper, CosyVoice), embedding and reranking models (BGE, BCE, Jina).

Heterogeneous GPU & scaling : add mixed‑GPU nodes on‑the‑fly, scale compute power as needed.

Distributed inference : single‑node multi‑GPU and multi‑node multi‑GPU parallel inference.

Multiple inference back‑ends : llama-box (based on llama.cpp), vox-box, vLLM.

Lightweight Python package : minimal dependencies and overhead.

OpenAI‑compatible API : standard REST endpoints for model serving.

User & API‑key management : simplified credential handling.

GPU metrics monitoring : real‑time performance and utilization.

Token usage & rate‑limit statistics : accurate tracking and enforcement.

Supported Hardware Platforms

Apple Metal (M‑series chips)

NVIDIA CUDA (compute capability 6.0+)

AMD ROCm

Huawei Ascend (CANN)

MooreThreads MUSA

HaiGuang DTK

Supported Model Types

Large Language Models (LLMs): Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi, etc.

Multimodal Vision‑Language Models (VLMs): Llama‑3.2‑Vision, Pixtral, Qwen‑2‑VL, LLaVA, InternVL2.5.

Diffusion Models: Stable Diffusion, FLUX.

Audio Models: Whisper (speech‑to‑text), CosyVoice (text‑to‑speech).

Embedding Models: BGE, BCE, Jina.

Reranking Models: BGE, BCE, Jina.

Usage Scenarios

GPUStack is ideal for environments that need efficient GPU resource management and scheduling for AI inference, supporting both single‑node multi‑GPU and multi‑node clusters with various back‑ends.

Step‑by‑Step Tutorial

1. Environment Preparation

Hardware & System Requirements

Ensure an NVIDIA GPU is installed; driver compatible with CUDA 11.0+.

Recommended OS: Ubuntu 22.04 LTS or CentOS 7+.

Verify GPU & Dependencies

# Check NVIDIA GPU detection
lspci | grep -i nvidia

# Verify GCC version
gcc --version

2. Install NVIDIA Driver & Docker

Install NVIDIA Driver

# Install kernel headers
sudo apt-get install linux-headers-$(uname -r)
# Add CUDA repository and install driver
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install nvidia-driver-535 -y
sudo reboot
# Verify driver
nvidia-smi

Install Docker Engine

# Remove old Docker versions
sudo apt-get remove docker.io docker-doc containerd
# Add Docker official repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io -y
# Verify Docker
docker info

Configure NVIDIA Container Toolkit

# Add repository and install toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit -y
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test CUDA container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

3. Deploy GPUStack Container

docker run -d \
  --gpus all \
  -p 890:80 \
  --ipc=host \
  --name gpustack \
  -v gpustack-data:/var/lib/gpustack \
  gpustack/gpustack:latest

Parameter notes : --gpus all: expose all GPU devices. --ipc=host: share host IPC namespace for better performance. -v gpustack-data: persist configuration and model data.

4. Retrieve Initial Admin Password

docker exec -it gpustack cat /var/lib/gpustack/initial_admin_password

Access the UI at http://<server‑IP> using admin and the retrieved password (change it on first login).

5. Expand GPU Cluster

Obtain a token from the master node:

docker exec -it gpustack cat /var/lib/gpustack/token

Run a worker node:

docker run -d \
  --gpus all \
  --network=host \
  --ipc=host \
  gpustack/gpustack \
  --server-url http://<master‑IP> \
  --token <token‑from‑master>

6. Functional Usage Examples

Deploy a Large Model : In the GPUStack console, go to the Models page and import a model from Hugging Face or a local path (e.g., Llama‑3.2). The system automatically allocates GPU resources and creates an API endpoint.

Playground Testing : Use the built‑in Playground to test multimodal models (Stable Diffusion), text embeddings (BERT), and compare multiple models with parameter tuning.

7. Common Issues

GPU not recognized : run nvidia-smi and verify Docker runtime configuration.

Container fails to start : ensure --ipc=host is set and the persistent volume is mounted.

Network problems : open firewall port 80 and internal RPC port 6789 for cross‑node communication.

8. References

GPUStack official Docker deployment documentation.

NVIDIA Container Toolkit configuration guide.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Docker AI model deployment GPU Cluster OpenAI API GPUStack

Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.