How to Deploy GPUStack with Docker for Scalable AI Model Serving
This guide walks you through installing NVIDIA drivers and Docker, configuring the NVIDIA Container Toolkit, and deploying GPUStack in Docker to manage heterogeneous GPU resources, run large language, multimodal, diffusion, and embedding models, and scale from a single node to a multi‑node GPU cluster.
Docker Run GPUStack Detailed Tutorial
GPUStack
GPUStack is an open‑source GPU cluster manager for running AI models. It supports a wide range of hardware (Apple Metal, NVIDIA CUDA, AMD ROCm, Huawei Ascend, MUSA, and more) and model types (LLMs, VLMs, diffusion, embedding, re‑ranking, and speech models). GPUStack can scale by adding GPUs or nodes, supports single‑node multi‑GPU and multi‑node inference, and offers multiple inference back‑ends such as llama-box, vox-box and vLLM. It is a lightweight Python package with minimal dependencies, provides an OpenAI‑compatible API, simplifies user and API‑key management, and offers real‑time GPU performance monitoring and token/rate‑limit tracking.
Key Features
Broad hardware compatibility : manage GPUs on Apple Mac, Windows PC, and Linux servers.
Extensive model support : LLMs, VLMs, diffusion, audio, embedding, and re‑ranking models.
Heterogeneous GPU & scaling : add heterogeneous GPU resources and expand compute capacity on demand.
Distributed inference : supports single‑node multi‑GPU and multi‑node parallel inference.
Multiple inference back‑ends : llama-box (based on llama.cpp), vox-box, and vLLM.
Lightweight Python package : minimal dependencies and overhead.
OpenAI‑compatible API : standard API service.
User & API key management : streamlined credential handling.
GPU metrics monitoring : real‑time performance and utilization.
Token usage & rate statistics : track token consumption and enforce rate limits.
Supported Hardware Platforms
Apple Metal (M‑series chips)
NVIDIA CUDA (compute capability 6.0+)
AMD ROCm
Huawei Ascend (CANN)
MUSA
Sea‑AI DTK
Supported Model Types
Large Language Models (e.g., Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi)
Multimodal Models (e.g., Llama3.2‑Vision, Pixtral, Qwen2‑VL, LLaVA, InternVL2.5)
Diffusion Models (e.g., Stable Diffusion, FLUX)
Embedding Models (e.g., BGE, BCE, Jina)
Re‑ranking Models (e.g., BGE, BCE, Jina)
Speech Models (e.g., Whisper, CosyVoice)
Use Cases
GPUStack is ideal for scenarios that require efficient GPU resource management and scheduling, especially when serving AI models. It supports both single‑node multi‑GPU and multi‑node inference/services, offering flexible back‑ends.
1. Environment Preparation
Hardware & System Requirements
Ensure an NVIDIA GPU is installed and drivers compatible (CUDA 11.0+).
Recommended OS: Ubuntu 22.04 LTS or CentOS 7+.
Verify GPU & Dependencies
# Check NVIDIA GPU detection
lspci | grep -i nvidia
# Verify GCC installation
gcc --version2. Install NVIDIA Drivers & Docker
Install NVIDIA driver
# Install kernel headers
sudo apt-get install linux-headers-$(uname -r)
# Add CUDA repository and install driver
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install nvidia-driver-535 -y
sudo reboot
# Verify driver
nvidia-smiInstall Docker Engine
# Remove old Docker versions
sudo apt-get remove docker.io docker-doc containerd
# Add Docker official repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io -y
# Verify Docker
docker infoConfigure NVIDIA Container Toolkit
# Add repository and install toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit -y
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify CUDA container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi3. Deploy GPUStack Container
Run GPUStack master node
docker run -d \
--gpus all \
-p 890:80 \
--ipc=host \
--name gpustack \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest --gpus all: expose all GPU resources. --ipc=host: share host IPC namespace for performance. -v gpustack-data: persist configuration and model data.
Get initial admin password
docker exec -it gpustack cat /var/lib/gpustack/initial_admin_passwordAccess http://<server‑IP> with user admin and the retrieved password (change on first login).
4. Expand GPU Cluster
Add Worker node
Obtain token from master:
docker exec -it gpustack cat /var/lib/gpustack/tokenRun worker on a new machine:
docker run -d \
--gpus all \
--network=host \
--ipc=host \
gpustack/gpustack \
--server-url http://<master‑IP> \
--token <token>5. Functional Usage Examples
Deploy a large model In the GPUStack console, go to the Models page and import a model from Hugging Face or a local path (e.g., Llama3.2). GPUStack automatically allocates GPU resources and creates an API endpoint.
Playground testing Use the Playground to test multimodal models (e.g., Stable Diffusion) or text embedding models (e.g., BERT), compare multiple models, and tune parameters.
6. Frequently Asked Questions
GPU not recognized : run nvidia-smi to verify driver installation and Docker runtime configuration.
Container fails to start : ensure --ipc=host is set and persistent volumes are mounted.
Network issues : open firewall ports 80 and the internal RPC port (default 6789) for cross‑node communication.
7. References
GPUStack official Docker deployment documentation.
NVIDIA Container Toolkit configuration guide.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
