Master Multi‑GPU Load Balancing for OLLAMA: From Setup to Production
This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA and Docker setup, native and containerized deployment methods, core parameter tuning, advanced sharding, dynamic monitoring, troubleshooting, production best practices, and a real‑world RTX 4090 case study.
Why Multi‑GPU Load Balancing Matters
In AI model inference and training, a single GPU often cannot meet performance requirements. Proper multi‑GPU load balancing can:
Increase overall throughput 2‑4×
Reduce inference latency by 30‑50%
Improve GPU utilization and avoid idle devices
Enhance system stability by distributing compute pressure
Environment Preparation
Hardware Checks
# Verify GPU presence and driver
nvidia-smi
lspci | grep -i nvidia
# Check CUDA toolkit version
nvcc --version
cat /usr/local/cuda/version.txt
Software Setup
# Install CUDA driver and toolkit (example for Ubuntu)
sudo apt update
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
# Verify installation
nvidia-smi
nvcc --version
# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
OLLAMA Multi‑GPU Configuration
Native Multi‑GPU Setup
# Install OLLAMA
curl -fsSL https://ollama.ai/install.sh | sh
# Export environment variables to expose all GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OLLAMA_GPU_LAYERS=35 # number of model layers per GPU
export OLLAMA_NUM_PARALLEL=4 # parallel inference threads
export OLLAMA_MAX_LOADED_MODELS=2 # maximum models kept in memory
# Start the OLLAMA service
ollama serve
Docker‑Compose Deployment (recommended for production)
Create a docker-compose.yml file with the following content:
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-multi-gpu
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=0,1,2,3
      - OLLAMA_GPU_LAYERS=35
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_KEEP_ALIVE=24h
    volumes:
      - ./ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3
Start the service with docker-compose up -d and monitor logs using docker-compose logs -f ollama.
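Beyond tailing the logs, readiness can be confirmed by polling the /api/tags endpoint until it answers. A minimal sketch (timeout values are arbitrary; the probe argument exists only so the retry loop can be exercised without a live server):

```python
import time
import urllib.error
import urllib.request

def wait_for_ollama(url="http://localhost:11434/api/tags",
                    timeout=60, interval=2, probe=None):
    """Poll the OLLAMA API until it responds, or give up after `timeout` seconds."""
    if probe is None:
        def probe():
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

A deployment script can call wait_for_ollama() right after docker-compose up -d instead of sleeping a fixed number of seconds.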
Core Configuration Parameters
GPU Memory Management
# Use 80 % of each GPU's memory
export OLLAMA_GPU_MEMORY_FRACTION=0.8
# Split model by layer across GPUs
export OLLAMA_GPU_SPLIT_MODE=layer
# Enable dynamic GPU allocation and memory pooling
export OLLAMA_DYNAMIC_GPU=true
export OLLAMA_GPU_MEMORY_POOL=true
Load‑Balancing Algorithm
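To illustrate what layer-mode splitting does, here is a hypothetical sketch (not OLLAMA internals) that assigns a model's layers to GPUs in proportion to per-device weights:

```python
# Hypothetical illustration of layer-mode splitting: divide a model's
# layers among GPUs in proportion to each GPU's weight.
def split_layers(total_layers, weights):
    """Return a list of (start, end) layer ranges, one per GPU."""
    total_weight = sum(weights)
    ranges, start = [], 0
    for i, w in enumerate(weights):
        # The last GPU absorbs any rounding remainder.
        if i == len(weights) - 1:
            count = total_layers - start
        else:
            count = round(total_layers * w / total_weight)
        ranges.append((start, start + count))
        start += count
    return ranges

# An 80-layer model split evenly across 4 GPUs:
print(split_layers(80, [1.0, 1.0, 1.0, 1.0]))
# [(0, 20), (20, 40), (40, 60), (60, 80)]
```

Unequal weights would shift more layers onto the GPUs with more free VRAM.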
import json

config = {
    "gpu_allocation": {
        "strategy": "round_robin",  # alternatives: least_loaded, manual
        "devices": [0, 1, 2, 3],
        "weights": [1.0, 1.0, 1.0, 1.0],
        "memory_threshold": 0.85
    },
    "model_sharding": {
        "enabled": True,
        "shard_size": "auto",
        "overlap_ratio": 0.1
    },
    "performance": {
        "batch_size": 4,
        "max_concurrent_requests": 16,
        "tensor_parallel_size": 4
    }
}

with open('/etc/ollama/load_balance.json', 'w') as f:
    json.dump(config, f, indent=2)
Advanced Load‑Balancing Strategies
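The strategy field in the configuration above can be illustrated with a minimal, hypothetical selector (not OLLAMA internals): round_robin cycles through the devices, while least_loaded picks the least-utilized GPU among those under the memory threshold:

```python
import itertools

# Hypothetical request router illustrating the two strategies named in
# the config; `stats` maps gpu_id -> (utilization %, memory fraction).
class GpuSelector:
    def __init__(self, devices, memory_threshold=0.85):
        self.devices = devices
        self.memory_threshold = memory_threshold
        self._rr = itertools.cycle(devices)

    def round_robin(self):
        return next(self._rr)

    def least_loaded(self, stats):
        eligible = [d for d in self.devices
                    if stats[d][1] < self.memory_threshold]
        # Fall back to all devices if every GPU is above the threshold.
        candidates = eligible or self.devices
        return min(candidates, key=lambda d: stats[d][0])

sel = GpuSelector([0, 1, 2, 3])
print([sel.round_robin() for _ in range(5)])  # [0, 1, 2, 3, 0]
stats = {0: (90, 0.9), 1: (40, 0.5), 2: (20, 0.4), 3: (70, 0.8)}
print(sel.least_loaded(stats))                # 2 (GPU 0 is over the memory threshold)
```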
Intelligent Model Sharding
#!/bin/bash
# model_sharding.sh
MODEL_NAME="llama2:70b"
SHARD_COUNT=4
# Pull the model and enable sharding
ollama pull $MODEL_NAME
export OLLAMA_MODEL_SHARDS=$SHARD_COUNT
export OLLAMA_SHARD_STRATEGY="balanced"
# Launch each shard on a separate GPU
for i in $(seq 0 $((SHARD_COUNT-1))); do
  CUDA_VISIBLE_DEVICES=$i ollama run $MODEL_NAME --shard-id $i &
done
wait
Dynamic GPU Usage Monitoring
# gpu_monitor.py
import datetime
import json
import time

import pynvml

def monitor_gpu_usage():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    while True:
        stats = []
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            stats.append({
                'gpu_id': i,
                'gpu_util': util.gpu,
                'memory_util': mem.used / mem.total * 100,
                'memory_used_mb': mem.used // 1024**2,
                'memory_total_mb': mem.total // 1024**2,
                'temperature': temp,
                'timestamp': datetime.datetime.now().isoformat()
            })
        print(json.dumps(stats, indent=2))
        balance_gpus(stats)
        time.sleep(5)

def balance_gpus(stats):
    avg = sum(s['gpu_util'] for s in stats) / len(stats)
    for s in stats:
        if s['gpu_util'] > avg * 1.2:
            print(f"GPU {s['gpu_id']} overload: {s['gpu_util']}%")
        elif s['gpu_util'] < avg * 0.5:
            print(f"GPU {s['gpu_id']} underload: {s['gpu_util']}%")

if __name__ == "__main__":
    monitor_gpu_usage()
Troubleshooting
Common Issues and Solutions
Insufficient GPU memory – Check usage with nvidia-smi --query-gpu=memory.used,memory.total --format=csv, then reduce the number of GPU-resident layers (e.g., export OLLAMA_GPU_LAYERS=20) or enable CPU fallback (export OLLAMA_CPU_FALLBACK=true).
Load imbalance – Inspect model placement with ollama ps, stop all running models (ollama stop --all), and restart the server in strict load-balancing mode (ollama serve --load-balance-mode=strict).
High inter-GPU communication latency – Inspect the GPU topology with nvidia-smi topo -m and enable peer-to-peer (P2P) transfers if the interconnect supports them.
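The first remedy (lowering OLLAMA_GPU_LAYERS when memory is tight) can be estimated rather than guessed. A hypothetical heuristic, with an assumed per-layer memory cost (the 550 MB figure is illustrative, not measured):

```python
# Hypothetical heuristic: scale the GPU layer count down to what fits in
# the VRAM currently free. required_mb_per_layer is an assumed estimate.
def suggest_gpu_layers(current_layers, free_mb, required_mb_per_layer=550):
    """Return how many layers fit in free VRAM, capped at current_layers."""
    fits = free_mb // required_mb_per_layer
    return max(1, min(current_layers, fits))

# 35 layers configured, but only ~11 GB free at ~550 MB per layer:
print(suggest_gpu_layers(35, 11_000))  # 20
```

Feeding the result back into export OLLAMA_GPU_LAYERS keeps the model on-GPU instead of triggering out-of-memory errors.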
Alert and Monitoring Script
#!/bin/bash
# gpu_alert.sh
HIGH_UTIL_THRESHOLD=90
LOW_UTIL_THRESHOLD=10
TEMP_THRESHOLD=80
while true; do
  nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu --format=csv,noheader,nounits |
  while IFS=', ' read -r gpu_id util temp; do
    if (( util > HIGH_UTIL_THRESHOLD )); then
      echo "ALERT: GPU $gpu_id usage high: $util%"
    elif (( util < LOW_UTIL_THRESHOLD )); then
      echo "WARNING: GPU $gpu_id usage low: $util%"
    fi
    if (( temp > TEMP_THRESHOLD )); then
      echo "CRITICAL: GPU $gpu_id temperature high: $temp°C"
    fi
  done
  sleep 30
done
Production Best Practices
Containerized Deployment Architecture
# production-docker-compose.yml
version: '3.8'
services:
  ollama-lb:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - ollama-node-1
      - ollama-node-2
  ollama-node-1:
    image: ollama/ollama:latest
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - OLLAMA_GPU_LAYERS=35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
  ollama-node-2:
    image: ollama/ollama:latest
    environment:
      - CUDA_VISIBLE_DEVICES=2,3
      - OLLAMA_GPU_LAYERS=35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']
              capabilities: [gpu]
Automated Deployment Script
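The nginx.conf mounted by the ollama-lb service is not shown above; a minimal sketch might look like the following (the upstream hosts match the compose service names, and least_conn plus the long read timeout are reasonable starting points to tune, not production-hardened values):

```nginx
events {}
http {
    upstream ollama_backend {
        least_conn;                  # route to the node with fewest active requests
        server ollama-node-1:11434;
        server ollama-node-2:11434;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://ollama_backend;
            proxy_http_version 1.1;
            proxy_read_timeout 300s; # allow long-running generations
        }
    }
}
```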
#!/bin/bash
# auto_deploy.sh
set -e

check_prerequisites() {
  echo "Checking CUDA environment..."
  nvidia-smi >/dev/null || { echo "CUDA missing"; exit 1; }
  echo "Checking Docker..."
  docker --version >/dev/null || { echo "Docker missing"; exit 1; }
  GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
  echo "Detected $GPU_COUNT GPUs"
}

benchmark_performance() {
  echo "Running baseline benchmark..."
  docker run --rm --gpus all ollama/ollama:latest ollama run llama2:7b "Hello world" >/dev/null
  for i in $(seq 0 $((GPU_COUNT-1))); do
    echo "Testing GPU $i..."
    CUDA_VISIBLE_DEVICES=$i docker run --rm --gpus device=$i ollama/ollama:latest ollama run llama2:7b "Test GPU $i"
  done
}

main() {
  check_prerequisites
  benchmark_performance
  echo "Deploying multi-GPU OLLAMA cluster..."
  docker-compose -f production-docker-compose.yml up -d
  echo "Waiting for service to start..."
  sleep 30
  curl -f http://localhost/api/tags || { echo "Service start failed"; exit 1; }
  echo "Deployment complete!"
}

main "$@"
Monitoring with Prometheus and Grafana
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['localhost:9400']
    scrape_interval: 5s
  - job_name: 'ollama-metrics'
    static_configs:
      - targets: ['localhost:11434']
    metrics_path: '/metrics'
The nvidia-gpu job scrapes an NVIDIA DCGM exporter on its default port 9400. The Grafana dashboard JSON (trimmed) defines panels for GPU utilization and memory usage.
Case Study: 4‑GPU RTX 4090 Cluster Optimization
Hardware : 4 × RTX 4090 (24 GB VRAM each), AMD Threadripper 3970X, 128 GB DDR4 RAM, NVMe SSD.
Baseline performance : inference latency 2.3 s, concurrency 4 req/s, GPU utilization 65 %.
Optimized environment variables :
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OLLAMA_GPU_LAYERS=40
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_LOADED_MODELS=4
export OLLAMA_BATCH_SIZE=6
export OLLAMA_GPU_MEMORY_FRACTION=0.9
export OLLAMA_TENSOR_PARALLEL_SIZE=4
After optimization : inference latency 0.8 s (−65 %), concurrency 12 req/s (+200 %), GPU utilization 92 % (+27 points).
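Those deltas follow directly from the raw numbers; a quick arithmetic check:

```python
# Sanity-check the case-study deltas from the baseline and optimized figures.
def pct_change(before, after):
    """Relative change in percent."""
    return (after - before) / before * 100

print(round(pct_change(2.3, 0.8)))  # -65: latency down 65%
print(round(pct_change(4, 12)))     # 200: throughput up 200%
print(92 - 65)                      # 27: utilization up 27 percentage points
```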
Conclusion
Following the procedures above enables reliable multi‑GPU OLLAMA deployments, intelligent load distribution, comprehensive monitoring, and measurable performance improvements. Future work may include CI/CD pipeline integration, Kubernetes orchestration, and connection to AI model‑management platforms.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.