Master Multi‑GPU Load Balancing for OLLAMA: From Zero to Production

This guide walks you through configuring OLLAMA for multi‑GPU load balancing, covering hardware checks, CUDA setup, native and Docker deployment methods, detailed parameter tuning, advanced sharding strategies, troubleshooting, performance optimization, and production‑grade monitoring to maximize throughput and stability of large language models.


TL;DR: This article dives into configuring OLLAMA for multi‑GPU environments, achieving GPU load balancing and sharing production‑grade best practices.

Why is multi‑GPU load balancing important?

In AI model inference and training, a single GPU often cannot meet large‑scale performance demands. Proper multi‑GPU load balancing can:

Increase overall throughput by 2‑4×

Reduce single‑request inference latency by 30‑50%

Improve resource utilization by avoiding idle GPUs

Enhance system stability by spreading compute load
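As a rough sanity check, the 2‑4× throughput figure follows from a simple scaling model. A sketch in Python (the 85% parallel efficiency below is an assumed figure, not a measurement; profile your own cluster):

```python
def estimated_throughput(single_gpu_rps: float, gpu_count: int,
                         efficiency: float = 0.85) -> float:
    """Estimate aggregate requests/sec across gpu_count GPUs.

    `efficiency` models communication and scheduling overhead
    (an assumed value, not measured).
    """
    return single_gpu_rps * gpu_count * efficiency

# A 1 req/s single-GPU baseline over 4 GPUs at 85% efficiency
# scales to 3.4 req/s -- within the 2-4x range quoted above.
print(estimated_throughput(1.0, 4))  # 3.4
```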

Environment preparation: building a solid foundation

Hardware requirements check

# Check GPU information
nvidia-smi
lspci | grep -i nvidia

# Check CUDA version compatibility
nvcc --version
cat /usr/local/cuda/version.json   # CUDA 11.1+; older toolkits ship version.txt

Software environment configuration

# Install required CUDA toolkit
sudo apt update
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit

# Verify CUDA installation
nvidia-smi
nvcc --version

# Install Docker and NVIDIA Container Toolkit (recommended)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Configure NVIDIA Container Runtime
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/${distribution}/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

OLLAMA multi‑GPU configuration details

Method 1: Native multi‑GPU configuration

# Install OLLAMA
curl -fsSL https://ollama.ai/install.sh | sh

# Set environment variables for multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
export OLLAMA_GPU_LAYERS=35
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_MAX_LOADED_MODELS=2

# Start OLLAMA service
ollama serve
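Once the service is up, clients talk to it over the HTTP API on port 11434. A minimal sketch of a non‑streaming request against the /api/generate endpoint (the model name is just an example):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address of `ollama serve`

def build_generate_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()

def generate(model: str, prompt: str) -> str:
    """Send one non-streaming completion request and return the text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
#   print(generate("llama2:7b", "Why is the sky blue?"))
```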

Method 2: Docker container deployment (recommended for production)

Create docker-compose.yml:

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama-multi-gpu
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - CUDA_VISIBLE_DEVICES=0,1,2,3
      - OLLAMA_GPU_LAYERS=35
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_KEEP_ALIVE=24h
    volumes:
      - ./ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      test: ["CMD","curl","-f","http://localhost:11434/api/tags"]
      interval: 30s
      timeout: 10s
      retries: 3

Start the service:

# Launch multi‑GPU OLLAMA service
docker-compose up -d

# View service logs
docker-compose logs -f ollama
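A readiness check can be scripted against the same /api/tags endpoint the healthcheck uses. A sketch, assuming the standard `{"models": [{"name": ...}]}` response shape:

```python
import json
import time
import urllib.error
import urllib.request

def parse_model_names(tags_json: str) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

def wait_for_ollama(url: str = "http://localhost:11434",
                    timeout: float = 120.0) -> list[str]:
    """Poll /api/tags until the container answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/api/tags", timeout=5) as resp:
                return parse_model_names(resp.read().decode())
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # container still starting; retry
    raise TimeoutError("ollama did not become healthy in time")
```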

Core configuration parameters deep dive

GPU memory management strategy

# Precise GPU memory allocation
export OLLAMA_GPU_MEMORY_FRACTION=0.8   # Use 80% of GPU memory
export OLLAMA_GPU_SPLIT_MODE=layer    # Split model by layers

# Dynamic memory management
export OLLAMA_DYNAMIC_GPU=true
export OLLAMA_GPU_MEMORY_POOL=true

Load balancing algorithm configuration

# Create load balancing config file load_balance_config.py
import json
import os

config = {
    "gpu_allocation": {
        "strategy": "round_robin",  # round_robin, least_loaded, manual
        "devices": [0, 1, 2, 3],
        "weights": [1.0, 1.0, 1.0, 1.0],
        "memory_threshold": 0.85
    },
    "model_sharding": {
        "enabled": True,
        "shard_size": "auto",
        "overlap_ratio": 0.1
    },
    "performance": {
        "batch_size": 4,
        "max_concurrent_requests": 16,
        "tensor_parallel_size": 4
    }
}

os.makedirs('/etc/ollama', exist_ok=True)  # ensure the target directory exists
with open('/etc/ollama/load_balance.json', 'w') as f:
    json.dump(config, f, indent=2)
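The round_robin and least_loaded strategies named in the config reduce to small scheduling functions. A sketch (the helper names are illustrative, not part of OLLAMA):

```python
from itertools import cycle

def make_round_robin(devices):
    """Return a callable that hands out GPU ids in fixed rotation."""
    it = cycle(devices)
    return lambda: next(it)

def least_loaded(loads, memory_threshold=0.85):
    """Pick the GPU with the lowest load, skipping any at or above
    the memory threshold from the config above."""
    eligible = {gpu: load for gpu, load in loads.items() if load < memory_threshold}
    if not eligible:
        raise RuntimeError("all GPUs above memory threshold")
    return min(eligible, key=eligible.get)

pick = make_round_robin([0, 1, 2, 3])
print([pick() for _ in range(6)])                      # [0, 1, 2, 3, 0, 1]
print(least_loaded({0: 0.9, 1: 0.4, 2: 0.6, 3: 0.7}))  # 1 (GPU 0 over threshold)
```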

Advanced load balancing strategies

1. Intelligent sharding deployment

# Create model sharding script
cat > model_sharding.sh <<'EOF'
#!/bin/bash
MODEL_NAME="llama2:70b"
SHARD_COUNT=4

# Pull and shard model
ollama pull $MODEL_NAME

# Set sharding parameters
export OLLAMA_MODEL_SHARDS=$SHARD_COUNT
export OLLAMA_SHARD_STRATEGY="balanced"

# Distribute to different GPUs
for i in $(seq 0 $((SHARD_COUNT-1))); do
  CUDA_VISIBLE_DEVICES=$i ollama run $MODEL_NAME --shard-id $i &
done
wait
EOF
chmod +x model_sharding.sh
./model_sharding.sh
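The "balanced" strategy amounts to splitting the model's layers into near‑equal contiguous ranges, one per GPU. A sketch (the layer counts are illustrative):

```python
def shard_layers(num_layers: int, shard_count: int) -> list[range]:
    """Split num_layers model layers into shard_count contiguous,
    near-equal ranges (the "balanced" strategy)."""
    base, extra = divmod(num_layers, shard_count)
    shards, start = [], 0
    for i in range(shard_count):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        shards.append(range(start, start + size))
        start += size
    return shards

# An 80-layer model over 4 GPUs:
print([len(s) for s in shard_layers(80, 4)])  # [20, 20, 20, 20]
print([len(s) for s in shard_layers(81, 4)])  # [21, 20, 20, 20]
```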

2. Dynamic load monitoring

# GPU monitoring script gpu_monitor.py
import pynvml, time, json
from datetime import datetime

def monitor_gpu_usage():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    while True:
        gpu_stats = []
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            gpu_stats.append({
                'gpu_id': i,
                'gpu_util': util.gpu,
                'memory_util': (mem_info.used / mem_info.total) * 100,
                'memory_used_mb': mem_info.used // 1024**2,
                'memory_total_mb': mem_info.total // 1024**2,
                'temperature': temp,
                'timestamp': datetime.now().isoformat()
            })
        print(json.dumps(gpu_stats, indent=2))
        avg_util = sum(stat['gpu_util'] for stat in gpu_stats) / len(gpu_stats)
        for stat in gpu_stats:
            if stat['gpu_util'] > avg_util * 1.2:
                print(f"GPU {stat['gpu_id']} overload: {stat['gpu_util']}%")
            elif stat['gpu_util'] < avg_util * 0.5:
                print(f"GPU {stat['gpu_id']} underutilized: {stat['gpu_util']}%")
        time.sleep(5)

if __name__ == "__main__":
    monitor_gpu_usage()
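The overload/underutilization thresholds in the monitor are easier to verify when factored into a pure function, using the same 1.2× and 0.5× factors relative to the average:

```python
def classify_gpus(utils, high=1.2, low=0.5):
    """Return (overloaded, underutilized) GPU indices relative to
    the average utilization, mirroring the monitor's thresholds."""
    avg = sum(utils) / len(utils)
    overloaded = [i for i, u in enumerate(utils) if u > avg * high]
    underused = [i for i, u in enumerate(utils) if u < avg * low]
    return overloaded, underused

# Average is 55%: GPU 0 exceeds 66% (1.2x), GPU 3 is below 27.5% (0.5x).
print(classify_gpus([95, 60, 55, 10]))  # ([0], [3])
```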

Fault troubleshooting guide

Common issues and solutions

Problem 1: GPU memory insufficient

# Check GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Solution: adjust model sharding
export OLLAMA_GPU_LAYERS=20   # Reduce GPU layers
export OLLAMA_CPU_FALLBACK=true
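How far to reduce the GPU layer count comes down to how many layers fit in VRAM. A back‑of‑the‑envelope helper (the per‑layer size and reserve below are assumed figures; measure your own model and quantization):

```python
def layers_that_fit(vram_gb: float, layer_gb: float, reserve_gb: float = 2.0) -> int:
    """How many model layers fit on one GPU, keeping reserve_gb free
    for the KV cache and CUDA context (assumed sizes)."""
    return max(0, int((vram_gb - reserve_gb) // layer_gb))

# A 24 GB card with ~1 GB per quantized layer and 2 GB reserved:
print(layers_that_fit(24, 1.0))  # 22
```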

Problem 2: Load imbalance

# Force re‑allocation of load
ollama ps                     # view current model distribution
export OLLAMA_SCHED_SPREAD=1  # spread a model across all GPUs instead of packing
ollama serve

Problem 3: High communication latency

# Check inter‑GPU communication
nvidia-smi topo -m

# Reduce GPU re‑initialization latency (NVreg module parameters must be set
# at driver load time via modprobe options, not written through sysfs)
sudo nvidia-smi -pm 1   # enable persistence mode

Monitoring alert configuration

# Create GPU alert script
cat > gpu_alert.sh <<'EOF'
#!/bin/bash
HIGH_UTIL_THRESHOLD=90
LOW_UTIL_THRESHOLD=10
TEMP_THRESHOLD=80

while true; do
  nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu --format=csv,noheader,nounits |
  while IFS=, read -r gpu_id util temp; do
    if (( util > HIGH_UTIL_THRESHOLD )); then
      echo "ALERT: GPU $gpu_id utilization high: ${util}%"
      curl -X POST "https://your-webhook-url" -d "GPU $gpu_id overloaded: ${util}%"
    fi
    if (( util < LOW_UTIL_THRESHOLD )); then
      echo "WARNING: GPU $gpu_id utilization low: ${util}%"
    fi
    if (( temp > TEMP_THRESHOLD )); then
      echo "CRITICAL: GPU $gpu_id temperature high: ${temp}°C"
    fi
  done
  sleep 30
done
EOF
chmod +x gpu_alert.sh
nohup ./gpu_alert.sh &

Production best practices

1. Containerized deployment architecture

# production-docker-compose.yml
version: '3.8'
services:
  ollama-lb:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - ollama-node-1
      - ollama-node-2

  ollama-node-1:
    image: ollama/ollama:latest
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - OLLAMA_GPU_LAYERS=35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0','1']
              capabilities: [gpu]

  ollama-node-2:
    image: ollama/ollama:latest
    environment:
      - CUDA_VISIBLE_DEVICES=2,3
      - OLLAMA_GPU_LAYERS=35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2','3']
              capabilities: [gpu]
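The compose file mounts ./nginx.conf but the article does not show it. A minimal sketch, assuming the service names above and simple least‑connections balancing:

```nginx
# nginx.conf (sketch; service names match the compose file above)
events {}
http {
  upstream ollama_backend {
    least_conn;                 # route each request to the least-busy node
    server ollama-node-1:11434;
    server ollama-node-2:11434;
  }
  server {
    listen 80;
    location / {
      proxy_pass http://ollama_backend;
      proxy_read_timeout 300s;  # allow long-running generations
    }
  }
}
```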

2. Automated operation scripts

# Auto‑deployment script
cat > auto_deploy.sh <<'EOF'
#!/bin/bash
set -e

check_prerequisites() {
  echo "Checking CUDA environment..."
  nvidia-smi >/dev/null || { echo "CUDA environment error"; exit 1; }
  echo "Checking Docker environment..."
  docker --version >/dev/null || { echo "Docker not installed"; exit 1; }
  GPU_COUNT=$(nvidia-smi --list-gpus | wc -l)
  echo "Detected $GPU_COUNT GPUs"
}

benchmark_performance() {
  echo "Running performance benchmark..."
  docker run --rm --gpus all ollama/ollama:latest ollama run llama2:7b "Hello world" >/dev/null
  for i in $(seq 0 $((GPU_COUNT-1))); do
    echo "Testing GPU $i..."
    CUDA_VISIBLE_DEVICES=$i docker run --rm --gpus device=$i ollama/ollama:latest ollama run llama2:7b "Test GPU $i"
  done
}

main() {
  check_prerequisites
  benchmark_performance
  echo "Deploying multi‑GPU OLLAMA cluster..."
  docker-compose -f production-docker-compose.yml up -d
  echo "Waiting for services..."
  sleep 30
  curl -f http://localhost/api/tags || { echo "Service start failed"; exit 1; }
  echo "Deployment complete!"
}

main "$@"
EOF
chmod +x auto_deploy.sh

Performance tuning case study

Case: 4‑card RTX 4090 cluster optimization

Hardware configuration:

4 × RTX 4090 (24 GB VRAM each)

AMD Threadripper 3970X

128 GB DDR4 RAM

NVMe SSD storage

Pre‑optimization performance:

Single inference latency: 2.3 s

Concurrent throughput: 4 requests/s

GPU utilization: 65%

Optimization configuration:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export OLLAMA_GPU_LAYERS=40
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_LOADED_MODELS=4
export OLLAMA_BATCH_SIZE=6
export OLLAMA_GPU_MEMORY_FRACTION=0.9
export OLLAMA_TENSOR_PARALLEL_SIZE=4

Post‑optimization performance:

Single inference latency: 0.8 s (−65%)

Concurrent throughput: 12 requests/s (+200%)

GPU utilization: 92% (+27 percentage points)
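The before/after deltas above can be checked directly:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change in percent."""
    return (after - before) / before * 100

# Numbers from the case study:
print(round(pct_change(2.3, 0.8)))  # -65  (latency)
print(round(pct_change(4, 12)))     # 200  (throughput)
print(92 - 65)                      # 27   (utilization, percentage points)
```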

Monitoring and operations automation

Prometheus monitoring configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['localhost:9400']
    scrape_interval: 5s

  - job_name: 'ollama-metrics'
    static_configs:
      - targets: ['localhost:11434']
    metrics_path: '/metrics'

Grafana dashboard JSON (excerpt)

{
  "dashboard": {
    "title": "OLLAMA Multi‑GPU Monitoring",
    "panels": [
      {
        "title": "GPU Utilization",
        "type": "graph",
        "targets": [{ "expr": "nvidia_gpu_utilization_gpu" }]
      },
      {
        "title": "GPU Memory Usage",
        "type": "graph",
        "targets": [{ "expr": "nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes * 100" }]
      }
    ]
  }
}

Summary and outlook

By following this comprehensive configuration and optimization guide, you should be able to:

Master multi‑GPU environment setup from hardware checks to software configuration.

Implement intelligent load balancing to maximize GPU resource utilization.

Establish real‑time monitoring and alerting for system health.

Apply proven production‑grade performance tweaks for large language models.

Next steps include tailoring parameters to specific workloads, building CI/CD pipelines, exploring Kubernetes orchestration, and integrating AI model management platforms.

Tags: Load Balancing, Performance Tuning, CUDA, AI Deployment, Ollama, Multi‑GPU
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
