
Deploying nvidia-docker2 for GPU Workloads on Large‑Scale Kubernetes Clusters

This article details the practical steps to install nvidia-docker2, configure Docker’s runtime, enable GPU support via Kubernetes device plugins, and verify GPU scheduling on a large Kubernetes cluster, providing code snippets and best‑practice recommendations for production environments.

360 Tech Engineering

nvidia-docker2 makes it possible to containerize legacy GPU‑accelerated applications, allocate specific GPUs to individual containers, and run the same workloads across environments. This article records hands‑on experience with nvidia-docker2 in a large‑scale Kubernetes cluster.

1. Experimental Environment

CentOS Linux release 7.2.1511 (Core)

Kubernetes 1.9

GPU: NVIDIA Tesla K80

2. Installation (version 2.0)

Follow the official installation guide:

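# Remove nvidia-docker 1.0 together with any containers that use its volumes
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

# Add the nvidia-docker package repository for this distribution
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
    sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install nvidia-docker2 and have dockerd reload its configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Smoke test: run nvidia-smi in a CUDA container through the nvidia runtime
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi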

Prerequisites: a Linux kernel newer than 3.10, Docker 1.12 or later, an NVIDIA GPU with an architecture newer than Fermi, and NVIDIA driver 361.93 or later.
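
The prerequisites above can be checked on a node before installing. The following is a minimal sketch and not part of the original write-up: version_ge and check_host are hypothetical helper names, the kernel check is relaxed to "3.10 or later" because CentOS 7 ships a 3.10.0 kernel, and check_host assumes docker and nvidia-smi are already on the PATH.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted numeric version A is >= B.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}
        y=${b%%.*}
        # Drop the field just read; clear the string when it was the last one.
        if [ "$a" = "$x" ]; then a=""; else a=${a#*.}; fi
        if [ "$b" = "$y" ]; then b=""; else b=${b#*.}; fi
        # A missing field counts as 0, so 3.10 and 3.10.0 compare equal.
        x=${x:-0}
        y=${y:-0}
        [ "$x" -gt "$y" ] && return 0
        [ "$x" -lt "$y" ] && return 1
    done
    return 0
}

# check_host: compare this machine against the minimums listed above.
# Suffixes such as "-327.el7.x86_64" or "-ce" are stripped before comparing.
check_host() {
    kernel=$(uname -r)
    kernel=${kernel%%-*}
    docker_ver=$(docker version --format '{{.Server.Version}}')
    docker_ver=${docker_ver%%-*}
    driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)

    version_ge "$kernel" 3.10 || echo "kernel $kernel is older than 3.10"
    version_ge "$docker_ver" 1.12 || echo "Docker $docker_ver is older than 1.12"
    version_ge "$driver_ver" 361.93 || echo "driver $driver_ver is older than 361.93"
}
```

Sourcing the file and calling check_host prints one line per unmet requirement and nothing when all three pass. The GPU architecture is not checked here.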

After installation, configure Docker’s runtime by editing /etc/docker/daemon.json:

{
  "default-runtime":"nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
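
One way to apply this configuration without losing the previous file is sketched below. It is an illustration only: write_daemon_json is a hypothetical helper, and it assumes the node has no other Docker daemon settings, since it replaces the whole file. If other settings exist, merge them by hand instead.

```shell
#!/bin/sh
# write_daemon_json PATH: write the runtime configuration shown above to PATH,
# keeping a copy of any existing file as PATH.bak.
write_daemon_json() {
    target=$1
    if [ -f "$target" ]; then
        cp -p "$target" "$target.bak"
    fi
    cat > "$target" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
}
```

Run as root, write_daemon_json /etc/docker/daemon.json leaves the earlier settings in /etc/docker/daemon.json.bak, so they can be restored if Docker fails to start.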

Restart Docker:

systemctl restart docker

3. GPU on Kubernetes

Kubernetes has supported NVIDIA GPUs since v1.6 and AMD GPUs since v1.9, but containers still cannot share a single GPU: each GPU is allocated to at most one container. To schedule GPUs, enable device plugins (before v1.10 this means setting the kubelet feature gate DevicePlugins=true) and install the GPU driver and the device plugin on every GPU node.
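
On a v1.9 node it is worth confirming that the kubelet was started with the feature gate. The sketch below is one way to do that, under assumptions: has_device_plugins_gate and check_kubelet are hypothetical helpers, and the gate is assumed to be passed as a --feature-gates command-line flag. Where that flag is configured depends on how the kubelet was installed, which the article does not cover.

```shell
#!/bin/sh
# has_device_plugins_gate CMDLINE: succeed when a kubelet command line
# turns the DevicePlugins feature gate on.
has_device_plugins_gate() {
    case "$1" in
        *--feature-gates=*DevicePlugins=true*) return 0 ;;
        *"--feature-gates "*DevicePlugins=true*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_kubelet: inspect the kubelet process running on this node.
check_kubelet() {
    if has_device_plugins_gate "$(ps -o args= -C kubelet)"; then
        echo "DevicePlugins gate is enabled"
    else
        echo "add --feature-gates=DevicePlugins=true to the kubelet flags and restart it"
    fi
}
```

Calling check_kubelet on each GPU node reports whether that node's kubelet has the gate enabled.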

Deploy the NVIDIA device plugin using a DaemonSet:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Create the DaemonSet:

kubectl create -f nvidia-docker-plugin.yml

Verify that the nvidia-device-plugin-daemonset pods are running on each GPU node.
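
The verification can be scripted roughly as follows. This is a sketch rather than the author's procedure: nodes_without_gpu and check_cluster are hypothetical helpers, the pod selector reuses the name=nvidia-device-plugin-ds label from the DaemonSet above, and the custom-columns expression assumes a kubectl that accepts the escaped dot in nvidia\.com/gpu.

```shell
#!/bin/sh
# nodes_without_gpu: read a "NAME GPU" table on stdin and print the nodes
# whose GPU column is empty ("<none>") or zero.
nodes_without_gpu() {
    awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# check_cluster: list the plugin pods, then print every node that does not
# advertise any nvidia.com/gpu resource.
check_cluster() {
    kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
        nodes_without_gpu
}
```

CPU-only nodes will appear in the second list as well, so only GPU nodes that show up there point to a driver, runtime, or plugin problem.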

4. Test GPU Scheduling

Deploy a test pod that requests one GPU:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80
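
The nodeSelector only matches nodes that carry the accelerator=nvidia-tesla-k80 label, and the article does not show how its nodes were labeled. Assuming the label is applied by hand, a sketch could look like this (label_gpu_node and list_gpu_nodes are hypothetical helper names, and the node name is supplied by the caller):

```shell
#!/bin/sh
# label_gpu_node NODE: attach the label that the pod's nodeSelector expects.
label_gpu_node() {
    kubectl label nodes "$1" accelerator=nvidia-tesla-k80
}

# list_gpu_nodes: show the nodes the test pod is allowed to land on.
list_gpu_nodes() {
    kubectl get nodes -l accelerator=nvidia-tesla-k80
}
```

If list_gpu_nodes returns no nodes, the test pod stays in the Pending state.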

Create the pod:

kubectl create -f nvidia-docker2-gpu-pod.yml

Enter the container and run nvidia-smi to confirm that the GPU is visible and that only one GPU is allocated.
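
The nvidia-smi check can also be run non-interactively. The following sketch makes some assumptions: count_gpus and check_pod_gpus are hypothetical helpers, and the container must still be running when kubectl exec is called, which may not hold for a short-lived sample that exits as soon as it finishes.

```shell
#!/bin/sh
# count_gpus: count the devices listed in `nvidia-smi -L` output on stdin.
# Each device appears on its own line, starting with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9]'
}

# check_pod_gpus POD: run nvidia-smi inside POD and expect exactly one GPU.
check_pod_gpus() {
    n=$(kubectl exec "$1" -- nvidia-smi -L | count_gpus)
    if [ "$n" -eq 1 ]; then
        echo "ok: 1 GPU visible in $1"
    else
        echo "unexpected: $n GPUs visible in $1"
        return 1
    fi
}
```

For example, check_pod_gpus cuda-vector-add succeeds only when the container sees exactly the one GPU requested in its resource limits.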

5. Summary

With nvidia-docker 1.0, GPU drivers had to be mounted into containers manually as volumes, which is error‑prone. With nvidia-docker 2.0, the nvidia runtime makes the driver available inside the container and the Kubernetes device plugin advertises GPUs as a schedulable resource, so manual driver mounting is no longer needed and GPU resource management becomes simpler. The device‑plugin interface is generic, so Kubernetes can expose other hardware resources in the same way, which shows its extensibility.
