Deploying nvidia-docker2 for GPU Workloads on Large‑Scale Kubernetes Clusters
This article details the practical steps to install nvidia-docker2, configure Docker’s runtime, enable GPU support via Kubernetes device plugins, and verify GPU scheduling on a large Kubernetes cluster, providing code snippets and best‑practice recommendations for production environments.
nvidia-docker2 helps containerize legacy GPU‑accelerated applications, allocate specific GPU resources to containers, and share workloads across environments. The author records a hands‑on experience of using nvidia-docker2 in a large‑scale Kubernetes cluster.
1. Experimental Environment
CentOS Linux release 7.2.1511 (Core)
Kubernetes 1.9
GPU: NVIDIA Tesla K80
2. Installation (version 2.0)
Follow the official installation guide:
docker volume ls -q -f driver=nvidia-docker | \
  xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Prerequisites include a Linux kernel newer than 3.10, Docker 1.12 or later, an NVIDIA GPU with an architecture newer than Fermi (2.1), and NVIDIA drivers 361.93 or later.
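A quick way to check these prerequisites on a node before installing (the nvidia-smi query flags assume a reasonably recent driver):

```shell
# Kernel version: must be newer than 3.10
uname -r

# Docker server version: must be 1.12 or later
docker version --format '{{.Server.Version}}'

# Installed NVIDIA driver version and GPU model: driver should be 361.93 or later
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
```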
After installation, configure Docker's default runtime by editing /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart Docker:
systemctl restart docker
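After the restart you can confirm that the nvidia runtime was registered and picked up as the default (the exact wording of docker info output varies between Docker versions):

```shell
# Should list "nvidia" among the runtimes and show it as the default runtime
docker info 2>/dev/null | grep -i runtime
```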
3. GPU on Kubernetes
Kubernetes has supported scheduling NVIDIA GPUs since v1.6 and AMD GPUs since v1.9, but a single GPU still cannot be shared by multiple containers: GPUs are requested in whole units. To schedule GPUs you must enable the DevicePlugins feature gate (required before v1.10, on by default afterwards) and install the GPU drivers and the device plugin on every GPU node.
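On clusters older than v1.10, enabling the feature gate is a kubelet flag on each GPU node; a sketch (how kubelet arguments are passed varies by install method, so the variable name and restart commands below are assumptions):

```shell
# Append the feature gate to the kubelet's arguments (e.g. in a systemd drop-in
# or the node's kubelet environment file; location depends on your distribution)
KUBELET_EXTRA_ARGS="--feature-gates=DevicePlugins=true"

# Reload units and restart the kubelet so the flag takes effect
systemctl daemon-reload
systemctl restart kubelet
```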
Deploy the NVIDIA device plugin using a DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
Create the DaemonSet:
kubectl create -f nvidia-docker-plugin.yml
Verify that the nvidia-device-plugin-daemonset pods are running on each GPU node.
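One way to verify, using the label from the DaemonSet manifest above:

```shell
# Device plugin pods should be Running, one per GPU node
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide

# Each GPU node should now advertise nvidia.com/gpu in its capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"
```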
4. Test GPU Scheduling
Deploy a test pod that requests one GPU:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80
Create the pod:
kubectl create -f nvidia-docker2-gpu-pod.yml
Enter the container and run nvidia-smi to confirm the GPU is visible and only one GPU is allocated.
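For example (pod name taken from the manifest above; note the sample workload runs to completion quickly, so exec only works while the pod is still running):

```shell
# While the pod is running: exactly one GPU should be visible inside the container
kubectl exec -it cuda-vector-add -- nvidia-smi

# After completion: the sample image logs "Test PASSED" if the vector
# addition executed on the GPU
kubectl logs cuda-vector-add
```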
5. Summary
Using nvidia-docker 1.0 required manually mounting the GPU driver libraries into containers as volumes, which was error-prone. nvidia-docker 2.0 replaces this with the nvidia-container-runtime registered in Docker, while on Kubernetes the device plugin handles GPU discovery and allocation, eliminating manual driver mounting and simplifying GPU resource management. The device plugin framework is a generic interface, so other hardware resources can be exposed the same way, showcasing Kubernetes' extensibility.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.