Deploying nvidia-docker2 for GPU Workloads on Large‑Scale Kubernetes Clusters
This article details the practical steps to install nvidia-docker2, configure Docker’s runtime, enable GPU support via Kubernetes device plugins, and verify GPU scheduling on a large Kubernetes cluster, providing code snippets and best‑practice recommendations for production environments.
nvidia-docker2 helps containerize legacy GPU‑accelerated applications, allocate specific GPU resources to containers, and share workloads across environments. This write‑up records hands‑on experience with nvidia-docker2 in a large‑scale Kubernetes cluster.
1. Experimental Environment
CentOS Linux release 7.2.1511 (Core)
Kubernetes 1.9
GPU: NVIDIA Tesla K80
2. Installation (version 2.0)
Follow the official installation guide:
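# If nvidia-docker 1.0 is installed, remove it and all existing GPU containers first
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker
# Add the package repository for this distribution
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Smoke test: run nvidia-smi inside a CUDA base image using the nvidia runtime
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Prerequisites: a Linux kernel newer than 3.10, Docker 1.12 or later, an NVIDIA GPU with an architecture newer than Fermi, and NVIDIA driver 361.93 or newer (older drivers are untested).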
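The prerequisites above can be checked mechanically before installing on many nodes. The following is a minimal sketch and not part of the official guide: `version_ge` and `check_prereqs` are hypothetical helper names, the comparison relies on GNU `sort -V`, and the kernel bound is treated as "3.10 or newer" so that the stock CentOS 7 kernel (3.10.0) passes.

```shell
# version_ge A B: succeeds when version A is equal to or newer than version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_prereqs KERNEL DOCKER DRIVER: prints one line per unmet requirement
# and returns non-zero if any requirement is not met.
check_prereqs() {
    local kernel=$1 docker=$2 driver=$3 rc=0
    version_ge "$kernel" "3.10"   || { echo "kernel $kernel is older than 3.10"; rc=1; }
    version_ge "$docker" "1.12"   || { echo "docker $docker is older than 1.12"; rc=1; }
    version_ge "$driver" "361.93" || { echo "driver $driver is older than 361.93"; rc=1; }
    return $rc
}

# Usage on a node:
#   check_prereqs "$(uname -r | cut -d- -f1)" \
#                 "$(docker version --format '{{.Server.Version}}')" \
#                 "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
```

Keeping the version strings as arguments makes the check easy to run against values collected from many nodes, instead of only the local machine.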
After installation, configure Docker’s runtime by editing /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart Docker:
systemctl restart docker
3. GPU on Kubernetes
Kubernetes has supported NVIDIA GPUs since v1.6 and AMD GPUs since v1.9, but multiple containers still cannot share a single GPU. To schedule GPUs, enable device plugins (before v1.10 this requires starting the kubelet with --feature-gates=DevicePlugins=true) and install the GPU driver and the device plugin on every GPU node.
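Two node-level conditions are easy to miss on a large cluster: nvidia must be Docker's default runtime, and on Kubernetes 1.8 and 1.9 the kubelet must run with the DevicePlugins feature gate. A minimal sketch of a per-node check follows; the helper names are hypothetical, and it assumes the feature gate is passed on the kubelet command line rather than through a configuration file.

```shell
# has_device_plugins_gate ARGS: succeeds when a kubelet command line
# enables the DevicePlugins feature gate.
has_device_plugins_gate() {
    case "$1" in
        *feature-gates*DevicePlugins=true*) return 0 ;;
        *) return 1 ;;
    esac
}

# node_gpu_ready RUNTIME KUBELET_ARGS: prints what is still missing on a node
# and returns non-zero if anything is missing.
node_gpu_ready() {
    local rc=0
    [ "$1" = "nvidia" ] || { echo "docker default runtime is '$1', expected 'nvidia'"; rc=1; }
    has_device_plugins_gate "$2" || { echo "kubelet lacks --feature-gates=DevicePlugins=true"; rc=1; }
    return $rc
}

# Usage on a node:
#   node_gpu_ready "$(docker info --format '{{.DefaultRuntime}}')" \
#                  "$(ps -o args= -C kubelet)"
```

On v1.10 and later the feature gate is enabled by default, so only the runtime check remains relevant there.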
Deploy the NVIDIA device plugin using a DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
Create the DaemonSet:
kubectl create -f nvidia-docker-plugin.yml
Verify that the nvidia-device-plugin-daemonset pods are running on each GPU node.
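Besides checking that the plugin pods are running, it helps to confirm that each GPU node now advertises the nvidia.com/gpu resource. The sketch below is one way to do that; `gpu_nodes` is a hypothetical helper name, and it relies on kubectl printing `<none>` for nodes that lack the field.

```shell
# gpu_nodes: prints "<node> <allocatable GPUs>" for every node that
# advertises the nvidia.com/gpu resource.
gpu_nodes() {
    kubectl get nodes --no-headers \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
        awk '$2 != "<none>" { print $1, $2 }'
}

# Usage:
#   kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide
#   gpu_nodes
```

A GPU node that is missing from the output usually points to a problem on that node, such as the driver, the default runtime, or the kubelet feature gate.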
4. Test GPU Scheduling
Deploy a test pod that requests one GPU:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80
Create the pod:
kubectl create -f nvidia-docker2-gpu-pod.yml
Enter the container and run nvidia-smi to confirm that the GPU is visible and that only one GPU is allocated.
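The pod's nodeSelector only matches nodes labeled accelerator=nvidia-tesla-k80, so the GPU nodes need that label before the pod can be scheduled. The sketch below labels a node and checks the sample's result from its log. The helper names are hypothetical, and the "Test PASSED" string is an assumption based on the output of the standard CUDA vectorAdd sample.

```shell
# label_gpu_node NODE: adds the label that the test pod's nodeSelector expects.
label_gpu_node() {
    kubectl label nodes "$1" accelerator=nvidia-tesla-k80 --overwrite
}

# vector_add_passed: succeeds when the pod log reports a passing test
# (assumed marker: "Test PASSED").
vector_add_passed() {
    kubectl logs cuda-vector-add | grep -q 'Test PASSED'
}

# Usage (the node name is a placeholder):
#   label_gpu_node gpu-node-1
#   kubectl create -f nvidia-docker2-gpu-pod.yml
#   vector_add_passed && echo "GPU scheduling works"
```

If the pod stays Pending, `kubectl describe pod cuda-vector-add` shows whether the label or the GPU resource request could not be satisfied.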
5. Summary
nvidia-docker 1.0 requires manually mounting the GPU driver into containers as volumes, which is error‑prone. With nvidia-docker 2.0, the nvidia container runtime makes the driver and devices available to the container, and the Kubernetes device plugin advertises GPUs to the scheduler, which eliminates manual driver mounting and simplifies GPU resource management. The device‑plugin interface is generic, so Kubernetes can expose other hardware resources the same way, which shows its extensibility.
360 Tech Engineering