Deploying NVIDIA‑Docker 2.0 on Large‑Scale Kubernetes: A Step‑by‑Step Guide
This tutorial walks through installing NVIDIA‑Docker 2.0, configuring Docker's default runtime, deploying the NVIDIA device plugin on a Kubernetes 1.9 cluster, and running a GPU test pod, and highlights the advantages over the legacy nvidia‑docker 1.0 approach.
1. Experiment Environment
CentOS Linux release 7.2.1511 (Core)
Kubernetes: 1.9
GPU: nvidia‑tesla‑k80
2. Installation (version 2.0)
Follow the official installation guide. Prerequisites:
GNU/Linux x86_64 with kernel version > 3.10
Docker >= 1.12
NVIDIA GPU with Architecture > Fermi (2.1)
NVIDIA drivers ~= 361.93 (untested on older versions)
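A quick way to sanity-check these prerequisites on each node is the snippet below; the exact versions reported will of course vary with your environment, and `nvidia-smi` is only present once the NVIDIA driver is installed.

```shell
# Kernel version (must be > 3.10)
uname -r

# Docker server version (must be >= 1.12)
docker version --format '{{.Server.Version}}'

# NVIDIA driver version (~= 361.93) and GPU model
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
```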
<code># Remove existing nvidia-docker 1.0
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload daemon
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
</code>
Configure Docker to use the NVIDIA container runtime by editing /etc/docker/daemon.json:
<code>{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
</code>
Restart Docker so the new default runtime takes effect:
<code>sudo systemctl restart docker</code>
3. GPU on Kubernetes
Kubernetes has supported scheduling onto NVIDIA GPUs since v1.6 and onto AMD GPUs since v1.9. Containers request GPUs as whole units only; fractional requests and sharing a single GPU among multiple containers are not supported.
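With the device plugin installed (section 4), a container asks for GPUs through the nvidia.com/gpu extended resource in its limits; as a config sketch:

```yaml
resources:
  limits:
    nvidia.com/gpu: 2   # request two whole GPUs; fractional values are not allowed
```

Note that for extended resources such as nvidia.com/gpu, the GPU count is specified only in limits; a request, if given, must equal the limit.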
4. Deploying the NVIDIA Device Plugin
Enable GPU support on each node: before v1.10 this requires the DevicePlugins feature gate (--feature-gates=DevicePlugins=true on the kubelet), plus the NVIDIA drivers and nvidia-docker 2.0 set up as above. Then deploy the device plugin with the following DaemonSet manifest:
<code>apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.9
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
</code>
Create the plugin resources:
<code>kubectl create -f nvidia-docker-plugin.yml</code>
5. Test GPU Pod
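Before deploying a workload, it is worth confirming that the plugin pods are running and that the GPUs were registered with the kubelet; a rough check (the node name is a placeholder for one of your GPU nodes) is:

```shell
# Plugin pods should be Running on every GPU node
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide

# Each GPU node should now list nvidia.com/gpu under Capacity and Allocatable
kubectl describe node <gpu-node-name> | grep nvidia.com/gpu
```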
Deploy a test pod that requests one GPU:
<code>apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80
</code>
Run the pod and verify that the GPU device and CUDA libraries are available inside the container.
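One way to verify this is through the pod's status and logs; the cuda-vector-add sample image reports "Test PASSED" when it can reach the GPU:

```shell
# The pod should eventually reach status Completed
kubectl get pod cuda-vector-add

# On success the vector-add sample prints "Test PASSED"
kubectl logs cuda-vector-add
```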
6. Summary
With nvidia‑docker 1.0, GPU drivers and device files had to be mounted into each container manually; nvidia‑docker 2.0 moves this into the container runtime, and the Kubernetes device plugin handles GPU discovery and scheduling, greatly simplifying GPU provisioning. The device‑plugin model and the pluggable container‑runtime interface both demonstrate Kubernetes' extensibility for integrating external resources.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.