
How to Install NVIDIA Docker Plugin and Enable GPU Access in Kubernetes

This guide walks through checking the system environment, installing the NVIDIA Docker plugin, configuring Docker to use the NVIDIA runtime, verifying GPU access with Docker, deploying the NVIDIA device plugin on a Kubernetes cluster, and running GPU‑accelerated workloads in pods.

MaGe Linux Operations

Reference:

Install Docker plugin: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

Ubuntu using Docker to call GPU: https://blog.csdn.net/dw14132124/article/details/140534628

https://www.cnblogs.com/li508q/p/18444582

1. Environment Check

System environment

# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy
# cat /etc/redhat-release 
Rocky Linux release 9.3 (Blue Onyx)

Software environment

# kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16
WARNING: version difference between client (1.30) and server (1.25) exceeds the supported minor version skew of +/-1
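
The warning above comes from comparing the minor versions of the client and the server. As a rough illustration of that comparison, the sketch below computes the skew from two version strings; the function names `minor_of` and `minor_skew` are hypothetical helpers, not part of kubectl.

```shell
# Print the minor version of a Kubernetes version string such as v1.30.2.
minor_of() {
    # Strip a leading "v", then take the second dot-separated field.
    printf '%s\n' "${1#v}" | cut -d. -f2
}

# Print the absolute difference between two minor versions.
minor_skew() {
    client_minor=$(minor_of "$1")
    server_minor=$(minor_of "$2")
    skew=$((client_minor - server_minor))
    if [ "$skew" -lt 0 ]; then
        skew=$((0 - skew))
    fi
    printf '%s\n' "$skew"
}

# Example with the versions shown above:
#   minor_skew v1.30.2 v1.25.16
```

A result greater than 1 corresponds to the situation kubectl warns about; using a client closer to the server version avoids it.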

2. Install the NVIDIA Container Toolkit (NVIDIA Docker Plugin) on a GPU‑enabled host (K8s node)

Set up the repository:

# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Optionally, enable the experimental packages:

# sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the package index:

# sudo apt-get update

Install the toolkit:

# sudo apt-get install -y nvidia-container-toolkit

Configure Docker to use NVIDIA runtime:

# sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.

The command adds a runtimes entry to /etc/docker/daemon.json:

{
    "insecure-registries": ["192.168.3.61"],
    "registry-mirrors": [
        "https://7sl94zzz.mirror.aliyuncs.com",
        "https://hub.atomgit.com",
        "https://docker.awsl9527.cn"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Restart Docker:

# systemctl daemon-reload
# systemctl restart docker
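
After the restart, it can help to confirm that Docker actually registered the runtime. The sketch below is one possible check, assuming `docker info` is available to the current user; the function name `docker_has_nvidia_runtime` is a hypothetical helper.

```shell
# Succeed if Docker lists a runtime named "nvidia".
docker_has_nvidia_runtime() {
    # .Runtimes is printed as a JSON object keyed by runtime name.
    docker info --format '{{json .Runtimes}}' | grep -q '"nvidia"'
}

# Example:
#   docker_has_nvidia_runtime && echo "nvidia runtime registered"
```

If the check fails, re-run the `nvidia-ctk runtime configure` step and restart Docker again.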

3. Verify Docker GPU Access

# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

The output shows detailed GPU information, confirming that the container can access the NVIDIA GPU and that the NVIDIA Container Toolkit is correctly installed.
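
For a scripted check, the number of GPUs visible inside the container can be counted from `nvidia-smi -L`, which prints one line per GPU. This is a minimal sketch; `count_gpus` is a hypothetical helper, and the line pattern it matches is an assumption about the usual `GPU <index>: ...` listing format.

```shell
# Read `nvidia-smi -L` output on stdin and print the number of GPU lines.
count_gpus() {
    # grep -c exits non-zero when the count is 0, so keep the status clean.
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Example:
#   docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi -L | count_gpus
```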

4. Deploy NVIDIA Device Plugin in a Kubernetes Cluster

On the master node, install the plugin:

# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

Plugin manifest (nvidia-device-plugin.yml):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

After the DaemonSet is created, check the logs of the plugin pod (replace xxxx with the actual pod name suffix):

# kubectl logs -f nvidia-device-plugin-daemonset-xxxx -n kube-system

If the node has no GPU or the NVIDIA toolkit is not configured, the plugin will report errors such as “Incompatible strategy detected auto” and suggest checking the prerequisites.
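
Besides reading the logs, you can check whether a node advertises the `nvidia.com/gpu` resource once the plugin has registered. The sketch below queries the node's allocatable resources with a JSONPath expression; `node_gpu_count` is a hypothetical helper.

```shell
# Print how many nvidia.com/gpu devices a node advertises (0 if none).
node_gpu_count() {
    # The dot inside the resource name must be escaped in JSONPath.
    count=$(kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
    printf '%s\n' "${count:-0}"
}

# Example:
#   node_gpu_count aiserver003087
```

A result of 0 on a GPU node usually points back to the prerequisites that the plugin log mentions.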

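On GPU nodes, /etc/docker/daemon.json must also set nvidia as the default runtime so that the device plugin container itself starts with GPU access. Restart Docker after changing it: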

{
    "insecure-registries": ["192.168.3.61"],
    "registry-mirrors": [
        "https://7sl94zzz.mirror.aliyuncs.com",
        "https://hub.atomgit.com",
        "https://docker.awsl9527.cn"
    ],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}
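
To confirm that the default runtime change took effect after Docker restarts, a check along these lines may be useful. It is a sketch only; `docker_default_is_nvidia` is a hypothetical helper.

```shell
# Succeed only when Docker's default runtime is "nvidia".
docker_default_is_nvidia() {
    default_runtime=$(docker info --format '{{.DefaultRuntime}}')
    [ "$default_runtime" = "nvidia" ]
}

# Example:
#   docker_default_is_nvidia || echo "default runtime is not nvidia" >&2
```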

5. Test GPU Access with a Pod

Create a test pod definition (gpu_test.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: ffmpeg-pod
spec:
  nodeName: aiserver003087   # specify a GPU node
  containers:
  - name: ffmpeg-container
    image: nightseas/ffmpeg:latest
    command: ["/bin/bash", "-c", "tail -f /dev/null"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request 1 GPU

Apply the manifest:

# kubectl apply -f gpu_test.yaml
pod/ffmpeg-pod configured

Copy a video into the pod and run an FFmpeg conversion using GPU acceleration:

# kubectl cp test.mp4 ffmpeg-pod:/root
# kubectl exec -it ffmpeg-pod -- bash
# ffmpeg -hwaccel cuvid -c:v h264_cuvid -i test.mp4 -vf scale_npp=1280:720 -vcodec h264_nvenc out.mp4

If FFmpeg finishes without errors and writes out.mp4, the pod can access the GPU.
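
If the conversion fails, one thing worth ruling out is an FFmpeg build without the NVENC encoder. The sketch below checks the encoder listing for a given name; `encoder_listed` is a hypothetical helper, and a match only shows that the encoder is compiled in, not that the GPU is reachable.

```shell
# Read `ffmpeg -encoders` output on stdin and succeed if the named encoder
# appears as a whole word in the listing.
encoder_listed() {
    grep -q "[[:space:]]$1[[:space:]]"
}

# Example (run from the workstation):
#   kubectl exec ffmpeg-pod -- ffmpeg -hide_banner -encoders | encoder_listed h264_nvenc
```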

6. Node Labeling and DaemonSet Scheduling

Label GPU nodes so that the DaemonSet runs only on them:

# kubectl label nodes aiserver003087 gpu=true

Update the DaemonSet manifest to include a node selector:

spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"

Note: The selector value must be quoted ("true"). An unquoted true is parsed as a YAML boolean, and kubectl apply rejects it because label values must be strings.
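
As an alternative to editing the manifest by hand, the same selector can be applied with a merge patch. The sketch below only builds the patch document; `nodeselector_patch` is a hypothetical helper, and the `kubectl patch` invocation in the comment assumes the DaemonSet name and namespace from the manifest above.

```shell
# Print a JSON merge patch that sets one nodeSelector entry on a
# workload's pod template. The value is always emitted as a JSON string.
nodeselector_patch() {
    printf '{"spec":{"template":{"spec":{"nodeSelector":{"%s":"%s"}}}}}\n' "$1" "$2"
}

# Example:
#   kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
#     --type merge -p "$(nodeselector_patch gpu true)"
```

Because the helper wraps the value in quotes, the boolean pitfall described in the note does not arise.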

Modify the pod definition to use the node selector instead of a fixed node name:

spec:
  containers:
  - name: ffmpeg-container
    image: nightseas/ffmpeg:latest
    command: ["/bin/bash", "-c", "tail -f /dev/null"]
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    gpu: "true"

Be sure to quote the selector value "true".
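
To verify that the scheduler honored the selector, you can compare the node a pod landed on with the set of labeled nodes. This is a minimal sketch; `pod_on_labeled_node` is a hypothetical helper that takes a pod name and a label selector.

```shell
# Succeed if the pod is running on a node that carries the given label.
pod_on_labeled_node() {
    node=$(kubectl get pod "$1" -o jsonpath='{.spec.nodeName}')
    # `-o name` prints entries in the form node/<name>, one per line.
    kubectl get nodes -l "$2" -o name | grep -qxF "node/$node"
}

# Example:
#   pod_on_labeled_node ffmpeg-pod gpu=true && echo "scheduled on a GPU node"
```

A pod that is still Pending has no node name yet, so the check fails until it is scheduled.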

When multiple GPUs are present, you can specify a particular device by setting the appropriate environment variable or command‑line option (e.g., CUDA_VISIBLE_DEVICES=7 to use the 8th GPU, since indexing starts at 0).
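
The off-by-one between "the 8th GPU" and index 7 is easy to get wrong in scripts. The sketch below makes the conversion explicit; `gpu_index` is a hypothetical helper, and the commented invocation simply reuses the FFmpeg command from the test above.

```shell
# Convert a 1-based GPU ordinal (1st, 2nd, ...) to a zero-based index.
gpu_index() {
    if [ "$1" -lt 1 ]; then
        echo "ordinal must be 1 or greater" >&2
        return 1
    fi
    printf '%s\n' "$(($1 - 1))"
}

# Example: run FFmpeg on the 8th GPU only.
#   CUDA_VISIBLE_DEVICES="$(gpu_index 8)" ffmpeg -hwaccel cuvid -c:v h264_cuvid \
#     -i test.mp4 -vf scale_npp=1280:720 -vcodec h264_nvenc out.mp4
```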


For more details, see the original article: https://www.cnblogs.com/minseo/p/18460107

(Copyright belongs to the original author, please delete if infringed.)
