How to Manage GPU Resources in Kubernetes: From Containers to Device Plugins
This article explains why managing GPUs with Kubernetes improves cost efficiency and deployment speed, then shows how to containerize GPU workloads: building appropriate images, configuring NVIDIA drivers, and using Kubernetes Device Plugins and Extended Resources to schedule and monitor GPU resources. It closes with current limitations and community solutions.
GPU Containerization
To run a GPU workload in a container you need to:
Build a container image that contains the required CUDA libraries and the machine‑learning framework (e.g., TensorFlow, PyTorch). Use an official NVIDIA CUDA base image and add only the additional packages you need.
Run the image with Docker (or NVIDIA‑Docker) and bind‑mount the host’s /dev device files and NVIDIA driver libraries into the container.
The host must have the NVIDIA driver installed. The driver stays on the host, while the CUDA toolkit and application binaries are packaged inside the container. At runtime the driver’s shared libraries are bind‑mounted, allowing different CUDA versions to coexist on the same node.
Running a GPU Container with Docker
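docker run --gpus all my-gpu-image:latest
The --gpus all flag (or the NVIDIA‑Docker runtime) makes the GPU device files and the host's driver libraries visible inside the container, so they do not need to be mounted by hand with -v. Bind‑mounting the whole host /usr/lib/x86_64-linux-gnu or /dev over the container's own directories would also shadow the libraries shipped in the image.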
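To confirm that the GPUs actually reached the container, one can look for the NVIDIA device nodes from inside it. The sketch below assumes the usual driver naming (/dev/nvidia0, /dev/nvidia1, ... plus control nodes such as /dev/nvidiactl); the helper name find_gpu_device_nodes is hypothetical and not part of any NVIDIA tool.

```python
import glob
import os
import re

# Per-GPU device nodes are assumed to be named nvidia<index>.
_GPU_NODE = re.compile(r"^nvidia(\d+)$")


def find_gpu_device_nodes(dev_dir="/dev"):
    """Return per-GPU device nodes found under dev_dir, ordered by index.

    Control nodes such as nvidiactl or nvidia-uvm are skipped because
    they are shared by all GPUs rather than representing one device.
    """
    nodes = []
    for path in glob.glob(os.path.join(dev_dir, "nvidia*")):
        if _GPU_NODE.match(os.path.basename(path)):
            nodes.append(path)
    # Sort numerically so that nvidia10 comes after nvidia2.
    return sorted(
        nodes,
        key=lambda p: int(_GPU_NODE.match(os.path.basename(p)).group(1)),
    )


if __name__ == "__main__":
    gpus = find_gpu_device_nodes()
    print(f"{len(gpus)} GPU device node(s) visible: {gpus}")
```

Run inside the container, an empty result suggests the devices were not injected; nvidia-smi remains the more complete check because it also exercises the driver libraries.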
Kubernetes GPU Management
Kubernetes schedules GPUs through two complementary mechanisms:
Extended Resources: Users can define a custom integer‑valued resource such as nvidia.com/gpu. The scheduler treats the resource as a count and can allocate it to pods.
Device Plugin Framework: A third‑party plugin runs on each node, reports the health and quantity of GPUs to the kubelet, and handles allocation requests.
Reporting Extended Resources Manually
If a device plugin is not used, the node’s status can be patched directly:
curl -X PATCH \
  -H "Content-Type: application/strategic-merge-patch+json" \
  --data '{"status":{"capacity":{"example.com/gpu":"1"}}}' \
  https://KUBE_APISERVER/api/v1/nodes/NODE_NAME/status
When a Device Plugin is installed this step is performed automatically.
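The request body for such a patch can also be generated rather than typed by hand. The following is a minimal sketch, assuming the same payload shape as the curl example; capacity_patch is a hypothetical helper and it only builds the JSON, it does not contact the API server.

```python
import json


def capacity_patch(resource_name, count):
    """Build the patch body that advertises `count` units of a resource.

    Kubernetes quantities are serialized as strings, so the count is
    converted with str(). Extended resources must be whole numbers and
    their names must carry a domain prefix such as example.com/.
    """
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        raise ValueError("extended resource counts must be non-negative integers")
    if "/" not in resource_name:
        raise ValueError("resource name needs a domain prefix, e.g. example.com/gpu")
    return json.dumps({"status": {"capacity": {resource_name: str(count)}}})


if __name__ == "__main__":
    # Body for the example.com/gpu resource used in the curl command.
    print(capacity_patch("example.com/gpu", 1))
```

Rejecting fractional values up front mirrors the rule that extended resources are counted in whole units.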
Device Plugin Lifecycle
Registration: The plugin registers its resource name, socket path, and API version with the kubelet.
Service Start: It runs a gRPC server on that socket to serve kubelet requests. The server has to be listening by the time registration completes, because the kubelet connects back to the socket right away.
ListAndWatch: The kubelet opens a long‑running stream to receive device IDs and health status.
Allocate: When a pod requests a GPU, the kubelet calls Allocate; the plugin returns the device paths, driver directories, and any required environment variables.
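The lifecycle above can be sketched as a toy, in-memory simulation. This is not the real Device Plugin API: there is no gRPC, the class and method names (ToyGpuPlugin, ToyKubelet, start_container) are hypothetical, and the mapping from device ID to /dev/nvidiaN is invented for illustration. It only shows which side calls which step and what kind of data flows back.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGpuPlugin:
    """In-memory stand-in for a device plugin; no gRPC involved."""

    resource_name: str = "nvidia.com/gpu"
    # Device ID -> health string, as a plugin would report it.
    devices: dict = field(
        default_factory=lambda: {"GPU-0": "Healthy", "GPU-1": "Healthy"}
    )

    def list_and_watch(self):
        """Yield the device list once; a real plugin keeps streaming updates."""
        yield [
            {"id": dev_id, "health": health}
            for dev_id, health in sorted(self.devices.items())
        ]

    def allocate(self, device_ids):
        """Return what a container needs in order to use the given devices."""
        unknown = [d for d in device_ids if d not in self.devices]
        if unknown:
            raise KeyError(f"unknown device IDs: {unknown}")
        return {
            # Hypothetical ID-to-device-node mapping, for illustration only.
            "devices": [f"/dev/nvidia{d.split('-')[1]}" for d in device_ids],
            "envs": {"NVIDIA_VISIBLE_DEVICES": ",".join(device_ids)},
        }


class ToyKubelet:
    """Tracks registered plugins and the healthy devices each one advertised."""

    def __init__(self):
        self.plugins = {}
        self.healthy = {}

    def register(self, plugin):
        # Registration, followed by the kubelet reading the device stream.
        self.plugins[plugin.resource_name] = plugin
        first_update = next(plugin.list_and_watch())
        self.healthy[plugin.resource_name] = [
            d["id"] for d in first_update if d["health"] == "Healthy"
        ]

    def start_container(self, resource_name, count):
        # Allocate: pick free healthy devices, then ask the plugin how to expose them.
        free = self.healthy[resource_name]
        if count > len(free):
            raise RuntimeError("not enough healthy devices")
        chosen, self.healthy[resource_name] = free[:count], free[count:]
        return self.plugins[resource_name].allocate(chosen)


if __name__ == "__main__":
    kubelet = ToyKubelet()
    kubelet.register(ToyGpuPlugin())
    print(kubelet.start_container("nvidia.com/gpu", 1))
```

Note that the kubelet, not the plugin, chooses which device IDs to hand out; the plugin only translates the chosen IDs into mounts, device files, and environment variables.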
Pod Scheduling with GPUs
A pod requests a GPU by adding a limit:
resources:
  limits:
    nvidia.com/gpu: 1
The scheduler selects a node with enough unallocated GPUs, counts the request against that node's allocatable total, and binds the pod. During container creation the kubelet contacts the appropriate Device Plugin, passes it the chosen device IDs, and mounts the device files and driver directories the plugin returns into the container.
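The counting logic behind this placement can be illustrated with a small sketch. It assumes a simplified view in which each node is described only by its allocatable GPU count and the sum of GPU requests already bound to it; pick_node and the node names are hypothetical, and the real scheduler applies many more filters and scoring rules.

```python
def pick_node(nodes, requested_gpus):
    """Return the name of the first node with enough unclaimed GPUs, or None.

    `nodes` maps node name -> {"allocatable": int, "requested": int}, where
    "requested" is the total GPU count of pods already bound to that node.
    """
    for name in sorted(nodes):
        free = nodes[name]["allocatable"] - nodes[name]["requested"]
        if free >= requested_gpus:
            # Binding the pod adds its request to the node's running total.
            nodes[name]["requested"] += requested_gpus
            return name
    return None


if __name__ == "__main__":
    cluster = {
        "node-a": {"allocatable": 2, "requested": 2},  # fully used
        "node-b": {"allocatable": 4, "requested": 1},  # 3 GPUs free
    }
    print(pick_node(cluster, 2))  # node-b
    print(pick_node(cluster, 2))  # None: only 1 GPU is left on node-b
```

The advertised capacity itself never changes here; only the running total of requests does, which is why a node keeps reporting the same GPU count while pods come and go.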
Deploying GPU Support on a Kubernetes Node (CentOS example)
Install the NVIDIA driver (requires gcc and kernel headers).
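Install the NVIDIA Docker runtime (package nvidia-docker2), set nvidia as the default runtime in /etc/docker/daemon.json so that containers started by the kubelet use it, and restart Docker. Verify with docker info (look for nvidia under Runtimes and as the Default Runtime).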
Deploy the NVIDIA Device Plugin as a DaemonSet:
git clone https://github.com/NVIDIA/k8s-device-plugin.git
kubectl apply -f k8s-device-plugin/nvidia-device-plugin.yml
The DaemonSet runs the plugin on every GPU node, registers the resource nvidia.com/gpu, and starts the gRPC server.
Verification
After the DaemonSet is ready, inspect the node:
kubectl get node NODE_NAME -o jsonpath='{.status.capacity.nvidia\.com/gpu}'
The output should be the number of GPUs (e.g., 2).
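For scripted checks, the same value can be read from the full node object. The sketch below assumes the input is the JSON printed by kubectl get node NODE_NAME -o json; gpu_capacity is a hypothetical helper, and the sample document is trimmed to the fields it reads.

```python
import json


def gpu_capacity(node_json, resource="nvidia.com/gpu"):
    """Return the advertised count for `resource`, or 0 if the node has none."""
    node = json.loads(node_json)
    capacity = node.get("status", {}).get("capacity", {})
    # Quantities arrive as strings, e.g. "2".
    return int(capacity.get(resource, "0"))


if __name__ == "__main__":
    # Trimmed-down, made-up node object for illustration.
    sample = '{"status": {"capacity": {"cpu": "8", "nvidia.com/gpu": "2"}}}'
    print(gpu_capacity(sample))  # 2
```

A result of 0 on a GPU node usually means the device plugin has not registered yet, so the DaemonSet pod on that node is the first thing to inspect.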
Sample Pod Manifest
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: tf
    image: nvcr.io/nvidia/tensorflow:22.09-tf2-py3
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["/bin/bash", "-c", "nvidia-smi && python -c 'import tensorflow as tf; print(tf.__version__)'"]
Deploy with kubectl apply -f gpu-pod.yaml. Inside the container nvidia-smi should list the allocated GPU (e.g., a T4), confirming that the device is isolated and visible.
Limitations of the Built‑in Device Plugin Model
The scheduler only tracks the number of GPUs, not their specific capabilities (e.g., NVLink connectivity, memory size). Complex placement requirements such as “two GPUs linked by NVLink” cannot be expressed. The Device Plugin API also lacks extensibility for custom parameters in Allocate or ListAndWatch, making heterogeneous or affinity‑aware scheduling difficult.
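A small sketch makes the gap concrete. It assumes two hypothetical nodes that each report two free GPUs, with an invented NVLink layout (GPUs 0-1 and 2-3 linked); find_linked_pair is an illustrative helper, not part of Kubernetes or any NVIDIA tool. A count-based scheduler sees the two nodes as identical, although only one of them can satisfy a request for two linked GPUs.

```python
from itertools import combinations


def find_linked_pair(free_gpus, nvlink_pairs):
    """Return a pair of free GPUs joined by NVLink, or None if there is none."""
    links = {frozenset(pair) for pair in nvlink_pairs}
    for a, b in combinations(sorted(free_gpus), 2):
        if frozenset((a, b)) in links:
            return (a, b)
    return None


# Both nodes advertise the same free count (2), which is all the scheduler sees.
node_a = {"free": [0, 3], "nvlink": [(0, 1), (2, 3)]}
node_b = {"free": [2, 3], "nvlink": [(0, 1), (2, 3)]}

print(find_linked_pair(node_a["free"], node_a["nvlink"]))  # None: 0 and 3 are not linked
print(find_linked_pair(node_b["free"], node_b["nvlink"]))  # (2, 3)
```

Topology information of this kind lives only on the node, which is why the extensions listed next move part of the placement decision into custom schedulers or vendor plugins.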
Community Extensions for Heterogeneous Scheduling
NVIDIA’s custom GPU‑aware scheduler (fork of upstream scheduler).
Alibaba Cloud’s GPU‑sharing scheduler for multi‑tenant environments.
Vendor‑specific plugins for RDMA, FPGA, and AMD GPUs.
Alibaba Cloud Native
