How to Run GPU‑Accelerated AI Workloads on Kubernetes
This article explains how Kubernetes supports GPU workloads for AI and machine learning, covering device plugins, pod GPU requests, oversubscription, security isolation, cloud‑provider node setup, and protecting GPU nodes from non‑GPU pods.
Kubernetes is well‑suited for a wide range of containerized workloads, including AI and machine learning jobs that require GPUs, though there are many nuances.
Device Plugins
Kubernetes itself has no knowledge of GPUs; it relies on the extensible device-plugin framework. Device plugins, typically deployed as DaemonSets, advertise the resources available on each node (e.g., GPUs, InfiniBand) to the kubelet, which reports them to the API server so the scheduler can place pods accordingly.
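As a rough illustration, once a device plugin has registered with the kubelet, the GPUs appear as an extended resource in the node's status. The excerpt below is a sketch only: the counts are hypothetical values for a node with four GPUs, and the resource name `nvidia.com/gpu` assumes the Nvidia device plugin.

```yaml
# Excerpt of a Node object's status (illustrative values, not real output)
status:
  capacity:
    cpu: "32"
    nvidia.com/gpu: "4"   # advertised by the device plugin
  allocatable:
    cpu: "32"
    nvidia.com/gpu: "4"   # what the scheduler can hand out to pods
```

The scheduler treats this like any other countable resource: a pod asking for one GPU can only be placed on a node with at least one unallocated `nvidia.com/gpu`.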
Requesting GPUs from Workloads
Containers request GPUs much as they request CPU or memory, with a few differences: GPUs are specified under limits, a request is optional but must equal the limit if you set one, and the values must be whole numbers (no fractional GPUs).
Example pod requesting one Nvidia GPU:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        memory: 1000Mi
        nvidia.com/gpu: 1

Oversubscription and Time‑Sharing
CPU time-sharing is handled by cgroups. GPUs can be shared among workloads via two mechanisms:
Multi-instance GPUs (e.g., Nvidia A100, H100) can be partitioned into multiple GPU instances, each exposed as a separate device; supported by AWS, Azure, and GCP.
For GPUs that are not partitioned, Nvidia's GPU scheduler time-slices the device among workloads; supported by AWS and GCP.
Oversubscription is therefore possible, but workloads can be starved: GPU scheduling has neither a fully fair scheduler nor anything like cgroup priorities.
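As a minimal sketch of the time-slicing option, the Nvidia device plugin accepts a sharing configuration along the following lines. The ConfigMap name, namespace, data key, and replica count are illustrative assumptions, and managed offerings may expose the same setting through their own node-pool options instead.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # hypothetical name
  namespace: gpu-operator     # assumed namespace
data:
  # "any" is a hypothetical key; its value is the plugin's sharing config.
  # replicas: 4 advertises each physical GPU as 4 schedulable GPUs.
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

With a configuration like this, pods still request `nvidia.com/gpu: 1`, but several of them can land on the same physical GPU, subject to the starvation caveat above.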
Security/Isolation
Unlike CPUs, a time-shared GPU currently provides no process or memory isolation: all workloads on the GPU share its memory. Share a GPU only among mutually trusted workloads.
Creating GPU Nodes
The method for creating GPU-enabled nodes varies by cloud provider, and each provider offers more than one option.
AWS
Use the EKS-optimized accelerated Amazon Linux AMI, which has Nvidia drivers pre-installed; you must install the Nvidia device plugin yourself.
Run Nvidia’s GPU Operator on the node group; upgrades are manual.
Azure
Create a GPU node pool that includes drivers but requires manual installation of the Nvidia device plugin.
Use the AKS GPU preview image, which bundles drivers and the device plugin; upgrades are manual.
Run Nvidia's GPU Operator on the node pool, which installs and manages both the drivers and the device plugin.
GCP
Let Google manage GPU driver installation and the device plugin; GKE can also auto‑upgrade nodes.
Manage GPU drivers and the device plugin yourself.
Protecting GPU Nodes from Non‑GPU Workloads
To keep non-GPU workloads off GPU nodes, taint the GPU nodes when you create the node pool and add a matching toleration to the pods that need GPUs. GKE can add the taint to GPU node pools automatically; other providers require manual configuration.
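For reference, a taint on a GPU node looks roughly like the fragment below. This is a sketch under assumptions: the node name is hypothetical, and the `present` value follows the convention GKE uses; any value works with a toleration that uses the `Exists` operator.

```yaml
# Excerpt of a Node object carrying the GPU taint
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1          # hypothetical node name
spec:
  taints:
  - key: nvidia.com/gpu
    value: present          # assumed value
    effect: NoSchedule      # pods without a matching toleration are not scheduled here
```

In practice the taint is usually set through the provider's node-pool settings rather than by editing Node objects directly, so that newly created nodes receive it as well.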
Example pod toleration for the "nvidia.com/gpu" taint:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        memory: 1000Mi
        nvidia.com/gpu: 1
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

As AI and machine‑learning workloads continue to grow, consider running them on Kubernetes rather than more expensive proprietary cloud services.