How to Run GPU‑Accelerated AI Workloads on Kubernetes

This article explains how Kubernetes supports GPU workloads for AI and machine learning, covering device plugins, pod GPU requests, oversubscription, security isolation, cloud‑provider node setup, and protecting GPU nodes from non‑GPU pods.

MaGe Linux Operations

Mar 5, 2024

Kubernetes handles a wide range of containerized workloads well, including AI and machine-learning jobs that need GPUs. GPUs, however, are scheduled, shared, and isolated differently from CPU and memory, and those differences are covered below.

Device Plugins

Kubernetes itself has no knowledge of GPUs; it relies on the extensible device-plugin framework. Device plugins, typically deployed as DaemonSets, advertise available resources (e.g., GPUs, InfiniBand adapters) to the kubelet, which reports them to the API server so the scheduler can place pods on nodes that have them.
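
As a rough illustration of what the scheduler sees once a device plugin has registered, a GPU node's status would carry an extra resource entry along these lines. The node name and the count of four GPUs are hypothetical; `nvidia.com/gpu` is the resource name used by the pod examples in this article.

```yaml
# Excerpt of a Node object after the Nvidia device plugin has registered
# its devices with the kubelet. Node name and GPU count are hypothetical.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1
status:
  capacity:
    nvidia.com/gpu: "4"    # total GPUs advertised by the plugin
  allocatable:
    nvidia.com/gpu: "4"    # GPUs the scheduler may hand out to pods
```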

Requesting GPUs from Workloads

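Containers request GPUs much as they request CPU or memory, with a few restrictions: GPUs are specified under limits; a request may be omitted (Kubernetes then uses the limit as the request) but must equal the limit if it is given; and the value must be a whole number.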

Example pod requesting one Nvidia GPU:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        memory: 1000Mi
        nvidia.com/gpu: 1

Oversubscription and Time‑Sharing

CPU time-sharing is handled by the kernel through cgroups. GPUs can be shared through two mechanisms:

Multi‑instance GPUs (e.g., Nvidia A100, H100) expose multiple virtual GPUs, allowing partitioning; supported by AWS, Azure, and GCP.

For single‑instance GPUs, Nvidia’s GPU scheduler slices time among workloads; supported by AWS and GCP.
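
With a multi-instance GPU, each partition can be requested like any other extended resource. The fragment below is a sketch that assumes the Nvidia device plugin is running with its "mixed" MIG strategy, under which partitions are advertised by profile name. The `1g.5gb` profile (one of the A100 40GB profiles) and the pod and container names are illustrative assumptions, and managed platforms may expose partitions differently.

```yaml
# Sketch: a pod asking for one MIG partition instead of a whole GPU.
# Assumes the device plugin advertises partitions as nvidia.com/mig-<profile>.
apiVersion: v1
kind: Pod
metadata:
  name: my-mig-pod            # hypothetical name
spec:
  containers:
  - name: my-mig-container    # hypothetical name
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb partition (assumed profile)
```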

While oversubscription is possible, workloads can be starved because GPU scheduling lacks a fully fair scheduler and cgroup priority.
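
Where you run the Nvidia device plugin yourself, time-slicing is typically turned on through the plugin's configuration file. The sketch below assumes that setup. The replica count of 4 is an arbitrary illustration, and managed platforms may offer their own node-pool settings instead.

```yaml
# Sketch of an Nvidia device plugin configuration that enables time-slicing.
# The replica count is an assumption chosen for illustration.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4    # advertise each physical GPU as 4 schedulable GPUs
```

With a configuration like this, each physical GPU is advertised as four `nvidia.com/gpu` resources, so up to four pods requesting one GPU each can land on it. Nothing here guarantees any of them a fair share of GPU time or memory.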

Security/Isolation

Unlike CPUs, GPUs shared through time-slicing currently offer no process or memory isolation: every workload on the GPU shares its memory. Share a GPU only among workloads that trust one another.

Creating GPU Nodes

The method for creating GPU‑enabled nodes varies by cloud provider.

AWS

Use the Amazon EKS optimized accelerated Amazon Linux AMI, which ships with Nvidia drivers pre-installed; you must install the Nvidia device plugin yourself.

Run Nvidia’s GPU Operator on the node group; upgrades are manual.

Azure

Create a GPU node pool that includes drivers but requires manual installation of the Nvidia device plugin.

Use the AKS GPU preview image, which bundles drivers and the device plugin; upgrades are manual.

Run Nvidia’s GPU Operator on the node group, which handles everything.

GCP

Let Google manage GPU driver installation and the device plugin; GKE can also auto‑upgrade nodes.

Manage GPU drivers and the device plugin yourself.

Protecting GPU Nodes from Non‑GPU Workloads

To keep non-GPU workloads off GPU nodes, taint the GPU node pools when you create them and add a matching toleration to the pods that need GPUs. GKE adds the taint to GPU node pools automatically; other providers require manual configuration.
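
However the taint is applied (through node-pool settings or directly on the node), it ends up in the node's spec. The excerpt below is a sketch of what that looks like. The node name is hypothetical, and the value `present` is an assumption that mirrors what GKE applies; any value works with a toleration that uses the `Exists` operator.

```yaml
# Excerpt of a tainted GPU node. Pods without a matching toleration
# will not be scheduled here.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1          # hypothetical name
spec:
  taints:
  - key: nvidia.com/gpu
    value: present          # assumed value
    effect: NoSchedule
```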

Example pod toleration for the "nvidia.com/gpu" taint:

apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-runtime-ubuntu20.04
    command: ["/bin/bash", "-c", "--"]
    args: ["while true; do sleep 600; done;"]
    resources:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        memory: 1000Mi
        nvidia.com/gpu: 1
  # tolerations belong to the pod spec, not to an individual container
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

As AI and machine‑learning workloads continue to grow, consider running them on Kubernetes rather than more expensive proprietary cloud services.
