Testing NVIDIA GPU DRA on Kubernetes 1.31
This guide walks through setting up an Ubuntu 22.04 environment: installing Docker, kind, and the NVIDIA Container Toolkit; configuring the NVIDIA runtime as the Docker default; building and deploying the NVIDIA Kubernetes DRA driver; and running three demo scenarios that show GPU sharing across containers and pods on a Kubernetes 1.31 cluster.
Prerequisites
Operating system: Ubuntu 22.04
Container runtime: Docker
Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

Install kind
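The download command below targets amd64 machines; on other CPUs only the URL suffix changes. A hedged sketch (assuming kind v0.25.0) that picks the URL from the output of uname -m:

```shell
# Pick the kind v0.25.0 download URL for this machine's CPU architecture.
# Unrecognized architectures fall back to amd64 with a warning (sketch).
ARCH=$(uname -m)
case "$ARCH" in
  x86_64)  KIND_URL=https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64 ;;
  aarch64) KIND_URL=https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-arm64 ;;
  *)       echo "unrecognized arch '$ARCH', defaulting to amd64" >&2
           KIND_URL=https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64 ;;
esac
echo "$KIND_URL"
```

The same URL can then be passed to the curl command shown below.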
# For AMD64 / x86_64
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

Install NVIDIA Container Toolkit
Add the NVIDIA repository and key
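The pipeline below pins the repository to NVIDIA's signing key; its sed step rewrites each deb entry in the list file. A small sketch of that rewrite on one sample line:

```shell
# Demonstrate the signed-by rewrite on a sample sources.list entry.
line='deb https://nvidia.github.io/libnvidia-container/stable/deb/ /'
printf '%s\n' "$line" |
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'
```

Every "deb https://" prefix gains a signed-by option pointing at the dearmored keyring written in the first step.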
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the package index

sudo apt-get update

Install the toolkit
sudo apt-get install -y nvidia-container-toolkit

Configure Docker to use the NVIDIA runtime as default
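The nvidia-ctk command in this step edits /etc/docker/daemon.json. For reference, after it runs the file should resemble the following (a sketch; exact contents vary if the daemon already had other settings):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```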
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

Restart Docker
sudo systemctl restart docker

Enable device visibility via volume mounts
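This step matters because kind "nodes" are themselves containers: GPUs are injected into them as volume mounts rather than through the NVIDIA_VISIBLE_DEVICES environment variable, so the runtime must accept that form of device request. After the command below runs, the config file should contain (sketch of the relevant line):

```toml
# /etc/nvidia-container-runtime/config.toml (relevant setting)
accept-nvidia-visible-devices-as-volume-mounts = true
```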
# /etc/nvidia-container-runtime/config.toml
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true

Set up a Kind cluster and install the DRA driver
Clone the DRA driver repository
git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver

Create a Kind cluster for the demo

./demo/clusters/kind/create-cluster.sh

Install kubectl and helm

# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

Build the NVIDIA DRA driver image

# Fallback command if earlier steps fail:
make build-image
./demo/clusters/kind/build-dra-driver.sh
# Load the image into Kind
kind load docker-image nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0-ubuntu20.04 --name k8s-dra-driver-cluster

Install the DRA driver into the cluster
./demo/clusters/kind/install-dra-driver.sh

Verify installation
After a successful install, two pods should be running in the nvidia-dra-driver namespace:
kubectl get pods -n nvidia-dra-driver
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-kubelet-plugin-t5qgz   1/1     Running   0          44s

Run demo scenarios
Case 1 – Two containers in the same pod share one GPU
kubectl apply --filename=demo/specs/quickstart/gpu-test2.yaml

Case 2 – Two pods share the same GPU
kubectl apply --filename=demo/specs/quickstart/gpu-test3.yaml

Case 3 – Two pods share a specific GPU model (Tesla T4)
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test3
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: gpu-test3
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: |
            device.attributes['gpu.nvidia.com'].productName == 'Tesla T4'
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod2
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu

These examples demonstrate how the Kubernetes Dynamic Resource Allocation (DRA) feature can be used to allocate NVIDIA GPUs to multiple containers and pods, including selecting a specific GPU model.
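To confirm that sharing actually happened, compare the GPU UUIDs that the containers printed via nvidia-smi -L. A sketch of that comparison on sample log lines (in a live cluster, LOG0 and LOG1 would come from kubectl logs on the two pods; the strings below are stand-ins):

```shell
# Decide whether two containers saw the same physical GPU by comparing
# the UUIDs embedded in their `nvidia-smi -L` output lines.
LOG0='GPU 0: Tesla T4 (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)'
LOG1='GPU 0: Tesla T4 (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)'

# Extract the 36-character UUID that follows the "GPU-" prefix.
uuid() { printf '%s\n' "$1" | grep -o 'GPU-[0-9a-f-]\{36\}'; }

if [ "$(uuid "$LOG0")" = "$(uuid "$LOG1")" ]; then
  echo "same GPU shared by both containers"
else
  echo "different GPUs"
fi
```

When both pods were scheduled against the single-gpu claim above, the two UUIDs should match.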
Infra Learning Club
Infra Learning Club shares study notes, cutting-edge technology, and career discussions.