
Testing NVIDIA GPU DRA on Kubernetes 1.31

This guide walks through preparing an Ubuntu 22.04 host: installing Docker, kind, and the NVIDIA Container Toolkit; configuring the NVIDIA runtime as the Docker default; building and deploying NVIDIA's Kubernetes DRA (Dynamic Resource Allocation) driver; and running three demo scenarios that demonstrate GPU sharing across containers and pods on a Kubernetes 1.31 cluster.


Prerequisites

Operating system: Ubuntu 22.04
Container runtime: Docker

Install Docker

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
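
An optional sanity check that the daemon is up before continuing:

sudo docker run --rm hello-world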

Install kind

# For AMD64 / x86_64
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.25.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
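
Confirm the binary is on your PATH:

kind version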

Install NVIDIA Container Toolkit

Add the NVIDIA repository and key

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the package index

sudo apt-get update

Install the toolkit

sudo apt-get install -y nvidia-container-toolkit

Configure Docker to use the NVIDIA runtime as default

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
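
This rewrites /etc/docker/daemon.json. On a stock install the result should look roughly like the following (a sketch; your file may carry additional keys):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}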

Restart Docker

sudo systemctl restart docker

Enable device visibility via volume mounts

# /etc/nvidia-container-runtime/config.toml
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
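
kind exposes GPUs to its node containers as volume mounts rather than via the NVIDIA_VISIBLE_DEVICES environment variable, so this flag is needed for devices to show up inside the cluster. A quick check that the setting took effect:

grep accept-nvidia-visible-devices-as-volume-mounts /etc/nvidia-container-runtime/config.toml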

Set up a Kind cluster and install the DRA driver

Clone the DRA driver repository

git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver

Create a Kind cluster for the demo

./demo/clusters/kind/create-cluster.sh

Install kubectl and helm

# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
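
Both CLIs should now report their versions:

kubectl version --client
helm version --short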

Build the NVIDIA DRA driver image and load it into the Kind cluster

./demo/clusters/kind/build-dra-driver.sh
# Fallback if the script fails: build the image directly, then load it into Kind
make build-image
kind load docker-image nvcr.io/nvidia/cloud-native/k8s-dra-driver:v0.1.0-ubuntu20.04 --name k8s-dra-driver-cluster

Install the DRA driver into the cluster

./demo/clusters/kind/install-dra-driver.sh

Verify installation

After a successful install, the driver's kubelet-plugin pod should be Running in the nvidia-dra-driver namespace:

kubectl get pods -n nvidia-dra-driver
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-kubelet-plugin-t5qgz   1/1     Running   0          44s
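
With the driver up, the kubelet plugin advertises the node's GPUs through the DRA API. Assuming the demo cluster script enabled the DRA feature gates (it does this for you), you can inspect what was published:

kubectl get deviceclasses
kubectl get resourceslices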

Run demo scenarios

Case 1 – Two containers in the same pod share one GPU

kubectl apply --filename=demo/specs/quickstart/gpu-test2.yaml
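
Each container in the spec runs nvidia-smi -L, so both containers' logs should report the same GPU UUID. Assuming the quickstart spec creates its objects in a gpu-test2 namespace (check the YAML if names have changed):

kubectl get pods -n gpu-test2
kubectl logs -n gpu-test2 <pod-name> --all-containers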

Case 2 – Two pods share the same GPU

kubectl apply --filename=demo/specs/quickstart/gpu-test3.yaml
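
This time the same GPU UUID should appear in the logs of two separate pods; the quickstart spec for this case uses a gpu-test3 namespace:

kubectl get pods -n gpu-test3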

Case 3 – Two pods share a specific GPU model (Tesla T4)

This spec is a variant of the gpu-test3 quickstart used in Case 2, with a CEL selector added to the ResourceClaim so the allocation is pinned to a Tesla T4:

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test3
---
apiVersion: resource.k8s.io/v1alpha3
kind: ResourceClaim
metadata:
  namespace: gpu-test3
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.nvidia.com
      selectors:
      - cel:
          expression: |
            device.attributes['gpu.nvidia.com'].productName=='Tesla T4'
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod2
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu
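
Since this spec reuses the gpu-test3 namespace from Case 2, delete the Case 2 objects first. Then save the spec to a file (gpu-test-t4.yaml below is a name chosen for this example), apply it, and compare the UUIDs both pods print:

kubectl delete --filename=demo/specs/quickstart/gpu-test3.yaml
kubectl apply --filename=gpu-test-t4.yaml
kubectl logs -n gpu-test3 pod1 -c ctr
kubectl logs -n gpu-test3 pod2 -c ctr

If sharing worked, both pods list the same Tesla T4 UUID.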

These examples demonstrate how the Kubernetes Dynamic Resource Allocation (DRA) feature can be used to allocate NVIDIA GPUs to multiple containers and pods, including selecting devices by specific GPU model.
