How to Deploy NVIDIA GPU Operator on Kubernetes for GPU‑Accelerated Rendering
Learn step‑by‑step how to install NVIDIA’s GPU‑operator on a Kubernetes cluster, configure GPU nodes, deploy a Blender workload for GPU‑accelerated rendering, monitor GPU metrics, and troubleshoot common issues, enabling seamless GPU scheduling and graphics rendering in cloud‑native environments.
Background
We need to deploy a business application on a Kubernetes platform as a container, using NVIDIA’s open‑source GPU‑operator to provide GPU scheduling and rendering capabilities.
Solution Overview
Deploy the full GPU‑operator stack on the Kubernetes cluster; it installs NVIDIA drivers, provides a device plugin for GPU scheduling, and exposes GPU‑related metrics.
Implementation Steps
Before installing the GPU‑operator, ensure the base environment matches:
GPU model: NVIDIA T4
GPU node OS: Ubuntu 22.04
Container engine: Docker
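Before installing, you can sanity-check the node environment with a few commands (a sketch; run these directly on the GPU node):

```shell
# Confirm the OS release and container engine match the prerequisites.
lsb_release -a            # expect Ubuntu 22.04
docker --version          # Docker as the container engine

# The T4 should be visible on the PCI bus even before the driver is installed.
lspci | grep -i nvidia
```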
Install NVIDIA GPU‑operator
Refer to NVIDIA’s official documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#
Configure the Helm client (the operator is installed via Helm); see the Huawei Cloud guide.
Add NVIDIA Helm repository.
<code>helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update</code>
Specify the driver version and install the GPU‑operator.
<code>helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.version=470.141.03</code>
Note: some images may fail to pull; pre‑pull them to the nodes.
Observe GPU‑operator status.
The NVIDIA driver compilation takes time; monitor the operator’s pods.
After GPU nodes are added, daemonsets install drivers and report GPU resources.
Once components are running, if any daemonset pod remains unready, restart the pod manually.
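The monitoring and manual-restart steps above can be sketched with kubectl (the daemonset pod name here is illustrative; yours will differ):

```shell
# Watch the operator's pods come up; driver compilation can take several minutes.
kubectl get pods -n gpu-operator -w

# If one daemonset pod stays unready after the others are Running,
# delete it so the daemonset controller recreates it (a manual restart).
kubectl delete pod -n gpu-operator nvidia-driver-daemonset-xxxxx
```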
Check GPU node status.
If the node still shows “GPU driver not ready”, wait for the driver daemonset to finish installing; once it completes, clicking the node reveals its GPU quota.
You can also verify via the node’s YAML.
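Verifying from the command line might look like this (node name is a placeholder):

```shell
# Once the driver is ready, nvidia.com/gpu should appear under both
# the node's Capacity and Allocatable sections.
kubectl describe node <gpu-node-name> | grep -A 6 "Capacity"
kubectl get node <gpu-node-name> -o yaml | grep "nvidia.com/gpu"
```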
Create a workload that requests GPU resources and runs a rendering task
Deploy a Blender workload using the following YAML:
<code>apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    version: v1
  name: blender
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blender
      version: v1
  template:
    metadata:
      labels:
        app: blender
        version: v1
    spec:
      containers:
      - image: swr.cn-east-3.myhuaweicloud.com/hz-cloud/blender:4.1.1
        imagePullPolicy: IfNotPresent
        name: container-1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: 250m
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blender
    version: v1
  name: blender
  namespace: default
spec:
  ports:
  - name: cce-service-0
    port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: blender
    version: v1
  type: NodePort</code>
Wait for the pod to become ready:
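Applying the manifest and waiting for readiness might look like this (the filename `blender.yaml` is an assumption; use whatever you saved the manifest as):

```shell
# Apply the Deployment and Service, then wait for the rollout to finish.
kubectl apply -f blender.yaml
kubectl rollout status deployment/blender -n default
kubectl get pods -n default -l app=blender
```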
Log into the pod and confirm the GPU is attached.
Check environment variable inside the container:
<code>NVIDIA_DRIVER_CAPABILITIES=all</code>
Configure rendering properties in the Blender UI.
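The in-pod checks above can be run without an interactive login, for example:

```shell
# Confirm the GPU is attached and visible inside the container.
kubectl exec -it deployment/blender -n default -- nvidia-smi

# The driver-capabilities variable should be injected by the operator's toolkit.
kubectl exec deployment/blender -n default -- env | grep NVIDIA
```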
Select Cycles engine and GPU compute device.
Confirm that Blender detects the NVIDIA T4 GPU.
Run the rendering job.
Rendering progress is displayed:
Monitor GPU usage inside the container.
Use <code>watch -d nvidia-smi</code> to see GPU utilization, memory, and compute usage increase dynamically.
Viewing GPU‑related metrics
The gpu‑operator deploys the dcgm‑exporter daemonset, which exposes GPU metrics on port 9400.
Manually query metrics with <code>curl POD_IP:9400/metrics</code>:
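A concrete query might look like this (the `app=nvidia-dcgm-exporter` label is an assumption; verify it against your deployment's pod labels):

```shell
# Look up a dcgm-exporter pod IP, then pull metrics from port 9400.
POD_IP=$(kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter \
  -o jsonpath='{.items[0].status.podIP}')
curl -s "http://${POD_IP}:9400/metrics" | grep DCGM_FI_DEV_GPU_UTIL
```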
If the cluster integrates Prometheus, create a ServiceMonitor to scrape these metrics for continuous GPU usage monitoring.
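A minimal ServiceMonitor might look like the sketch below; the metadata name, label selector, and port name are assumptions you should match to your dcgm-exporter Service and Prometheus Operator setup:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter   # must match the dcgm-exporter Service's labels
  endpoints:
  - port: gpu-metrics             # the named port on that Service (assumed name)
    interval: 30s
```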
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.