
How to Deploy NVIDIA GPU Operator on Kubernetes for GPU‑Accelerated Rendering

Learn step‑by‑step how to install NVIDIA’s GPU‑operator on a Kubernetes cluster, configure GPU nodes, deploy a Blender workload for GPU‑accelerated rendering, monitor GPU metrics, and troubleshoot common issues, enabling seamless GPU scheduling and graphics rendering in cloud‑native environments.

Ops Development Stories

Background

We need to deploy a business application on a Kubernetes platform as a container, using NVIDIA’s open‑source GPU‑operator to provide GPU scheduling and rendering capabilities.

Solution Overview

Deploy the full GPU‑operator stack on the Kubernetes cluster; it installs NVIDIA drivers, provides a device plugin for GPU scheduling, and exposes GPU‑related metrics.

Implementation Steps

Before installing the GPU‑operator, ensure the base environment matches:

GPU model: NVIDIA T4

GPU node OS: Ubuntu 22.04

Container engine: Docker
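These prerequisites can be verified from the node itself before installing anything; a minimal sketch, assuming an Ubuntu node with the GPU attached via PCIe:

```shell
# Confirm the node OS release (expect Ubuntu 22.04)
lsb_release -a

# Confirm the GPU is visible on the PCI bus (expect an NVIDIA T4 entry)
lspci | grep -i nvidia

# Confirm Docker is installed and the daemon is running
docker info --format '{{.ServerVersion}}'
```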

Install NVIDIA GPU‑operator

Refer to NVIDIA’s official documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#

Configure the Helm client (the GPU‑operator is installed via Helm). See the Huawei Cloud guide.

Add NVIDIA Helm repository.

<code>helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update</code>

Specify driver version and install GPU‑operator.

<code>helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.version=470.141.03</code>

Note: Some images may fail to pull; pre‑pull them to the nodes.
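A sketch of the pre-pull workflow; the image name and tag below are illustrative, so check the actual failing image with `kubectl describe` first:

```shell
# Find which image a stuck pod is trying to pull
kubectl -n gpu-operator describe pod <pod-name> | grep -i image

# Pre-pull that image on each GPU node with Docker
# (image name and tag here are illustrative, not the exact ones you will see)
docker pull nvcr.io/nvidia/gpu-operator:v23.9.1
```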

Observe GPU‑operator status.

The NVIDIA driver compilation takes time; monitor the operator’s pods.

After GPU nodes are added, daemonsets install drivers and report GPU resources.

If any daemonset pod remains unready after the other components are running, restart it manually.
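The observation and restart steps above can be done with kubectl; a minimal sketch, using the `gpu-operator` namespace from the install command earlier:

```shell
# Watch all operator pods come up (driver compilation can take several minutes)
kubectl -n gpu-operator get pods -w

# Check the daemonsets that install the driver and report GPU resources
kubectl -n gpu-operator get daemonset

# If a daemonset pod stays unready, delete it; the daemonset recreates it
kubectl -n gpu-operator delete pod <pod-name>
```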

Check GPU node status.

While the driver installs, the node shows “GPU driver not ready”; once installation completes, clicking into the node reveals its GPU quota.

You can also verify via the node’s YAML.
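The node's reported GPU capacity can also be checked from the command line; a sketch:

```shell
# Show the GPU resources the device plugin has reported on the node
kubectl describe node <gpu-node-name> | grep -A2 'nvidia.com/gpu'

# Or read the allocatable GPU count straight from the node object
kubectl get node <gpu-node-name> \
  -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```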

Create a workload that requests GPU resources and run a rendering task

Deploy a Blender workload using the following YAML:

<code>apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    version: v1
  name: blender
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blender
      version: v1
  template:
    metadata:
      labels:
        app: blender
        version: v1
    spec:
      containers:
      - image: swr.cn-east-3.myhuaweicloud.com/hz-cloud/blender:4.1.1
        imagePullPolicy: IfNotPresent
        name: container-1
        resources:
          limits:
            nvidia.com/gpu: "1"
          requests:
            cpu: 250m
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blender
    version: v1
  name: blender
  namespace: default
spec:
  ports:
  - name: cce-service-0
    port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: blender
    version: v1
  type: NodePort</code>

Wait for the pod to become ready.

Log into the pod and confirm the GPU is attached.

Check the environment variable inside the container:

NVIDIA_DRIVER_CAPABILITIES=all
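Both checks can also be run from outside the pod; a minimal sketch using the Deployment name from the YAML above:

```shell
# Check the NVIDIA environment variables injected into the container
kubectl exec deploy/blender -- env | grep NVIDIA

# Confirm the T4 is visible to the container's driver stack
kubectl exec deploy/blender -- nvidia-smi
```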

Configure rendering properties in the Blender UI.

Select Cycles engine and GPU compute device.

Confirm Blender detects the NVIDIA T4 GPU.

Run the rendering job.

Rendering progress is displayed in the UI.

Monitor GPU usage inside the container.

Use <code>watch -d nvidia-smi</code> to watch GPU utilization, memory, and compute usage rise dynamically.

Viewing GPU‑related metrics

The gpu‑operator deploys the dcgm‑exporter daemonset, which exposes GPU metrics on port 9400.

Manually query metrics with <code>curl POD_IP:9400/metrics</code>.
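A sketch of querying one exporter pod's endpoint; the label selector is the one the GPU‑operator typically applies, and the metric names in the grep are standard dcgm‑exporter gauges, so adjust both to what your cluster actually exposes:

```shell
# Find a dcgm-exporter pod and its IP (label selector assumed)
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter -o wide

# Query the endpoint; DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are
# standard dcgm-exporter metrics for utilization and framebuffer memory
curl -s <pod-ip>:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED'
```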

If the cluster integrates Prometheus, create a ServiceMonitor to scrape these metrics for continuous GPU usage monitoring.
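A hedged sketch of such a ServiceMonitor; it assumes the Prometheus Operator CRDs are installed, and the service label and port name below are assumptions, so match them against your actual dcgm‑exporter Service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter   # label on the dcgm-exporter Service (assumed)
  endpoints:
  - port: gpu-metrics             # port name on that Service (assumed)
    interval: 30s
```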

Tags: Cloud Native, Kubernetes, NVIDIA, Blender, GPU rendering, GPU-operator
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
