
Harnessing Nvidia GPUs in Kubernetes: Virtualization, Scheduling & Best Practices

This article explains how to combine Nvidia GPUs with Kubernetes, covering CUDA toolkits, device plugins, GPU virtualization techniques such as Time‑Slicing, MPS and MIG, and advanced scheduling options like Volcano, while also outlining practical deployment steps and performance considerations.


Terminology

CUDA (Compute Unified Device Architecture) is Nvidia's parallel computing platform and programming model for GPU‑accelerated applications. RootFS is the root filesystem a Linux system mounts at boot. GPU architecture names such as Volta, Pascal and Kepler denote successive Nvidia hardware generations.

GPU Virtualization Framework on Kubernetes

Beyond hardware‑level virtualization, most solutions intercept CUDA calls. Examples include Alibaba cGPU, Baidu qGPU, Volcano mGPU and Lingque Cloud vGPU. Nvidia's own stack (driver + CUDA toolkit + nvidia‑container‑runtime), combined with Lingque's enhancements, provides the most complete feature set for containerised AI workloads.

Container‑side: CUDA Toolkit

A typical GPU container stack consists of the business application, the CUDA Toolkit, and the container RootFS, running on a host that has Nvidia drivers and one or more GPUs.

GPU stack diagram

The CUDA Toolkit adds three key components:

nvidia-container-runtime (shim): a lightweight wrapper around the native runc that injects Nvidia‑specific hooks and device mounts.

nvidia-container-runtime-hook: a prestart hook executed by runc to modify the container spec and request GPU devices.

nvidia-container library and CLI (libnvidia-container / nvidia-container-cli): a library and command‑line tool that configure containers to use Nvidia GPUs, independent of the underlying container runtime.
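How the runtime wrapper is wired in depends on the container runtime. With containerd, the registration might look roughly like the following sketch (the paths shown are typical defaults and may differ on your installation):

```toml
# /etc/containerd/config.toml — registers nvidia-container-runtime as an
# alternative OCI runtime alongside the default runc
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```

Pods can then select this runtime through a Kubernetes RuntimeClass, or it can be set as the node's default runtime.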

CUDA toolkit before integration
CUDA toolkit after integration

Kubernetes Device Plugin

A Device Plugin extends the kubelet to expose hardware resources (GPU, FPGA, TPU, etc.) as schedulable extended resources. The plugin runs a gRPC server on a Unix socket under /var/lib/kubelet/device-plugins/ and implements the following service definition:

service DevicePlugin {
  rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
  rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
  rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
  rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}
  rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}

Typical implementation steps:

Initialize the plugin and verify that the GPU devices are ready.

Start the gRPC service on a Unix socket under /var/lib/kubelet/device-plugins/ (for example nvidia.sock).

Register the plugin with the kubelet through its Registration service on /var/lib/kubelet/device-plugins/kubelet.sock, passing the plugin's socket name and the resource name it advertises (for example nvidia.com/gpu).

Handle Allocate requests to inject device nodes, environment variables, mounts, or CDI specifications into the container.
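Once registered, the resource advertised by the plugin is requested like any other resource limit. A minimal sketch (the image tag is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # resource name advertised by the device plugin
```

The kubelet invokes the plugin's Allocate RPC for this container and mounts the assigned device before it starts.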

GPU Scheduling and Enhancements

The native Kubernetes scheduler can allocate GPUs using the default “best‑effort” policy. For more advanced placement—e.g., co‑locating related pods on the same GPU, balancing workloads across multiple GPUs, or enforcing QoS—custom schedulers or extensions are required. The Volcano scheduler provides a rich set of policies (gang‑scheduling, fair‑share, queue, preemption, topology‑aware, etc.) that are well‑suited for high‑performance AI workloads.
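As a hedged sketch of Volcano's gang scheduling, a Volcano Job that places all replicas together or not at all might look like this (the image and job names are placeholders):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-training        # placeholder name
spec:
  schedulerName: volcano
  minAvailable: 2            # gang scheduling: all-or-nothing placement
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: myrepo/trainer:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

With minAvailable set to the full replica count, no pod in the gang starts until GPUs are available for all of them, avoiding deadlocks where a distributed job holds some GPUs while waiting for others.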

Pod scheduling flow

Virtualization Techniques Comparison

Time‑Slicing: temporal partitioning that shares a single GPU among multiple processes by rapidly switching contexts. Unlimited partitions, but no memory isolation and no QoS guarantees.

MPS (Multi‑Process Service): logical partitioning with up to 48 client partitions. It provides memory protection and reduces context‑switch overhead, and is supported on Kepler and newer architectures (compute capability 3.5 or higher), on Linux only.

MIG (Multi‑Instance GPU): physical partitioning introduced with Nvidia Ampere, allowing up to 7 isolated GPU instances, each with dedicated memory, SMs and QoS guarantees.

Time‑Slicing

Multiple CUDA applications share a GPU by rapidly switching contexts. This incurs extra latency, lacks memory isolation, and can cause OOM errors when one process exhausts memory.
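With Nvidia's k8s-device-plugin, time‑slicing is typically enabled through a sharing config passed to the plugin; a minimal sketch (the replica count is illustrative):

```yaml
# Device plugin config enabling time-slicing: each physical GPU is
# advertised as 4 schedulable nvidia.com/gpu replicas
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Note that the replicas only multiply the schedulable resource count; the pods sharing a GPU still contend for its full memory, so the OOM risk described above remains.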

MPS

MPS aggregates multiple CUDA streams or processes into a single GPU context, improving utilization and reducing context‑switch overhead. It offers memory protection but is limited to Linux and requires compute capability 3.5 or higher.
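On a bare host (outside Kubernetes), the MPS control daemon is managed roughly as follows; these commands require an Nvidia GPU and driver to be present:

```shell
# Pin the MPS server to GPU 0 and start the control daemon in the
# background (-d)
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# CUDA processes launched afterwards transparently attach to the
# shared MPS server instead of creating their own GPU contexts

# Shut the daemon down when finished
echo quit | nvidia-cuda-mps-control
```

In Kubernetes, a vendor MPS device plugin typically runs this daemon on the node and shares its IPC namespace with client pods, which is why the sample pod specification below sets hostIPC: true.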

MIG

MIG splits a physical GPU into up to seven independent instances, each with its own memory, SM units and compute resources. It provides full memory isolation, QoS guarantees, and fault isolation, making it suitable for multi‑tenant environments and mixed workloads.
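MIG instances are created with nvidia-smi on an Ampere or newer GPU. A hedged sketch (profile IDs vary by GPU model; ID 9 corresponds to the 3g.20gb profile on an A100‑40GB):

```shell
# Enable MIG mode on GPU 0 (may require draining workloads and a GPU reset)
nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this GPU supports
nvidia-smi mig -lgip

# Create two GPU instances from profile 9 and their default compute
# instances in one step (-C)
nvidia-smi mig -cgi 9,9 -C
```

Nvidia's device plugin can then expose each instance as its own extended resource (for example nvidia.com/mig-3g.20gb), so pods schedule against a specific slice rather than a whole GPU.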

MIG partition example

Sample Pod Specification for MPS

apiVersion: v1
kind: Pod
metadata:
  name: mps-gpu-pod
spec:
  restartPolicy: Never
  hostIPC: true
  securityContext:
    runAsUser: 1000
  containers:
    - name: cuda-container
      image: myrepo/cuda:latest
      resources:
        limits:
          nvidia.com/mps-core: 50
          nvidia.com/mps-memory: 8

References

CUDA Toolkit documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

Kubernetes Device Plugin guide: https://kubernetes.io/zh-cn/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/

Nvidia MIG user guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

Nvidia MPS documentation: https://docs.nvidia.com/deploy/mps/index.html

Kubernetes scheduler framework: https://kubernetes.io/zh-cn/docs/concepts/scheduling-eviction/scheduling-framework/

Volcano scheduler documentation: https://volcano.sh/zh/docs/

Written by

Cloud Native Technology Community

The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
