Harnessing Nvidia GPUs in Kubernetes: Virtualization, Scheduling & Best Practices
This article explains how to run Nvidia GPUs on Kubernetes. It covers the CUDA and Nvidia Container Toolkits, device plugins, GPU virtualization techniques such as Time-Slicing, MPS and MIG, and advanced scheduling options like Volcano, along with practical deployment steps and performance considerations.
Terminology
CUDA (Compute Unified Device Architecture) is Nvidia's parallel computing platform and programming model for GPU-accelerated applications. RootFS is the root filesystem of a Linux system; for a container, it is the filesystem unpacked from the container image. GPU architecture names such as Kepler, Pascal and Volta denote successive Nvidia hardware generations.
GPU Virtualization Framework on Kubernetes
Beyond hardware-level virtualization, most GPU-sharing solutions work by intercepting CUDA API calls. Examples include Alibaba cGPU, Tencent qGPU, Volcano Engine mGPU and Lingque Cloud vGPU. Nvidia's own stack (driver + CUDA toolkit + nvidia-container-runtime), combined with Lingque's enhancements, provides the most complete feature set for containerised AI workloads.
Container‑side: CUDA Toolkit
A typical GPU container stack consists of the business application, the CUDA Toolkit, and the container RootFS, running on a host that has Nvidia drivers and one or more GPUs.
The Nvidia Container Toolkit adds three key components on the host side:
nvidia-container-runtime (shim): a lightweight wrapper around the native runC that injects Nvidia-specific hooks and device mounts.
nvidia-container-runtime-hook: a prestart hook executed by runC that modifies the container setup and requests GPU devices via libnvidia-container.
libnvidia-container library and CLI (nvidia-container-cli): a library and command-line tool that automatically configure containers to use Nvidia GPUs, independent of the underlying runtime.
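On Kubernetes, this runtime is typically selected per pod through a RuntimeClass once containerd (or Docker) has been configured with an nvidia runtime handler, which the Container Toolkit's setup performs. A minimal sketch, assuming the handler is registered under the name nvidia:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia        # pods opt in via runtimeClassName: nvidia
handler: nvidia       # must match the runtime name registered with containerd

Pods that set runtimeClassName: nvidia are then launched through nvidia-container-runtime rather than plain runC.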
Kubernetes Device Plugin
A Device Plugin extends the kubelet to expose hardware resources (GPU, FPGA, TPU, etc.) as schedulable extended resources. The plugin runs a gRPC server on a Unix socket under /var/lib/kubelet/device-plugins/ and implements the following service definition:
service DevicePlugin {
rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}
rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
rpc GetPreferredAllocation(PreferredAllocationRequest) returns (PreferredAllocationResponse) {}
rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}
Typical implementation steps:
Initialize the plugin and verify that the GPU devices are ready.
Start the gRPC service on a dedicated Unix socket under /var/lib/kubelet/device-plugins/ (for example, nvidia-gpu.sock).
Register the plugin with the kubelet through /var/lib/kubelet/device-plugins/kubelet.sock once the gRPC server is accepting connections.
Handle Allocate requests to inject device nodes, environment variables, mounts, or CDI specifications into the pod.
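Once the plugin is registered, workloads consume the advertised resource through ordinary limits. A minimal sketch, assuming the standard Nvidia device plugin is running and advertising nvidia.com/gpu (the image tag is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi"]                      # prints the GPU the kubelet allocated
    resources:
      limits:
        nvidia.com/gpu: 1                        # extended resources must be whole integers

The kubelet calls the plugin's Allocate for this pod and injects the device nodes and environment variables the container needs.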
GPU Scheduling and Enhancements
The native Kubernetes scheduler can allocate GPUs using the default “best‑effort” policy. For more advanced placement—e.g., co‑locating related pods on the same GPU, balancing workloads across multiple GPUs, or enforcing QoS—custom schedulers or extensions are required. The Volcano scheduler provides a rich set of policies (gang‑scheduling, fair‑share, queue, preemption, topology‑aware, etc.) that are well‑suited for high‑performance AI workloads.
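A minimal sketch of a Volcano Job that gang-schedules two GPU workers, assuming Volcano is installed and a default queue exists (the image name is illustrative):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-gpu-job
spec:
  schedulerName: volcano
  minAvailable: 2                  # gang scheduling: place both workers or neither
  queue: default
  tasks:
  - replicas: 2
    name: worker
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: myrepo/trainer:latest   # illustrative image
          resources:
            limits:
              nvidia.com/gpu: 1

minAvailable prevents a distributed training job from starting with only part of its workers, which would otherwise deadlock while holding GPUs.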
Virtualization Techniques Comparison
Time-Slicing: temporal partitioning that shares a single GPU among multiple processes by rapidly switching contexts. Unlimited partitions, but no memory isolation and no QoS guarantees.
MPS (Multi-Process Service): logical partitioning with up to 48 partitions; provides memory protection and reduces context-switch overhead. Supported on Kepler and newer architectures (compute capability 3.5 or higher), Linux only.
MIG (Multi-Instance GPU): physical partitioning introduced with Nvidia Ampere, allowing up to 7 isolated GPU instances, each with dedicated memory, SMs and QoS guarantees.
Time‑Slicing
Multiple CUDA applications share a GPU by rapidly switching contexts. This incurs extra latency, lacks memory isolation, and can cause OOM errors when one process exhausts memory.
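One way to enable time-slicing on Kubernetes is the sharing configuration supported by recent versions of the upstream Nvidia device plugin. A sketch, assuming the plugin is deployed to read its config from this ConfigMap (name and namespace are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # illustrative name
  namespace: kube-system              # illustrative namespace
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4                 # advertise each physical GPU as 4 schedulable replicas

With replicas: 4, a node with one GPU reports nvidia.com/gpu: 4, but the four pods sharing it still contend for memory with no isolation between them.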
MPS
MPS aggregates multiple CUDA streams or processes into a single GPU context, improving utilization and reducing context-switch overhead. It offers memory protection but is limited to Linux and requires compute capability 3.5 or higher.
MIG
MIG splits a physical GPU into up to seven independent instances, each with its own memory, SM units and compute resources. It provides full memory isolation, QoS guarantees, and fault isolation, making it suitable for multi‑tenant environments and mixed workloads.
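With the device plugin's mixed MIG strategy, each profile appears as its own extended resource. A sketch, assuming a 1g.5gb instance has already been created on an Ampere GPU (the image tag is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi", "-L"]                # lists only the MIG device granted to this pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1                 # one 1g.5gb MIG instance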
Sample Pod Specification for MPS
apiVersion: v1
kind: Pod
metadata:
  name: mps-gpu-pod
spec:
  restartPolicy: Never
  hostIPC: true                # MPS clients must share the IPC namespace with the MPS control daemon
  securityContext:
    runAsUser: 1000            # MPS typically requires clients to run as the same user as the daemon
  containers:
  - name: cuda-container
    image: myrepo/cuda:latest
    resources:
      limits:
        nvidia.com/mps-core: 50    # vendor-specific resource: share of GPU compute, here 50%
        nvidia.com/mps-memory: 8   # vendor-specific resource: GPU memory, presumably in GiB
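Note that nvidia.com/mps-core and nvidia.com/mps-memory are not upstream Nvidia resource names; they are advertised by the vendor-specific device plugin and scheduler extensions described earlier, so their units (compute share, memory size) follow that plugin's conventions. Recent versions of the upstream Nvidia device plugin instead expose MPS sharing as whole-GPU replicas of nvidia.com/gpu.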
References
Nvidia Container Toolkit documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
Kubernetes Device Plugin guide: https://kubernetes.io/zh-cn/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
Nvidia MIG user guide: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
Nvidia MPS documentation: https://docs.nvidia.com/deploy/mps/index.html
Kubernetes scheduler framework: https://kubernetes.io/zh-cn/docs/concepts/scheduling-eviction/scheduling-framework/
Volcano scheduler documentation: https://volcano.sh/zh/docs/
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
