Survey of GPU Sharing and Virtualization Solutions for Kubernetes
This article surveys open-source GPU sharing and virtualization approaches for AI workloads on Kubernetes, comparing soft isolation, CUDA-layer isolation, NVIDIA MPS, driver-level isolation, GPU pooling, and deep-learning memory sharing, and highlighting their architectures, isolation guarantees, and performance trade-offs.
AI workloads frequently rely on GPUs, which are considerably more expensive than CPU or memory resources. Implementing QoS‑based GPU sharing/virtualization that provides fault, memory, and compute isolation while maintaining application performance is therefore a critical differentiator for multi‑tenant clusters.
Several open-source solutions are currently available, generally falling into six categories:
Soft isolation (no true isolation, multiple Pods per GPU): Alibaba Cloud gpushare-scheduler-extender and gpushare-device-plugin, NVIDIA Time‑Slicing.
CUDA‑layer isolation (vcuda): Tencent tkestack vcuda‑controller (CUDA wrapper), gpu‑manager (device plugin), gpu‑admission (scheduler extender), and HAMI.
NVIDIA MPS: NVIDIA Multi‑Process Service.
Driver‑level isolation: Alibaba Cloud cGPU, Tencent Cloud qGPU, Volcano Engine mGPU.
GPU pooling: VirtAI Tech GPU pooling.
Deep-learning shared memory: the AntMan deep-learning shared-memory approach.
These solutions share a similar architecture: a scheduler extender plus a device plugin. The device plugin advertises a new GPU resource type (for example, GPU memory), and the scheduler extender maintains a per-GPU allocation metric that drives placement decisions, as sketched below.
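To make this concrete, here is a minimal Go sketch of the allocation-metric idea; the type names, the fixed per-GPU capacity, and requests expressed in whole GiB are assumptions for illustration, not the actual gpushare-scheduler-extender API.

```go
// Minimal sketch of the "scheduler extender + per-GPU allocation metric"
// pattern described above. All names and units are illustrative.
package main

import "fmt"

// gpuState tracks, per node, how much GPU memory (GiB) is already allocated
// on each physical GPU. The extender keeps this metric up to date as Pods
// are bound and consults it when kube-scheduler asks it to filter nodes.
type gpuState struct {
	allocated map[string][]int // allocated[node][gpuIndex] = GiB in use
	capacity  int              // per-GPU memory in GiB (assumed uniform)
}

// fits reports whether any single GPU on the node still has room for a Pod
// requesting req GiB; sharing means several Pods may land on one GPU as long
// as their declared memory requests fit.
func (s *gpuState) fits(node string, req int) (gpuIndex int, ok bool) {
	for i, used := range s.allocated[node] {
		if s.capacity-used >= req {
			return i, true
		}
	}
	return -1, false
}

// bind records an allocation once the scheduler picks a node and GPU, so
// later filter calls see the reduced free memory.
func (s *gpuState) bind(node string, gpuIndex, req int) {
	s.allocated[node][gpuIndex] += req
}

func main() {
	state := &gpuState{
		allocated: map[string][]int{"node-a": {10, 0}}, // GPU0 already has 10 GiB in use
		capacity:  16,
	}
	if idx, ok := state.fits("node-a", 8); ok { // an 8 GiB request only fits on GPU1
		state.bind("node-a", idx, 8)
		fmt.Println("placed on GPU", idx)
	}
}
```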
Key characteristics observed:
Soft-isolation schemes lack isolation; they simply allow multiple Pods to attach to the same GPU, leaving over-commit checks to the applications (see the sketch after this list).
vcuda provides isolation by intercepting CUDA APIs, but measured latency is higher for inference workloads and the implementation must track CUDA API changes.
NVIDIA MPS offers better raw performance, yet it does not provide fault isolation, and community research on extending it for isolation is limited.
Driver-level isolation solutions (cGPU, qGPU, mGPU) improve performance compared with the earlier approaches, but they are currently only available as public-cloud offerings and therefore cannot raise GPU utilization in on-premise clusters.
The AntMan deep-learning shared-memory method integrates with deep-learning framework runtimes but lacks a standardized interface.
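As a concrete illustration of the soft-isolation item above, the sketch below shows the general fan-out idea: the device plugin advertises several logical devices per physical GPU, so kubelet will admit that many Pods while nothing enforces memory or compute limits. The identifiers and replica scheme are hypothetical and not taken from the NVIDIA k8s-device-plugin source.

```go
// Hypothetical sketch of soft isolation: one physical GPU is advertised to
// kubelet as several logical devices, so multiple Pods can be scheduled onto
// it, but nothing limits what any one Pod actually consumes.
package main

import "fmt"

// virtualDevices fans each physical GPU UUID out into `replicas` logical
// device IDs. Kubernetes then sees len(gpus)*replicas schedulable devices.
func virtualDevices(gpus []string, replicas int) []string {
	var devs []string
	for _, uuid := range gpus {
		for r := 0; r < replicas; r++ {
			devs = append(devs, fmt.Sprintf("%s::%d", uuid, r))
		}
	}
	return devs
}

func main() {
	// One physical GPU exposed as four schedulable devices.
	for _, d := range virtualDevices([]string{"GPU-abc123"}, 4) {
		fmt.Println(d)
	}
}
```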
References:
gpushare‑scheduler‑extender: https://github.com/AliyunContainerService/gpushare-scheduler-extender
gpushare‑device‑plugin: https://github.com/AliyunContainerService/gpushare-device-plugin
NVIDIA Time‑Slicing: https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
vcuda‑controller: https://github.com/tkestack/vcuda-controller
gpu‑manager: https://github.com/tkestack/gpu-manager
gpu‑admission: https://github.com/tkestack/gpu-admission
HAMI: https://github.com/Project-HAMi/HAMi
NVIDIA MPS: https://docs.nvidia.com/deploy/mps/
cGPU: https://developer.aliyun.com/article/771984
qGPU: https://cloud.tencent.com/developer/article/1831090
mGPU: https://www.volcengine.com/docs/6460/159262
VirtAI Tech GPU pooling: https://virtaitech.com/product.pdf
AntMan deep-learning sharing (USENIX OSDI '20): https://www.usenix.org/conference/osdi20/presentation/xiao