Survey of GPU Sharing and Virtualization Solutions for Kubernetes
This article surveys open-source GPU sharing and virtualization approaches for AI workloads on Kubernetes, comparing soft isolation, CUDA-layer isolation, NVIDIA MPS, driver-level isolation, GPU pooling, and deep-learning memory sharing, and highlighting their architectures, isolation guarantees, and performance trade-offs.
AI workloads frequently rely on GPUs, which are considerably more expensive than CPU or memory resources. Implementing QoS‑based GPU sharing/virtualization that provides fault, memory, and compute isolation while maintaining application performance is therefore a critical differentiator for multi‑tenant clusters.
Several open-source solutions are currently available, generally falling into six categories:
Soft isolation (no true isolation, multiple Pods per GPU): Alibaba Cloud gpushare-scheduler-extender and gpushare-device-plugin, NVIDIA Time‑Slicing.
CUDA‑layer isolation (vcuda): Tencent tkestack vcuda‑controller (CUDA wrapper), gpu‑manager (device plugin), gpu‑admission (scheduler extender), and HAMI.
NVIDIA MPS: NVIDIA Multi‑Process Service.
Driver‑level isolation: Alibaba Cloud cGPU, Tencent Cloud qGPU, Volcano Engine mGPU.
GPU pooling: VirtAI Tech GPU pooling.
Deep-learning shared memory: the AntMan deep-learning shared-memory approach.
These solutions share a similar architecture: a scheduler extender plus a device plugin. The device plugin advertises a new GPU resource type (for example, GPU memory), and the scheduler extender maintains a per-GPU allocation metric that drives placement decisions, as sketched below.
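To make this concrete, here is a minimal Go sketch of the allocation-metric idea; the type names, the fixed per-GPU capacity, and requests expressed in whole GiB are assumptions for illustration, not the actual gpushare-scheduler-extender API.

```go
// Minimal sketch of the "scheduler extender + per-GPU allocation metric"
// pattern described above. All names and units are illustrative.
package main

import "fmt"

// gpuState tracks, per node, how much GPU memory (GiB) is already allocated
// on each physical GPU. The extender keeps this metric up to date as Pods
// are bound and consults it when kube-scheduler asks it to filter nodes.
type gpuState struct {
	allocated map[string][]int // allocated[node][gpuIndex] = GiB in use
	capacity  int              // per-GPU memory in GiB (assumed uniform)
}

// fits reports whether any single GPU on the node still has room for a Pod
// requesting req GiB; sharing means several Pods may land on one GPU as long
// as their declared memory requests fit.
func (s *gpuState) fits(node string, req int) (gpuIndex int, ok bool) {
	for i, used := range s.allocated[node] {
		if s.capacity-used >= req {
			return i, true
		}
	}
	return -1, false
}

// bind records an allocation once the scheduler picks a node and GPU, so
// later filter calls see the reduced free memory.
func (s *gpuState) bind(node string, gpuIndex, req int) {
	s.allocated[node][gpuIndex] += req
}

func main() {
	state := &gpuState{
		allocated: map[string][]int{"node-a": {10, 0}}, // GPU0 already has 10 GiB in use
		capacity:  16,
	}
	if idx, ok := state.fits("node-a", 8); ok { // an 8 GiB request only fits on GPU1
		state.bind("node-a", idx, 8)
		fmt.Println("placed on GPU", idx)
	}
}
```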
Key characteristics observed:
Soft-isolation schemes lack isolation; they simply allow multiple Pods to attach to the same GPU, leaving over-commit checks to the applications (see the sketch after this list).
vcuda provides isolation by intercepting CUDA APIs, but measured latency is higher for inference workloads and the implementation must track CUDA API changes.
NVIDIA MPS offers better raw performance, yet it does not provide fault isolation, and community research on extending it for isolation is limited.
Driver-level isolation solutions (cGPU, qGPU, mGPU) improve performance compared with the earlier approaches, but they are currently only available as public-cloud offerings and therefore cannot raise GPU utilization in on-premise clusters.
The AntMan deep-learning shared-memory method integrates with deep-learning framework runtimes but lacks a standardized interface.
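As a concrete illustration of the soft-isolation item above, the sketch below shows the general fan-out idea: the device plugin advertises several logical devices per physical GPU, so kubelet will admit that many Pods while nothing enforces memory or compute limits. The identifiers and replica scheme are hypothetical and not taken from the NVIDIA k8s-device-plugin source.

```go
// Hypothetical sketch of soft isolation: one physical GPU is advertised to
// kubelet as several logical devices, so multiple Pods can be scheduled onto
// it, but nothing limits what any one Pod actually consumes.
package main

import "fmt"

// virtualDevices fans each physical GPU UUID out into `replicas` logical
// device IDs. Kubernetes then sees len(gpus)*replicas schedulable devices.
func virtualDevices(gpus []string, replicas int) []string {
	var devs []string
	for _, uuid := range gpus {
		for r := 0; r < replicas; r++ {
			devs = append(devs, fmt.Sprintf("%s::%d", uuid, r))
		}
	}
	return devs
}

func main() {
	// One physical GPU exposed as four schedulable devices.
	for _, d := range virtualDevices([]string{"GPU-abc123"}, 4) {
		fmt.Println(d)
	}
}
```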
References:
gpushare‑scheduler‑extender: https://github.com/AliyunContainerService/gpushare-scheduler-extender
gpushare‑device‑plugin: https://github.com/AliyunContainerService/gpushare-device-plugin
NVIDIA Time‑Slicing: https://github.com/NVIDIA/k8s-device-plugin#shared-access-to-gpus-with-cuda-time-slicing
vcuda‑controller: https://github.com/tkestack/vcuda-controller
gpu‑manager: https://github.com/tkestack/gpu-manager
gpu‑admission: https://github.com/tkestack/gpu-admission
HAMI: https://github.com/Project-HAMi/HAMi
NVIDIA MPS: https://docs.nvidia.com/deploy/mps/
cGPU: https://developer.aliyun.com/article/771984
qGPU: https://cloud.tencent.com/developer/article/1831090
mGPU: https://www.volcengine.com/docs/6460/159262
VirtAI Tech GPU pooling: https://virtaitech.com/product.pdf
AntMan deep-learning sharing (USENIX OSDI '20): https://www.usenix.org/conference/osdi20/presentation/xiao