
GPU Virtual Sharing for AI Inference Services on Kubernetes

The article presents a GPU virtual‑sharing solution for AI inference workloads that isolates memory and compute resources via CUDA API interception, integrates with Kubernetes using the open‑source aliyun‑gpushare scheduler, and demonstrates doubled GPU utilization and minimal performance loss across multiple tests.

DataFunTalk

With the rapid growth of AI services at iQIYI, many online inference containers require exclusive GPU access, leading to low GPU utilization because each request runs in isolation and cannot be batched. The article first outlines the inefficiencies of current GPU usage and the need for better sharing mechanisms.

Two official Nvidia sharing technologies are introduced: vGPU (SR‑IOV based, requiring a VM layer and a license, with limited re‑configuration) and MPS (software‑based and flexible, but subject to a single point of failure). Both have drawbacks for container‑native AI workloads.

To address these issues, iQIYI engineers built a custom container‑level GPU sharing solution. By intercepting CUDA APIs (e.g., cuDeviceTotalMem, cuMemGetInfo, cuMemAlloc, cuMemFree) via LD_PRELOAD, they enforce per‑process memory quotas and prevent over‑allocation. The same interception also enables compute‑share isolation by dynamically adjusting kernel launch parameters and “trapping” kernels on a subset of SMs, effectively partitioning GPU compute power.
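The bookkeeping behind such a shim can be illustrated with a small sketch. The Python class below is a hypothetical analog, not iQIYI's actual code: the real shim is a C shared library loaded via LD_PRELOAD that resolves the genuine CUDA driver entry points with dlsym(RTLD_NEXT, ...). The sketch shows the essential logic — report the quota instead of the physical card size, and refuse allocations that would exceed it.

```python
# Hypothetical Python analog of the LD_PRELOAD shim's memory bookkeeping.
# The real implementation wraps cuDeviceTotalMem / cuMemGetInfo /
# cuMemAlloc / cuMemFree in C; names here are illustrative only.

class GpuMemQuota:
    def __init__(self, physical_total, quota):
        # physical_total: real device memory in bytes; quota: this
        # container's byte budget.
        self.physical_total = physical_total
        self.quota = quota
        self.used = 0
        self._next_handle = 1
        self._allocs = {}  # handle -> allocation size

    def device_total_mem(self):
        # cuDeviceTotalMem analog: report the quota, not the card size.
        return min(self.quota, self.physical_total)

    def mem_get_info(self):
        # cuMemGetInfo analog: (free, total) computed against the quota.
        total = self.device_total_mem()
        return total - self.used, total

    def mem_alloc(self, size):
        # cuMemAlloc analog: reject requests past the quota (the real
        # shim would return CUDA_ERROR_OUT_OF_MEMORY).
        if self.used + size > self.quota:
            raise MemoryError("over quota")
        self.used += size
        handle = self._next_handle
        self._next_handle += 1
        self._allocs[handle] = size
        return handle

    def mem_free(self, handle):
        # cuMemFree analog: return the bytes to the container's budget.
        self.used -= self._allocs.pop(handle)
```

For example, with a 16 GiB card and a 4 GiB quota, `device_total_mem()` reports 4 GiB, so a framework that sizes its buffers from the reported total stays inside its slice, and an allocation that would push usage past 4 GiB fails just as a full card would.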

Performance tests show that single‑process compute isolation retains >90% of native performance at 100%, 50%, and 10% compute allocations, and that multi‑process interference is negligible when each process receives a proportional share (e.g., 50%/50% or 70%/30%). After deployment, average GPU utilization more than doubled, the number of services per GPU increased from 1 to 3, and over 100 AI containers now share 35 physical GPUs without mutual impact.

The underlying isolation mechanisms are detailed: memory isolation separates CUDA kernel context, model parameters, and intermediate buffers; compute isolation limits the number of SMs a kernel can occupy by modifying block sizes and inserting custom branching logic in the kernel’s assembly to keep it confined to the allocated SM subset.
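The launch-parameter side of compute isolation can be sketched with the arithmetic alone. The helpers below are hypothetical, assuming only what the article states — that kernels are confined to a subset of SMs proportional to the container's share — and illustrate how a requested grid might be capped so that no more blocks are concurrently resident than the allocated SMs can host.

```python
import math

# Illustrative arithmetic only (hypothetical helpers, not iQIYI's code):
# a container entitled to `share` of the compute on a GPU with
# `total_sms` SMs gets its kernels confined to a subset of SMs.

def allowed_sms(total_sms, share):
    # Number of SMs this container's kernels may occupy (at least 1).
    return max(1, math.floor(total_sms * share))

def adjusted_grid(requested_blocks, blocks_per_sm, total_sms, share):
    # Cap concurrently resident blocks to what the allowed SMs can
    # host; the remaining blocks are worked off by the branching logic
    # injected into the kernel, which keeps it on the allocated subset.
    sms = allowed_sms(total_sms, share)
    return min(requested_blocks, sms * blocks_per_sm)
```

On an 80‑SM card with a 50% share and 2 resident blocks per SM, a 1024‑block launch would be capped to 80 concurrently resident blocks, with the rest processed iteratively.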

For scheduling, the solution integrates with Kubernetes using the open‑source aliyun‑gpushare extender. New components, the Share GPU Device Plugin (SGDP) and the Share GPU Scheduler Extender (SGSE), handle custom resource requests (e.g., aliyun.com/gpu-mem), node selection, pod annotation patching, and environment‑variable injection (including LD_PRELOAD) to enforce the quotas at runtime.
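The per-node GPU selection step can be sketched as follows. The function and its binpack policy (pick the fullest GPU that still fits, so large future requests keep whole cards free) are assumptions for illustration; the article does not spell out SGSE's exact placement rule.

```python
# Hypothetical sketch of the GPU-selection step a scheduler extender
# might perform for an aliyun.com/gpu-mem request. The binpack policy
# is an assumption, not necessarily SGSE's actual rule.

def pick_gpu(free_mem_per_gpu, request_mib):
    """free_mem_per_gpu: {gpu_index: free MiB}; request_mib: MiB asked
    for via aliyun.com/gpu-mem. Returns a GPU index, or None if no
    card on this node can satisfy the request (node filtered out)."""
    candidates = [(free, idx) for idx, free in free_mem_per_gpu.items()
                  if free >= request_mib]
    if not candidates:
        return None
    # Binpack: the smallest free pool that still fits the request.
    return min(candidates)[1]
```

The chosen index would then be written back as a pod annotation, which the device plugin reads when it injects the container's environment (device visibility, quota, and LD_PRELOAD).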

In practice, a single V100 GPU can be split into 1/4 or 1/2 memory/compute partitions, allowing up to four independent applications per card with strong isolation. The article concludes with future work on cross‑host GPU sharing to address CPU/GPU imbalance in multi‑GPU nodes.
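The partition arithmetic is simple but worth making concrete. The helper below is illustrative; the article does not state the V100's memory size, so 32 GB is assumed for the example.

```python
def partitions(total_mem_gb, fraction):
    # Split one card into equal memory/compute slices of `fraction`
    # each; returns (number of slices, GB per slice). Illustrative
    # only -- 32 GB V100 is an assumed card size.
    count = int(1 / fraction)
    return count, total_mem_gb * fraction
```

A 1/4 split of a 32 GB card yields four 8 GB slices, matching the article's figure of up to four independent applications per card.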

Tags: Kubernetes, CUDA, Nvidia, Resource Scheduling, GPU virtualization, deep learning inference
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
