iQIYI GPU Virtual Sharing for AI Inference: Architecture, Isolation, and Scheduling
iQIYI created a custom GPU‑virtual‑sharing system that intercepts CUDA calls to enforce per‑container memory limits, rewrites kernel launches for compute isolation, and integrates with a Kubernetes scheduler extender, allowing multiple AI inference containers to share a single V100 with minimal overhead and more than doubling overall GPU utilization.
With the rapid development of artificial‑intelligence technologies, iQIYI’s online services increasingly rely on deep‑learning models. Each container instance traditionally occupies a whole GPU to meet millisecond‑level latency, which leads to low GPU utilization because requests cannot be batched and arrive at random intervals. Figure 1 shows the typical utilization curve with pronounced peaks and troughs.
The most direct remedy is to share a single GPU among multiple services. NVIDIA provides two official sharing mechanisms: vGPU (virtual GPU based on SR‑IOV) and MPS (Multi‑Process Service). Briefly comparing the two: vGPU requires a license, cannot change partitions without a GPU reboot, and needs a VM layer; MPS is more flexible and integrates well with Docker, but a failure of the MPS daemon affects all processes on the GPU.
To better fit iQIYI’s container‑centric AI workloads, a custom GPU‑virtual‑sharing solution was built. It intercepts CUDA APIs (e.g., cuDeviceTotalMem, cuMemGetInfo, cuMemAlloc, cuMemFree) via LD_PRELOAD to enforce per‑container memory quotas and to return “out‑of‑memory” errors when the quota is exceeded. This enables TensorFlow and other frameworks to consume only the memory they are allocated, without requiring user‑side configuration changes.
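The quota bookkeeping behind such a shim can be sketched as follows. This is a minimal simulation of the accounting logic only: a real interposer would resolve the genuine driver symbols with dlsym(RTLD_NEXT, ...) and would have to record the size of each allocation (cuMemFree receives only a pointer, not a size); the shim_* names, the malloc stand-in for device allocation, and the explicit size argument to shim_mem_free are our simplifications, not the actual implementation.

```c
/* Sketch of the per-container memory-quota accounting that an
 * LD_PRELOAD shim around cuMemAlloc/cuMemFree could perform.
 * Driver calls are stubbed with host malloc/free so the logic is
 * self-contained; names prefixed shim_ are hypothetical. */
#include <stddef.h>
#include <stdlib.h>

#define SHIM_CUDA_SUCCESS   0
#define SHIM_CUDA_ERROR_OOM 2  /* mirrors CUDA_ERROR_OUT_OF_MEMORY */

static size_t g_quota = 0;  /* bytes this container may use */
static size_t g_used  = 0;  /* bytes currently allocated    */

/* Initialize from the container's quota, e.g. parsed once from an
 * environment variable such as ALIYUN_COM_GPU_MEM_CONTAINER. */
void shim_init(size_t quota_bytes) { g_quota = quota_bytes; g_used = 0; }

/* Interposed allocation: reject requests that would exceed the quota,
 * so the framework sees an ordinary out-of-memory error. */
int shim_mem_alloc(void **dptr, size_t bytes) {
    if (g_used + bytes > g_quota)
        return SHIM_CUDA_ERROR_OOM;
    *dptr = malloc(bytes);           /* stand-in for the real cuMemAlloc */
    if (*dptr == NULL)
        return SHIM_CUDA_ERROR_OOM;
    g_used += bytes;
    return SHIM_CUDA_SUCCESS;
}

/* Interposed free: a real shim would look up the recorded size for
 * this pointer; here the caller passes it for simplicity. */
int shim_mem_free(void *dptr, size_t bytes) {
    free(dptr);                      /* stand-in for the real cuMemFree */
    if (bytes <= g_used)
        g_used -= bytes;
    return SHIM_CUDA_SUCCESS;
}

/* Mirrors cuMemGetInfo, but reports the quota rather than the whole
 * card, so frameworks size their memory pools against the slice. */
void shim_mem_get_info(size_t *free_b, size_t *total_b) {
    *total_b = g_quota;
    *free_b  = g_quota - g_used;
}
```

Returning the quota from the interposed cuMemGetInfo/cuDeviceTotalMem is what lets an unmodified TensorFlow see only its slice and stay within it.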
For compute isolation, the solution dynamically rewrites kernel launch parameters. By shrinking the grid size from <<<15,1>>> to a smaller value (e.g., 5 blocks), the kernel is confined to a subset of SMs, effectively allocating a configurable fraction of the GPU’s compute power. Since blockIdx and threadIdx are read‑only, the implementation replaces them with writable registers and inserts a branch at the kernel’s exit point that loops back with updated indices until all of the original blocks have been processed, as illustrated in Figures 7 and 8.
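The index-rewriting idea can be illustrated with a plain-C simulation: work originally launched as 15 logical blocks is executed by only 5 physical workers, each looping over logical block indices through a writable variable, so the total work is unchanged while fewer blocks (and hence fewer SMs) are occupied at any moment. The function and variable names here are ours, and the stride pattern is one plausible way to cover the index range, not necessarily the exact scheme the article's binary rewriting uses.

```c
/* Plain-C simulation of confining a 15-block launch to 5 workers by
 * looping each worker over logical block indices. */
#include <stddef.h>

/* "Kernel body": logical block b records its own index in out[b]. */
static void kernel_body(int logical_block, int *out) {
    out[logical_block] = logical_block;
}

/* Run `logical_blocks` of work on only `physical_blocks` workers.
 * The inner loop variable b plays the role of a writable replacement
 * for the read-only blockIdx register: after the kernel body exits,
 * control branches back with an updated index. */
void launch_confined(int logical_blocks, int physical_blocks, int *out) {
    for (int phys = 0; phys < physical_blocks; ++phys) {
        for (int b = phys; b < logical_blocks; b += physical_blocks)
            kernel_body(b, out);
    }
}
```

Because every logical index is still visited exactly once, the kernel's results are identical to the original launch; only the degree of concurrency changes.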
After the isolation layer was completed, a Kubernetes‑level scheduler was added using the open‑source aliyun‑gpushare extender (SGDP/SGSE). Pods request a custom resource aliyun.com/gpu-mem; the extender patches the pod with the target GPU and sets environment variables (ALIYUN_COM_GPU_MEM_CONTAINER, LD_PRELOAD) so that the container‑side interceptors enforce the quotas. Memory and compute resources are thus sliced per‑pod, allowing up to four different applications to share a single V100 card.
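A pod requesting a slice of GPU memory through the extender might look like the following sketch. The aliyun.com/gpu-mem resource name comes from the aliyun-gpushare extender described above; the pod name, image, and the quota value are illustrative, and the unit of the quota depends on how the extender is configured.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-demo                 # illustrative name
spec:
  containers:
  - name: model-server
    image: registry.example.com/inference:latest   # illustrative image
    resources:
      limits:
        aliyun.com/gpu-mem: 3          # GPU-memory slice; unit per extender config
```

The scheduler extender bin-packs such pods onto GPUs with enough remaining quota, then injects ALIYUN_COM_GPU_MEM_CONTAINER and LD_PRELOAD into the chosen container so the interception layer enforces the limit at run time.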
Performance tests (Figure 2) demonstrate that the isolation incurs minimal overhead: single‑process tests show <10 % loss at 100 % allocation, and multi‑process interference is negligible when each process receives 50 % or 70 % of the compute power. In production, more than 100 deep‑learning services now run on 35 physical GPUs, increasing the average number of services per GPU from 1 to 3 and boosting overall GPU utilization by over 2×.
The article concludes with future work on cross‑host GPU remote invocation to address CPU/GPU imbalance in multi‑GPU machines.
iQIYI Technical Product Team