How mGPU Enables Efficient GPU Sharing for AI Workloads
This article explains the mGPU solution that virtualizes NVIDIA GPUs for containers, detailing its driver architecture, compute and memory isolation mechanisms, performance benchmarks on ResNet‑50 inference, and how it boosts GPU utilization by over 50% for AI and high‑performance computing tasks.
Introduction
The rise of large AI models pushes the limits of computing resources, demanding flexible and cost‑effective GPU utilization. mGPU, a container‑level GPU sharing solution from Volcano Engine, enables multiple containers to share a single GPU with fine‑grained compute and memory scheduling while maintaining strict isolation.
Technical Architecture
mGPU consists of a kernel module, a container runtime hook, and a daemon. The kernel module intercepts container calls to the NVIDIA driver (open, close, mmap, ioctl, poll) to control compute and memory resources. The runtime hook captures pre‑start hook calls from the NVIDIA container runtime, extracts configuration from environment variables, and sends container creation requests to the daemon via RPC. The daemon registers containers through the kernel module’s ioctl interface.
Component Overview
1. mGPU Kernel Module intercepts container interactions with the NVIDIA driver to enforce compute and memory control.
2. mGPU Container Runtime Hook hijacks the nvidia‑container‑runtime‑hook/pre‑start call, parses container GPU configuration, and forwards a creation request to the mGPU daemon.
3. mGPU Daemon acts as an RPC server, receiving container creation requests and registering containers through the kernel module’s ioctl interface.
Implementation Principles
Compute Isolation
GPU tasks are submitted through push buffers (command queues); each queue is backed by a channel, and channels are grouped into Time Slice Groups (TSGs). The hardware scheduler selects channels to run based on TSG time slices, which allows the GPU to be time-sliced among tasks. mGPU implements two schedulers:
Hardware time-slice scheduler: intercepts the ioctl calls that set hardware time slices, scales the slices proportionally to each container's compute share, and forwards the adjusted parameters to the native driver.
Software time-slice scheduler: creates a kernel thread per GPU that dynamically enables or disables container channels according to their assigned compute weights, achieving precise QoS.
Memory Isolation
CUDA memory management APIs are funneled through the nvidiactl character device. mGPU creates a virtual GPU card for each container, intercepting allocation, release, and query requests in the kernel module:
If an allocation exceeds the container’s quota, OOM is returned; otherwise the allocation is recorded and forwarded to the NVIDIA driver.
On free, the module releases the recorded memory and forwards the request.
On query, the module returns memory usage limited to the container’s isolation boundaries.
Performance Evaluation
In ResNet-50 inference benchmarks on a V100 (32 GB) server, enabling mGPU has a negligible performance impact: the GPU reaches full load with almost no loss of throughput.
Conclusion
Generative AI drives a surge in demand for high‑performance AI chips. Shared GPU technology like mGPU can increase resource utilization by more than 50% while providing stable, cost‑effective compute, helping enterprises build a robust, cloud‑native heterogeneous computing ecosystem for the AI era.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.