How mGPU Enables Efficient GPU Sharing for AI Workloads in Cloud‑Native Environments
The article explains the mGPU solution from Volcano Engine, detailing its kernel‑level GPU virtualization, container runtime hooks, and scheduling mechanisms that allow multiple containers to share a single NVIDIA GPU with isolated compute and memory resources, achieving near‑lossless performance and up to 50% higher utilization for AI tasks.
The rise of large AI models is driving unprecedented demand for flexible, cost‑effective GPU resources. GPU sharing technology can dramatically improve utilization and flexibility for high‑performance computing.
mGPU is Volcano Engine’s container‑focused solution that virtualizes NVIDIA GPUs at the kernel level and provides a shared‑GPU framework. It enables multiple containers to share one GPU card while isolating compute and memory resources, reducing costs without sacrificing performance.
Technical Architecture
The overall architecture consists of three core components:
mGPU Kernel Module: intercepts container calls to the NVIDIA driver (open, close, mmap, ioctl, poll) to control compute and memory allocation.
mGPU Container Runtime Hook: hijacks the pre‑start hook of nvidia‑container‑runtime, parses the container configuration from environment variables, and sends a container‑creation request to the mGPU Daemon via RPC.
mGPU Daemon: acts as an RPC server and registers containers through the kernel module's ioctl interface.
Compute Isolation
When a GPU task is initialized, the kernel driver creates a push buffer (command queue) for it; each push buffer is encapsulated as a channel, and channels are grouped into a Time Slice Group (TSG). The GPU scheduler selects channels from TSGs for execution, enabling fine‑grained time‑slice sharing. mGPU implements two schedulers:
Hardware‑time‑slice scheduler: intercepts the ioctl calls that set hardware time slices and scales them in proportion to each container's configured compute share.
Software‑time‑slice scheduler: creates a kernel thread per GPU that dynamically enables or disables channels based on configured compute weights, achieving precise QoS.
Memory Isolation
CUDA memory‑management APIs are unified into ioctl operations on the nvidiactl character device. mGPU creates a virtual GPU card for each container, intercepting allocation, release, and query requests. It enforces per‑container memory limits, returns OOM when a limit is exceeded, records allocation metadata, and forwards legitimate requests to the NVIDIA kernel driver.
Performance Evaluation
On a V100 (32 GB) server running ResNet‑50 inference, the test shows that enabling mGPU introduces virtually no performance loss, even when GPU utilization is saturated.
Conclusion
mGPU, built on ByteDance’s cloud‑native experience, provides strong GPU isolation, fine‑grained scheduling, and near‑lossless performance, boosting resource utilization by over 50%. It helps enterprises meet the growing AI‑chip demand with cost‑effective, high‑performance compute.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.