
How mGPU Enables Efficient GPU Sharing for AI Workloads in Cloud‑Native Environments

This article explains Volcano Engine's mGPU solution, detailing the kernel-level GPU virtualization, container runtime hooks, and scheduling mechanisms that let multiple containers share a single NVIDIA GPU with isolated compute and memory resources, achieving near-lossless performance and up to 50% higher utilization for AI tasks.

ByteDance Cloud Native

The rise of large AI models is driving unprecedented demand for flexible, cost‑effective GPU resources. GPU sharing can dramatically improve utilization and flexibility for high‑performance computing.

mGPU is Volcano Engine’s container‑focused solution that virtualizes NVIDIA GPUs at the kernel level and provides a shared‑GPU framework. It enables multiple containers to share one GPU card while isolating compute and memory resources, reducing costs without sacrificing performance.

Technical Architecture

The overall architecture consists of three core components:

mGPU Kernel Module: intercepts container calls to the NVIDIA driver (open, close, mmap, ioctl, poll) to control compute and memory allocation.

mGPU Container Runtime Hook: hijacks the pre‑start hook of nvidia‑container‑runtime, parses container configuration from environment variables, and sends container creation requests to the mGPU Daemon via RPC.

mGPU Daemon: acts as an RPC server and registers containers through the kernel module's ioctl interface.
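The interaction between these three components can be sketched in user space. The environment variable names, the in-process "daemon", and all method names below are illustrative assumptions; the real mGPU hook and daemon communicate over RPC and register containers through a kernel-module ioctl, which is reduced here to a dictionary update.

```python
# Hypothetical env var names -- mGPU's real variable names are not shown in the article.
ENV_MEM_LIMIT = "MGPU_MEMORY_LIMIT_MIB"
ENV_COMPUTE_WEIGHT = "MGPU_COMPUTE_WEIGHT"


def parse_container_config(env):
    """Runtime-hook step: read GPU limits from the container's environment."""
    return {
        "mem_limit_mib": int(env.get(ENV_MEM_LIMIT, 0)),
        "compute_weight": int(env.get(ENV_COMPUTE_WEIGHT, 0)),
    }


class MgpuDaemon:
    """Stands in for the RPC server; ioctl-based registration with the
    kernel module is modeled as a dict update."""

    def __init__(self):
        self.registry = {}

    def register_container(self, container_id, config):
        # The real daemon would issue an ioctl on the mGPU kernel module here.
        self.registry[container_id] = config
        return True


def prestart_hook(container_id, env, daemon):
    """Hijacked pre-start hook: parse config, then 'RPC' to the daemon."""
    config = parse_container_config(env)
    return daemon.register_container(container_id, config)
```

For example, a container started with `MGPU_MEMORY_LIMIT_MIB=4096` and `MGPU_COMPUTE_WEIGHT=30` would end up registered in the daemon with those limits before any GPU call reaches the driver.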

mGPU architecture diagram

Compute Isolation

GPU tasks are initialized by the GPU kernel driver, creating a push buffer (queue) that is encapsulated as a channel and grouped into a Time Slice Group (TSG). The scheduler selects channels from TSGs for execution, enabling fine‑grained time‑slice sharing. mGPU implements two schedulers:

Hardware‑time‑slice scheduler: intercepts ioctl calls that set hardware time slices and scales them proportionally.

Software‑time‑slice scheduler: creates a kernel thread per GPU and dynamically enables or disables channels based on configured compute weights, achieving precise QoS.
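The enable/disable idea behind the software scheduler can be sketched in user space as weighted round-robin over scheduler ticks. The real implementation is a kernel thread toggling hardware channels; the container names, weights, and tick loop here are illustrative.

```python
def schedule_ticks(weights, total_ticks):
    """Distribute scheduler ticks across containers in proportion to their
    configured compute weights, mimicking per-round channel enable/disable.

    weights: dict of container_id -> compute weight (positive ints)
    Returns dict of container_id -> ticks granted.
    """
    total_weight = sum(weights.values())
    granted = {cid: 0 for cid in weights}
    for _ in range(total_ticks):
        # Grant the tick to the container whose received share lags its
        # weighted share the most (largest deficit), so long-run shares
        # converge to the configured weights.
        cid = max(
            weights,
            key=lambda c: weights[c] / total_weight - granted[c] / total_ticks,
        )
        granted[cid] += 1
    return granted
```

With weights 3:1 over 100 ticks, the first container ends up with 75 ticks and the second with 25, matching the configured compute-weight ratio.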

Memory Isolation

CUDA memory‑management APIs are unified into ioctl operations on the nvidiactl character device. mGPU creates a virtual GPU card for each container, intercepting allocation, release, and query requests. It enforces per‑container memory limits, returns OOM when a limit is exceeded, records allocation metadata, and forwards legitimate requests to the NVIDIA kernel driver.
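The per-container accounting can be sketched in user space. The real mGPU does this inside the kernel module against ioctl requests on nvidiactl; the class and method names below are illustrative, and forwarding to the NVIDIA driver is reduced to a comment.

```python
class VirtualGpu:
    """Per-container virtual GPU card: tracks allocations against a limit
    and rejects requests that would exceed it -- the container sees OOM."""

    def __init__(self, mem_limit_bytes):
        self.mem_limit = mem_limit_bytes
        self.used = 0
        self.allocations = {}  # handle -> size, i.e. allocation metadata
        self._next_handle = 1

    def alloc(self, size):
        if self.used + size > self.mem_limit:
            return None  # real module returns an OOM error to the ioctl caller
        handle = self._next_handle
        self._next_handle += 1
        self.allocations[handle] = size
        self.used += size
        # A legitimate request would now be forwarded to the NVIDIA kernel
        # driver; in this sketch the bookkeeping is the whole story.
        return handle

    def free(self, handle):
        self.used -= self.allocations.pop(handle)

    def mem_info(self):
        """Query path: report the container-scoped view, not the whole card."""
        return (self.mem_limit - self.used, self.mem_limit)  # (free, total)
```

A container with a 1 GiB limit that has 512 MiB allocated will see 512 MiB free from a memory query even if the physical card has far more available, and a further 1 GiB request fails with OOM.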

GPU task flow diagram

Performance Evaluation

Using a V100/32GB server for ResNet‑50 inference, the test shows that enabling mGPU virtually eliminates performance loss even when GPU utilization is saturated.

Performance comparison chart

Conclusion

mGPU, built on ByteDance’s cloud‑native experience, provides strong GPU isolation, fine‑grained scheduling, and near‑lossless performance, boosting resource utilization by over 50%. It helps enterprises meet the growing AI‑chip demand with cost‑effective, high‑performance compute.

cloud native · Resource Isolation · AI workloads · container runtime · GPU sharing
Written by ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.