
How mGPU Enables Efficient GPU Sharing for AI Workloads in Cloud‑Native Environments

This article explains Volcano Engine's mGPU solution, detailing the kernel-level GPU virtualization, container runtime hooks, and scheduling mechanisms that let multiple containers share a single NVIDIA GPU with isolated compute and memory resources, achieving near-lossless performance and up to 50% higher utilization for AI tasks.

ByteDance Cloud Native

The rise of large AI models is driving unprecedented demand for flexible, cost‑effective GPU resources. GPU sharing can dramatically improve utilization and flexibility for high‑performance computing.

mGPU is Volcano Engine’s container‑focused solution that virtualizes NVIDIA GPUs at the kernel level and provides a shared‑GPU framework. It enables multiple containers to share one GPU card while isolating compute and memory resources, reducing costs without sacrificing performance.

Technical Architecture

The overall architecture consists of three core components:

mGPU Kernel Module: intercepts container calls to the NVIDIA driver (open, close, mmap, ioctl, poll) to control compute and memory allocation.

mGPU Container Runtime Hook: hijacks the pre‑start hook of nvidia‑container‑runtime, parses container configuration from environment variables, and sends container creation requests to the mGPU Daemon via RPC.

mGPU Daemon: acts as an RPC server and registers containers through the kernel module's ioctl interface.
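The interaction between these three components can be sketched in user space. The environment variable names, the in-process "daemon", and all method names below are illustrative assumptions; the real mGPU hook and daemon communicate over RPC and register containers through a kernel-module ioctl, which is reduced here to a dictionary update.

```python
# Hypothetical env var names -- mGPU's real variable names are not shown in the article.
ENV_MEM_LIMIT = "MGPU_MEMORY_LIMIT_MIB"
ENV_COMPUTE_WEIGHT = "MGPU_COMPUTE_WEIGHT"


def parse_container_config(env):
    """Runtime-hook step: read GPU limits from the container's environment."""
    return {
        "mem_limit_mib": int(env.get(ENV_MEM_LIMIT, 0)),
        "compute_weight": int(env.get(ENV_COMPUTE_WEIGHT, 0)),
    }


class MgpuDaemon:
    """Stands in for the RPC server; ioctl-based registration with the
    kernel module is modeled as a dict update."""

    def __init__(self):
        self.registry = {}

    def register_container(self, container_id, config):
        # The real daemon would issue an ioctl on the mGPU kernel module here.
        self.registry[container_id] = config
        return True


def prestart_hook(container_id, env, daemon):
    """Hijacked pre-start hook: parse config, then 'RPC' to the daemon."""
    config = parse_container_config(env)
    return daemon.register_container(container_id, config)
```

For example, a container started with `MGPU_MEMORY_LIMIT_MIB=4096` and `MGPU_COMPUTE_WEIGHT=30` would end up registered in the daemon with those limits before any GPU call reaches the driver.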

mGPU architecture diagram

Compute Isolation

GPU tasks are initialized by the GPU kernel driver, creating a push buffer (queue) that is encapsulated as a channel and grouped into a Time Slice Group (TSG). The scheduler selects channels from TSGs for execution, enabling fine‑grained time‑slice sharing. mGPU implements two schedulers:

Hardware‑time‑slice scheduler: intercepts ioctl calls that set hardware time slices and scales them proportionally.

Software‑time‑slice scheduler: creates a kernel thread per GPU and dynamically enables or disables channels based on configured compute weights, achieving precise QoS.
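The enable/disable idea behind the software scheduler can be sketched in user space as weighted round-robin over scheduler ticks. The real implementation is a kernel thread toggling hardware channels; the container names, weights, and tick loop here are illustrative.

```python
def schedule_ticks(weights, total_ticks):
    """Distribute scheduler ticks across containers in proportion to their
    configured compute weights, mimicking per-round channel enable/disable.

    weights: dict of container_id -> compute weight (positive ints)
    Returns dict of container_id -> ticks granted.
    """
    total_weight = sum(weights.values())
    granted = {cid: 0 for cid in weights}
    for _ in range(total_ticks):
        # Grant the tick to the container whose received share lags its
        # weighted share the most (largest deficit), so long-run shares
        # converge to the configured weights.
        cid = max(
            weights,
            key=lambda c: weights[c] / total_weight - granted[c] / total_ticks,
        )
        granted[cid] += 1
    return granted
```

With weights 3:1 over 100 ticks, the first container ends up with 75 ticks and the second with 25, matching the configured compute-weight ratio.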

Memory Isolation

CUDA memory‑management APIs are unified into ioctl operations on the nvidiactl character device. mGPU creates a virtual GPU card for each container, intercepting allocation, release, and query requests. It enforces per‑container memory limits, returns OOM when a limit is exceeded, records allocation metadata, and forwards legitimate requests to the NVIDIA kernel driver.
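The per-container accounting can be sketched in user space. The real mGPU does this inside the kernel module against ioctl requests on nvidiactl; the class and method names below are illustrative, and forwarding to the NVIDIA driver is reduced to a comment.

```python
class VirtualGpu:
    """Per-container virtual GPU card: tracks allocations against a limit
    and rejects requests that would exceed it -- the container sees OOM."""

    def __init__(self, mem_limit_bytes):
        self.mem_limit = mem_limit_bytes
        self.used = 0
        self.allocations = {}  # handle -> size, i.e. allocation metadata
        self._next_handle = 1

    def alloc(self, size):
        if self.used + size > self.mem_limit:
            return None  # real module returns an OOM error to the ioctl caller
        handle = self._next_handle
        self._next_handle += 1
        self.allocations[handle] = size
        self.used += size
        # A legitimate request would now be forwarded to the NVIDIA kernel
        # driver; in this sketch the bookkeeping is the whole story.
        return handle

    def free(self, handle):
        self.used -= self.allocations.pop(handle)

    def mem_info(self):
        """Query path: report the container-scoped view, not the whole card."""
        return (self.mem_limit - self.used, self.mem_limit)  # (free, total)
```

A container with a 1 GiB limit that has 512 MiB allocated will see 512 MiB free from a memory query even if the physical card has far more available, and a further 1 GiB request fails with OOM.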

GPU task flow diagram

Performance Evaluation

Using a V100/32GB server for ResNet‑50 inference, the test shows that enabling mGPU virtually eliminates performance loss even when GPU utilization is saturated.

Performance comparison chart

Conclusion

mGPU, built on ByteDance’s cloud‑native experience, provides strong GPU isolation, fine‑grained scheduling, and near‑lossless performance, boosting resource utilization by over 50%. It helps enterprises meet the growing AI‑chip demand with cost‑effective, high‑performance compute.

cloud native · Resource Isolation · AI workloads · container runtime · GPU sharing
Written by ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.