How mGPU Enables Efficient GPU Sharing for AI Workloads
This article explains the mGPU solution that virtualizes NVIDIA GPUs for containers, detailing its driver architecture, compute and memory isolation mechanisms, performance benchmarks on ResNet‑50 inference, and how it boosts GPU utilization by over 50% for AI and high‑performance computing tasks.
Introduction
The rise of large AI models pushes the limits of computing resources, demanding flexible and cost‑effective GPU utilization. mGPU, a container‑level GPU sharing solution from Volcano Engine, enables multiple containers to share a single GPU with fine‑grained compute and memory scheduling while maintaining strict isolation.
Technical Architecture
mGPU consists of a kernel module, a container runtime hook, and a daemon. The kernel module intercepts container calls to the NVIDIA driver (open, close, mmap, ioctl, poll) to control compute and memory resources. The runtime hook captures pre‑start hook calls from the NVIDIA container runtime, extracts configuration from environment variables, and sends container creation requests to the daemon via RPC. The daemon registers containers through the kernel module’s ioctl interface.
Component Overview
1. mGPU Kernel Module intercepts container interactions with the NVIDIA driver to enforce compute and memory control.
2. mGPU Container Runtime Hook hijacks the nvidia‑container‑runtime‑hook/pre‑start call, parses container GPU configuration, and forwards a creation request to the mGPU daemon.
3. mGPU Daemon acts as an RPC server, receiving container creation requests and registering containers through the kernel module’s ioctl interface.
Implementation Principles
Compute Isolation
GPU tasks are submitted through push buffers (command queues); each queue is backed by a channel, and channels are grouped into Time Slice Groups (TSGs). The hardware scheduler selects channels to run based on TSG time slices, which allows the GPU to be time-sliced among tasks. mGPU implements two schedulers:
Hardware time-slice scheduler: intercepts the ioctl calls that set hardware time slices, scales the slices proportionally to each container's compute share, and forwards the adjusted parameters to the native driver.
Software time-slice scheduler: creates a kernel thread per GPU that dynamically enables or disables container channels according to their assigned compute weights, achieving precise QoS.
Memory Isolation
CUDA memory management APIs are funneled through the nvidiactl character device. mGPU creates a virtual GPU card for each container, intercepting allocation, release, and query requests in the kernel module:
If an allocation exceeds the container’s quota, OOM is returned; otherwise the allocation is recorded and forwarded to the NVIDIA driver.
On free, the module releases the recorded memory and forwards the request.
On query, the module returns memory usage limited to the container’s isolation boundaries.
Performance Evaluation
In ResNet-50 inference benchmarks on a V100 (32 GB) server, enabling mGPU has a negligible performance impact: the GPU reaches full load with almost no loss of throughput.
Conclusion
Generative AI drives a surge in demand for high‑performance AI chips. Shared GPU technology like mGPU can increase resource utilization by more than 50% while providing stable, cost‑effective compute, helping enterprises build a robust, cloud‑native heterogeneous computing ecosystem for the AI era.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.