
GPU Container Virtualization for AI Heterogeneous Computing: Architecture and Best Practices

This article surveys GPU container virtualization for AI heterogeneous computing: the utilization challenges that motivate it, the history of GPU virtualization architectures, the main virtualization approaches, Baidu's dual-engine user-space and kernel-space design with its isolation and scheduling features, measured performance benefits, best-practice scenarios, and deployment guidance, closing with a technical Q&A.

Baidu Geek Talk

This article provides a comprehensive overview of GPU container virtualization technology for AI heterogeneous computing, covering challenges, architecture, implementation details, and best practices. The content is based on an InfoQ Open Class and includes a Q&A section at the end.

The article begins by highlighting the growing demand for AI computing power: model training requirements have doubled every 3.4 months since 2012, while actual GPU utilization in production environments remains below 30%. It identifies the key constraints on utilization: model characteristics, service SLA requirements, traffic patterns, optimization levels, and capacity redundancy.

The article then presents four typical utilization patterns observed in production: low average utilization, peak-valley fluctuations, short-term spikes, and periodic triggering. These patterns demonstrate the complexity of AI application scenarios and the need for flexible virtualization solutions.

A historical overview of GPU virtualization follows, tracing NVIDIA hardware from the early G80 (Tesla) architecture through Kepler, Pascal, Volta, Turing, and Ampere. The article then discusses the main virtualization approaches: API hooking (e.g., rCUDA), hardware-based solutions (NVIDIA GRID vGPU, MIG), and software-based implementations.

The core of the article focuses on Baidu's dual-engine GPU container virtualization architecture, which combines user-space and kernel-space isolation engines. The user-space engine uses API hooking to intercept CUDA calls and provide features like memory isolation, compute isolation, encoding/decoding isolation, priority preemption, memory oversubscription, and memory pooling. The kernel-space engine implements isolation through system call interception and provides memory isolation, compute isolation, and multiple scheduling algorithms (Fixed Share, Equal Share, Weight Share, Burst Weight Share).
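The user-space engine's memory isolation can be pictured as quota accounting applied at the interception point: the hook checks a per-container budget before forwarding the allocation to the real driver. The sketch below is hypothetical (class and method names are invented, and the real engine hooks CUDA driver calls rather than a Python method), but it shows the bookkeeping such a layer would perform:

```python
class GpuMemoryQuota:
    """Per-container GPU memory accounting, as an API-hooking layer
    might apply before forwarding an allocation to the real driver.
    Illustrative sketch only; not Baidu's implementation."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.used = 0

    def alloc(self, size: int) -> bool:
        # Reject the call instead of forwarding it when the quota would
        # be exceeded: this container sees an out-of-memory error while
        # other tenants sharing the same physical GPU are unaffected.
        if self.used + size > self.limit:
            return False
        self.used += size
        return True

    def free(self, size: int) -> None:
        self.used = max(0, self.used - size)


quota = GpuMemoryQuota(limit_bytes=4 * 1024**3)  # a 4 GiB slice of the GPU
assert quota.alloc(3 * 1024**3)       # fits within the quota
assert not quota.alloc(2 * 1024**3)   # would exceed the 4 GiB quota
```

Compute isolation works analogously, except the hook meters kernel-launch time slices rather than bytes.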

Performance evaluation shows that user-space virtualization with process fusion achieves superior tail latency compared to bare-metal and kernel-space approaches, particularly under high load. The article also discusses advanced features like remote GPU access, MPS (Multi-Process Service) optimization, priority preemption for online/offline task mixing, and time-sharing with memory swapping.
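Among the scheduling policies listed above, Weight Share is the easiest to illustrate: each container receives GPU time in proportion to its weight within a fixed scheduling window. A minimal sketch (the window length, container names, and function signature are made up for illustration; Burst Weight Share additionally lets a busy container borrow an idle neighbor's unused slice):

```python
def weight_share(weights: dict, window_ms: float) -> dict:
    """Split one scheduling window of GPU time across containers in
    proportion to their weights (Weight Share policy sketch)."""
    total = sum(weights.values())
    return {name: window_ms * w / total for name, w in weights.items()}


# Two containers sharing one GPU, weights 3:1, over a 100 ms window.
slices = weight_share({"online": 3, "offline": 1}, window_ms=100)
# slices["online"] == 75.0, slices["offline"] == 25.0
```

Equal Share is the special case where all weights are identical; Fixed Share pins each container to a static fraction regardless of demand.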

Best practices are presented for three common scenarios: shared mixing for low-utilization tasks, priority preemption for fluctuating workloads with short spikes, and time-sharing with memory swapping for intermittent compute tasks. The article concludes by mentioning that all these technologies are available on Baidu's AI heterogeneous computing platform (Baidu Baige) and can be deployed in both public and private clouds.
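The time-sharing-with-memory-swapping scenario above can be sketched as a simple hand-off: when one task's turn ends, its device memory is copied out to host memory so the next task can use the whole GPU, and copied back when it resumes. The toy model below (all names and sizes are hypothetical) captures only that bookkeeping, not the actual device-to-host transfers:

```python
class SwappableTask:
    """Toy model of time-sharing with memory swapping: only one task
    holds GPU memory at a time; the other's state lives on the host."""

    def __init__(self, name: str, mem_bytes: int):
        self.name = name
        self.mem_bytes = mem_bytes
        self.on_gpu = False


def switch_to(task, gpu_free, resident):
    """Swap the resident task out to host memory, then bring `task` in."""
    if resident is not None and resident.on_gpu:
        gpu_free += resident.mem_bytes   # swap out: device -> host copy
        resident.on_gpu = False
    if task.mem_bytes > gpu_free:
        raise MemoryError("task does not fit even on an empty GPU")
    gpu_free -= task.mem_bytes           # swap in: host -> device copy
    task.on_gpu = True
    return task, gpu_free


a = SwappableTask("train-a", mem_bytes=10)
b = SwappableTask("train-b", mem_bytes=12)
resident, free = switch_to(a, gpu_free=16, resident=None)
resident, free = switch_to(b, free, resident)  # a swapped out, b swapped in
assert not a.on_gpu and b.on_gpu and free == 4
```

The appeal of this pattern for intermittent workloads is that two tasks whose combined footprint exceeds GPU memory (10 + 12 > 16 here) can still share the device, paying only the swap latency at each switch.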

The Q&A section addresses technical questions about resource control mechanisms, NPU virtualization, coexistence of different virtualization approaches, scheduling extensions, and deployment requirements.

Tags: cloud native, resource optimization, containerization, performance evaluation, GPU virtualization, memory isolation, heterogeneous computing, AI computing, MPS, compute isolation
Written by Baidu Geek Talk

Follow us to discover more Baidu tech insights.