How GPU Virtualization Powers AI and Cloud Computing: Techniques, Challenges, and Future Directions
This article examines the rapid rise of GPU virtualization as a solution for efficient GPU resource utilization in AI, big data, and high‑performance computing, detailing its concepts, implementation methods across user, kernel, and hardware layers, Kubernetes integration, real‑world use cases, challenges, and emerging research trends.
Introduction
With the rapid development of artificial intelligence, big-data analytics, deep learning, and high-performance computing, demand for compute-intensive workloads has surged, making GPUs an essential resource. However, GPUs are expensive and complex to manage, especially in cloud environments. GPU virtualization addresses these challenges by enabling multiple applications to share physical GPUs, improving utilization, reducing hardware costs, and providing flexible scheduling.
GPU Virtualization Basics
GPU virtualization abstracts a physical GPU into multiple logical GPUs, allowing virtual machines or containers to share a single device. Main approaches include:
Pass-Through (Direct Assignment): The entire GPU is assigned to one VM or container, offering high performance but no sharing.
Sharing: The GPU is partitioned into logical units for concurrent use, suitable for workloads with moderate performance needs.
Full Virtualization: Software emulates GPU hardware, enabling isolated virtual GPUs at the cost of performance.
GPU Pooling: Multiple GPUs are managed as a unified pool, allowing dynamic allocation and scaling.
Development History
GPU virtualization has evolved from simple partitioning to arbitrary slicing, remote invocation, and resource pooling, each step improving utilization and flexibility.
Application Scenarios
Cloud computing – flexible, high‑performance GPU resources for AI training and inference.
Deep learning – accelerated model training and inference.
Data analysis – faster processing of large datasets.
Graphics rendering – high‑performance rendering for VR and visual effects.
Key Technologies
1. User‑Space Virtualization
API Interception and Forwarding: A user-space library (e.g., libwrapper) intercepts an application's calls to the GPU driver, forwards them to the real driver, and returns the results. The flow is: the application calls into libwrapper; the wrapper intercepts the call and parses its arguments; it dynamically loads the underlying vendor library with dlopen, invokes the original function, and hands the result back to the application.
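To make that flow concrete, here is a minimal sketch of such a wrapper in C, assuming a hypothetical vendor entry point gpuLaunchKernel exported by libgpu.so.1 (both names are invented for illustration; a real libwrapper would cover the full driver API). Built as a shared library and injected with LD_PRELOAD, it resolves the real symbol with dlopen/dlsym, forwards the call, and returns the result:

```c
/* wrapper.c — minimal API-interception sketch (hypothetical libgpu API).
 * Build:  gcc -shared -fPIC -o libwrapper.so wrapper.c -ldl
 * Use:    LD_PRELOAD=./libwrapper.so ./app
 */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Signature of the intercepted call; invented for illustration only. */
typedef int (*gpu_launch_fn)(const char *kernel, int grid, int block);

int gpuLaunchKernel(const char *kernel, int grid, int block)
{
    static gpu_launch_fn real = NULL;

    if (!real) {
        /* Load the real vendor library and resolve the original symbol. */
        void *handle = dlopen("libgpu.so.1", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "libwrapper: %s\n", dlerror());
            exit(1);
        }
        real = (gpu_launch_fn)dlsym(handle, "gpuLaunchKernel");
        if (!real) {
            fprintf(stderr, "libwrapper: %s\n", dlerror());
            exit(1);
        }
    }

    /* Policy hooks go here: argument parsing, quota checks, accounting. */
    fprintf(stderr, "libwrapper: gpuLaunchKernel(%s, %d, %d)\n",
            kernel, grid, block);

    /* Forward to the real driver library and hand the result back. */
    return real(kernel, grid, block);
}
```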
Remote API Forwarding: Enables GPU calls to be forwarded to a remote machine, allowing GPU resources to be pooled across hosts. The system consists of a client that forwards requests over the network and a server that executes the calls on the physical GPU.
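As a sketch of the client side, the fragment below marshals an intercepted call into a fixed-size record and sends it over TCP. The opcode values, wire format, and port are invented for illustration; a production system would add proper serialization, batching, and error handling:

```c
/* client.c — toy client for remote API forwarding (invented wire format,
 * not any production protocol). The client marshals an intercepted GPU
 * call and ships it to a server that replays it on a real GPU.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* One marshalled GPU call: an opcode plus fixed argument slots. */
struct gpu_call {
    uint32_t opcode;     /* e.g., 1 = launch kernel (illustrative) */
    uint32_t grid;
    uint32_t block;
    char     kernel[64]; /* kernel name, NUL-terminated */
};

/* Forward one call to the GPU server; returns the server's status code. */
int forward_call(const char *server_ip, const struct gpu_call *call)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(9999);  /* illustrative port */
    inet_pton(AF_INET, server_ip, &addr.sin_addr);

    int32_t status = -1;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        write(fd, call, sizeof(*call)) == sizeof(*call)) {
        /* The server executes the call on the physical GPU and
         * replies with the original function's return code. */
        if (read(fd, &status, sizeof(status)) != sizeof(status))
            status = -1;
    }
    close(fd);
    return status;
}
```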
Resource Pooling: Combines multiple GPUs into a single pool with dynamic scheduling and API interfaces for Software-Defined Data Center (SDDC) integration.
2. Kernel‑Space Virtualization
GPU Driver Interception: A kernel module creates a virtual device (e.g., /dev/fakegpu) that intercepts driver calls, enabling containerized applications to use the GPU without modification.
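A skeleton of such a module is sketched below: it registers a misc character device that appears as /dev/fakegpu, with the ioctl handler as the interception point. The forwarding and isolation logic a real implementation would need is elided:

```c
/* fakegpu.c — skeleton kernel module exposing /dev/fakegpu.
 * Illustrative only: the ioctl handler is where a real implementation
 * would validate, account for, and forward requests to the vendor driver.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>

static long fakegpu_ioctl(struct file *filp, unsigned int cmd,
                          unsigned long arg)
{
    /* Interception point: parse cmd/arg, enforce per-container quotas,
     * then forward to the real GPU driver. Elided in this sketch. */
    pr_info("fakegpu: intercepted ioctl cmd=0x%x\n", cmd);
    return 0;
}

static const struct file_operations fakegpu_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = fakegpu_ioctl,
};

static struct miscdevice fakegpu_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "fakegpu",          /* appears as /dev/fakegpu */
    .fops  = &fakegpu_fops,
};

static int __init fakegpu_init(void)
{
    return misc_register(&fakegpu_dev);
}

static void __exit fakegpu_exit(void)
{
    misc_deregister(&fakegpu_dev);
}

module_init(fakegpu_init);
module_exit(fakegpu_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Skeleton virtual GPU device for interception demos");
```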
Para-Virtualization: Uses a hypervisor to intercept GPU driver calls and forward them to the host, allowing VMs to share GPU resources with reduced overhead.
3. Hardware-Level Virtualization
Requires hardware support such as CPU virtualization extensions (Intel VT-x, AMD-V, ARM VHE, the RISC-V H extension), an IOMMU (Intel VT-d, AMD-Vi, ARM SMMU), PCIe SR-IOV, and NVIDIA technologies such as vGPU, Multi-Instance GPU (MIG), and MIG-backed vGPU.
Full Virtualization / Pass‑Through GPU
Pass‑through assigns the entire GPU directly to a VM, providing near‑native performance but no sharing. It is ideal for workloads demanding maximum compute power.
NVIDIA Virtual GPU (vGPU)
vGPU divides a physical GPU into multiple virtual GPUs, each with allocated compute and memory resources. It is suited for cloud and virtual desktop scenarios.
NVIDIA MIG (Multi‑Instance GPU)
MIG splits a GPU into up to seven independent instances, each with dedicated compute, memory, and memory bandwidth, enabling fine-grained sharing for multi-tenant environments.
MIG‑vGPU
Combines MIG instances with vGPU to provide flexible GPU resources with stronger hardware isolation and more predictable performance than traditional time-sliced vGPU.
Technology Comparison
Comparison tables (omitted here) contrast user-space, kernel-space, and hardware-level virtualization techniques, as well as industry solutions from NVIDIA, AMD, and others.
GPU Virtualization in Containers with Kubernetes
Kubernetes manages GPU resources via the DevicePlugin framework. A device plugin registers with the kubelet and reports available GPUs over gRPC (ListAndWatch); the kubelet then advertises them to the API server. When a pod requests a GPU, the scheduler selects a suitable node, and the kubelet calls the plugin's Allocate method to bind the GPU device to the container.
An interaction diagram (image omitted) shows the flow between the device plugin, the kubelet, and the container runtime.
Kubernetes GPU Management with NVIDIA GPU Operator
The NVIDIA GPU Operator automates deployment of drivers, container runtimes, and monitoring components. It supports multiple usage modes; the two covered here are:
Full GPU (Dedicated Card): Assigns an entire GPU to a pod using the nvidia.com/gpu resource.
vGPU: Uses virtual GPUs with configurable scaling parameters (an illustrative configuration sketch follows this list):
deviceCoreScaling: Ratio of GPU compute allocated per vGPU (default 1; values above 1 over-commit compute).
deviceMemoryScaling: Ratio of GPU memory allocated per vGPU (default 1; values above 1 over-commit memory).
deviceSplitCount: Number of virtual GPUs that can be created from a single physical GPU (default 10).
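As referenced above, here is an illustrative fragment showing how these parameters might be set. The exact key paths depend on how the vGPU device plugin is deployed (e.g., as Helm values or plugin arguments), so treat the layout as an assumption; the defaults match the descriptions above:

```yaml
# Illustrative vGPU device-plugin settings; key paths are assumed and
# vary by deployment.
devicePlugin:
  deviceSplitCount: 10       # up to 10 vGPUs per physical GPU
  deviceMemoryScaling: 1     # 1 = no over-commit; >1 over-commits memory
  deviceCoreScaling: 1       # 1 = no over-commit; >1 over-commits compute
```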
Additional Kubernetes resources, such as a ServiceMonitor, can be enabled so that GPU metrics are collected by the Insight Agent for monitoring.
Demo Configuration
The example YAML below illustrates how to request full-GPU or vGPU resources in a pod specification, using the nvidia.com/gpu and nvidia.com/vgpu extended resources and the corresponding limits.
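A minimal sketch of such a pod spec, assuming an arbitrary CUDA base image (the image name is illustrative; the extended-resource names are those listed above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # illustrative
      resources:
        limits:
          nvidia.com/gpu: 1   # full-GPU mode: one dedicated card
          # vGPU mode would instead request the vGPU resource:
          # nvidia.com/vgpu: 1
```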
Current Technical Challenges
Compatibility: Different GPU vendors have varying drivers, APIs, and hardware architectures, making unified management difficult.
Resource Allocation and Scheduling: Achieving dynamic, demand-driven allocation while maintaining high utilization and performance in multi-tenant environments.
Isolation and Security: Ensuring strong isolation between tenants to prevent resource contention and data leakage.
Future Directions and Research Hotspots
Unified Heterogeneous Resource Management: Develop platforms that manage CPUs, GPUs, FPGAs, and other accelerators uniformly.
Multi-Tenant Isolation and Sharing: Enhance virtualization techniques for secure, efficient sharing among multiple users.
GPU Pooling Technologies: Advance pooling mechanisms for higher utilization and flexible scheduling.
Continued research in GPU virtualization will enable enterprises to deploy AI, big‑data, and high‑performance workloads more efficiently, supporting digital transformation and innovation.