
How GPU Virtualization Works: From User‑Space APIs to Hardware Isolation

This article explains why GPU virtualization is needed, compares resource‑sharing and isolation approaches, and details user‑level API interception, remote API forwarding, paravirtualization (semi‑virtualization) with virtio, kernel‑level driver interception, and hardware‑level solutions such as vGPU, MIG, and AMD MxGPU.

Architects' Tech Alliance

GPUs are traditionally used for graphics rendering on PCs and gaming consoles, but they also serve general‑purpose high‑performance computing (GPGPU) and video encoding and decoding. Because a single GPU often provides more capacity than one workload can use, virtualizing GPU resources enables sharing and isolation across multiple workloads.

Why GPU Virtualization Is Needed

Resource sharing: Avoid wasting GPU capacity that a single workload cannot fully use.

Resource isolation: Separate video memory and compute power for different tenants.

Isolation and Application Scenarios

Isolation contexts: containers, virtual machines.

Application contexts: virtual desktops, rendering farms, AI model training.

User‑Space Virtualization

Implementation relies on local API interception and forwarding:

Inject a user‑space function that mimics the underlying library API.

Use a wrapper library (libwrapper) to intercept the function calls, parse their parameters, invoke the real library, and return the results to the application.

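How the application links to the real library matters here. Static linking copies the required libraries into the executable, increasing its size but removing runtime dependencies; dynamic linking keeps the executable small but requires external shared libraries. Run‑time interception generally depends on dynamic linking, because calls that were resolved at build time cannot be redirected to a wrapper that is loaded later.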
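The interception pattern can be modelled in a few lines of Python. This is a minimal sketch, not a real GPU wrapper: RealLibrary and its methods mem_alloc and launch are hypothetical stand‑ins for a vendor library, and LibWrapper is an illustrative name. In native code the same effect is usually achieved with a shared library that is loaded ahead of the real one (for example through LD_PRELOAD on Linux).

```python
class RealLibrary:
    """Hypothetical stand-in for a vendor GPU library (not a real API)."""

    def mem_alloc(self, nbytes):
        return {"size": nbytes}

    def launch(self, kernel_name, *args):
        return f"ran {kernel_name} with {len(args)} args"


class LibWrapper:
    """Exposes the same method names as the real library and forwards each call."""

    def __init__(self, real):
        self._real = real
        self.call_log = []  # record of (function name, positional arguments)

    def __getattr__(self, name):
        # Only reached for names not defined on the wrapper itself.
        target = getattr(self._real, name)  # AttributeError if the API is unknown

        def intercepted(*args, **kwargs):
            # A real wrapper would inspect or rewrite parameters at this point.
            self.call_log.append((name, args))
            return target(*args, **kwargs)

        return intercepted
```

The application keeps calling lib.launch(...) as before; only the object behind the name changes, which is why this style of interception needs no change to application code.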

Remote API Forwarding

Remote forwarding enables GPU pooling, allowing machines without physical GPUs to access GPU functionality via the network. The typical flow is:

Client application (in a VM) calls a predefined API (e.g., REST).

The request reaches the client OS interface module.

The interface module packages and encodes the request, sending it over the network to the virtualization manager.

The manager (usually on the host) decodes, parses, and executes the corresponding action.

After execution, the manager encodes the response and sends it back.

The client OS module decodes the response and returns the result to the original application.

This method adds network overhead compared with paravirtualized (semi‑virtual) forwarding, but it requires no changes to the guest OS and is language‑agnostic.
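The encode, send, execute, and reply cycle described above can be sketched as follows. Everything here is an assumption made for illustration: the length‑prefixed JSON framing, the OPERATIONS table, and the vector_add operation are hypothetical, and a local socket pair stands in for the network link between the client module and the manager.

```python
import json
import socket
import struct
import threading


def send_msg(sock, obj):
    # Encode the message as JSON with a 4-byte big-endian length prefix.
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


# Hypothetical operations the manager executes on behalf of clients.
OPERATIONS = {
    "vector_add": lambda a, b: [x + y for x, y in zip(a, b)],
}


def manager_serve_one(sock):
    """Manager side: decode one request, execute it, encode the response."""
    request = recv_msg(sock)
    try:
        result = OPERATIONS[request["op"]](*request["args"])
        response = {"ok": True, "result": result}
    except Exception as exc:
        response = {"ok": False, "error": repr(exc)}
    send_msg(sock, response)


def remote_call(sock, op, *args):
    """Client side: package the call, send it, and unpack the reply."""
    send_msg(sock, {"op": op, "args": list(args)})
    response = recv_msg(sock)
    if not response["ok"]:
        raise RuntimeError(response["error"])
    return response["result"]


def demo():
    client_sock, manager_sock = socket.socketpair()
    worker = threading.Thread(target=manager_serve_one, args=(manager_sock,))
    worker.start()
    try:
        return remote_call(client_sock, "vector_add", [1, 2, 3], [10, 20, 30])
    finally:
        worker.join()
        client_sock.close()
        manager_sock.close()
```

Every call pays for serialization and a round trip, which is the overhead the paragraph above refers to; in exchange, the client needs nothing beyond the ability to open a connection.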

Paravirtualized (Semi‑Virtual) API Forwarding

In this model, the application and libwrapper run inside the VM, while a virtio front‑end is implemented in the guest kernel and a virtio back‑end runs on the host. Communication occurs via shared memory, reducing data copies. The steps are:

Guest OS issues a hypercall or VMCALL.

The paravirtualized interface layer in the guest intercepts the call and translates parameters.

The call triggers a VM exit, delivering the request to the hypervisor.

The hypervisor processes the request and returns a response.

The guest interface layer translates the response back for the application.

This approach offers faster response times because it avoids network latency, but it requires modifications to the guest OS.
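The shared‑memory exchange between front‑end and back‑end can be illustrated with a toy ring. This is a simplified model and does not follow the actual virtio ring layout: the slot size, the avail and used counters, and the multiply operation are all assumptions chosen to keep the sketch short. The point it shows is that both sides read and write the same buffer, so the request payload is never copied between them.

```python
import struct

SLOT_SIZE = 64   # bytes per request slot (arbitrary for this sketch)
NUM_SLOTS = 4


class SharedRing:
    """Toy model of a queue that front-end and back-end share."""

    def __init__(self):
        # Stands in for guest memory that the host can also access.
        self.memory = bytearray(SLOT_SIZE * NUM_SLOTS)
        self.avail = 0  # requests published by the front-end
        self.used = 0   # requests completed by the back-end

    def slot(self, index):
        start = (index % NUM_SLOTS) * SLOT_SIZE
        return memoryview(self.memory)[start:start + SLOT_SIZE]


class FrontEnd:
    """Guest side: writes requests into shared memory."""

    def __init__(self, ring):
        self.ring = ring

    def submit(self, a, b):
        if self.ring.avail - self.ring.used >= NUM_SLOTS:
            raise BufferError("ring is full")
        index = self.ring.avail
        struct.pack_into("<ii", self.ring.slot(index), 0, a, b)
        self.ring.avail += 1  # a real guest would now notify the host
        return index

    def result(self, index):
        # Must be read before the slot is reused by a later request.
        if index >= self.ring.used:
            raise LookupError("request not completed yet")
        (value,) = struct.unpack_from("<q", self.ring.slot(index), 8)
        return value


class BackEnd:
    """Host side: consumes requests in place and writes results back."""

    def __init__(self, ring):
        self.ring = ring

    def process_pending(self):
        done = 0
        while self.ring.used < self.ring.avail:
            view = self.ring.slot(self.ring.used)
            a, b = struct.unpack_from("<ii", view, 0)
            struct.pack_into("<q", view, 8, a * b)  # hypothetical "GPU" work
            self.ring.used += 1
            done += 1
        return done
```

Compared with the remote‑forwarding flow, there is no serialization onto a socket here; the cost moves to the guest‑to‑host notification, which is where the VM exit in the steps above occurs.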

Kernel‑Level Virtualization

Typical driver flow: application → user‑space library → kernel driver → hardware. Intercepting at the kernel level involves hooking the device file used by the driver. A kernel module can capture calls, parse parameters, and optionally isolate memory allocation or compute resources.

Kernel interception is well‑suited for container environments (shared host kernel) but more complex for VMs, where each VM has its own kernel.
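As a rough model of what such a kernel module does, the sketch below places a quota check between callers and the driver. It is user‑space Python, not kernel code, and every name in it is hypothetical: DeviceDriver stands in for the real driver behind the device file, the "ALLOC" and "FREE" commands stand in for driver‑specific ioctl codes, and container IDs are plain strings.

```python
class DeviceDriver:
    """Hypothetical stand-in for the real driver behind a device file."""

    def ioctl(self, command, arg):
        if command in ("ALLOC", "FREE"):
            return {"status": "ok", "bytes": arg}
        return {"status": "unsupported"}


class QuotaHook:
    """Sits in front of the driver and enforces per-container memory limits."""

    def __init__(self, driver, quota_bytes):
        self.driver = driver
        self.quota_bytes = quota_bytes  # container id -> memory limit in bytes
        self.in_use = {}                # container id -> bytes currently held

    def ioctl(self, container_id, command, arg):
        used = self.in_use.get(container_id, 0)
        if command == "ALLOC":
            limit = self.quota_bytes.get(container_id, 0)
            if used + arg > limit:
                # Reject before the request ever reaches the real driver.
                return {"status": "denied", "reason": "memory quota exceeded"}
            reply = self.driver.ioctl(command, arg)
            if reply["status"] == "ok":
                self.in_use[container_id] = used + arg
            return reply
        if command == "FREE":
            reply = self.driver.ioctl(command, arg)
            if reply["status"] == "ok":
                self.in_use[container_id] = max(0, used - arg)
            return reply
        # Commands the hook does not care about pass through unchanged.
        return self.driver.ioctl(command, arg)
```

Because all containers on a host go through the same kernel, one hook of this kind can account for every tenant; in a VM setup each guest kernel would need its own copy, which is the added complexity noted above.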

Hardware Virtualization Techniques

CPU virtualization extensions (such as Intel VT‑x and AMD‑V) and the IOMMU (Intel VT‑d, AMD‑Vi) provide the foundation for GPU isolation.

Full virtualization / GPU passthrough: Directly assign a whole GPU to a VM with minimal performance loss, but no sharing.

NVIDIA vGPU (software slicing): Shares one GPU among several VMs through a software layer; it incurs higher overhead and licensing costs, and video memory is statically partitioned.

NVIDIA MIG (hardware slicing): Divides a GPU into multiple hardware‑isolated instances; each MIG instance appears as an independent GPU and can be combined with containers.

NVIDIA MIG vGPU: Adds a software layer on top of MIG to allow multiple VMs to share a MIG instance, offering flexibility at the cost of extra overhead.

AMD MxGPU (SR‑IOV): Exposes the GPU as a Physical Function (PF) and a set of Virtual Functions (VFs). The PF controls the hardware, while each VF gives a VM virtualized access, enabling direct GPU usage without full hypervisor mediation.
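On Linux, the PF/VF split of an SR‑IOV device is visible in sysfs through the sriov_totalvfs and sriov_numvfs attributes of the physical function. The helper below is a small sketch that reads them; the function name read_sriov_state and the returned dictionary keys are illustrative, and whether a particular GPU and driver expose these files is an assumption to verify on the target host.

```python
from pathlib import Path


def read_sriov_state(device_dir):
    """Return VF counts for a PCI device directory, or None without SR-IOV."""
    device_dir = Path(device_dir)
    total_path = device_dir / "sriov_totalvfs"
    if not total_path.exists():
        return None  # the device (or its driver) does not expose SR-IOV
    total = int(total_path.read_text().strip())
    enabled = int((device_dir / "sriov_numvfs").read_text().strip())
    return {
        "total_vfs": total,               # maximum VFs the PF supports
        "enabled_vfs": enabled,           # VFs currently created
        "available_vfs": total - enabled, # VFs that could still be enabled
    }
```

A call would look like read_sriov_state("/sys/bus/pci/devices/0000:03:00.0"), where the PCI address is a placeholder for the physical function on the host in question.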

Comparison of GPU Virtualization Approaches

The original article included a diagram (not reproduced here) summarizing the trade‑offs among paravirtualization (semi‑virtualization), remote API forwarding, full virtualization, and hardware slicing solutions.

Overall, the choice of GPU virtualization method depends on performance requirements, desired level of resource sharing, and the target deployment environment (containers vs. VMs).
