
Optimizing GPU Virtual Machine Instance Creation Time in Public Cloud Environments

DiDi Cloud reduced GPU VM provisioning latency by over 90% through kernel‑level pre‑zeroing of idle pages, transparent huge pages, optimized VFIO DMA mapping, and boot‑sequence streamlining, making GPU instance creation faster than that of CPU‑only VMs and meeting strict latency requirements.

Didi Tech

Creating a GPU virtual machine (VM) instance in public clouds is generally slower than creating a CPU-only VM. While most users consider VM creation a low‑frequency operation and tolerate the delay, certain latency‑sensitive workloads require fast provisioning. This article analyzes the reasons behind the long creation time of GPU VMs and presents a series of optimizations implemented by DiDi Cloud.

The analysis starts from three key timestamps collected from Libvirt, QEMU, and the guest kernel logs: (a) when Libvirt launches the QEMU process, (b) when Libvirt resumes the vCPUs, and (c) when the guest kernel prints its first log line. The interval between (a) and (b) is defined as the QEMU initialization time, and the interval between (b) and (c) as the BIOS execution time. Measurements on a VM with 8 vCPUs show that adding a single P40 GPU card significantly increases both intervals compared to a CPU‑only VM, and the impact grows with more GPUs and larger memory configurations.

Further investigation using perf sampling and flame graphs reveals that the dominant hotspot is the memory allocation and zeroing performed by the vfio_dma_map function, which handles the VFIO_IOMMU_MAP_DMA ioctl. This operation pins every RAM page allocated to the VM; if a page is not yet backed by physical memory, the kernel allocates it and clears it for security reasons. The amount of memory to be pinned and the size of each GPU's MMIO region directly determine the total latency.

One proposed mitigation is to mark RAM pages that have already been cleared, allowing the kernel to skip the zeroing step and let QEMU perform it with multiple threads and more efficient instructions. However, this approach raises security concerns and yields limited benefit for VMs with few vCPUs.

DiDi Cloud adopted a different strategy: modify the host kernel's memory management to pre‑zero idle physical pages while the system is idle. Cleared pages are flagged, and subsequent pinning can bypass the zeroing step. This improves both page‑fault handling (especially for transparent huge pages) and workloads that require pinned memory, such as RDMA or QAT acceleration. The corresponding patches have been submitted to the Linux kernel community.

Another effective technique is enabling transparent huge pages, which reduces the number of page faults and accelerates pinning. Experiments show that combining transparent huge pages with the pre‑zeroing mechanism dramatically shortens QEMU initialization time, although BIOS execution remains relatively long.

Additional optimizations target the VFIO DMA mapping path: batched page‑table lookups and a management layer inside QEMU that avoids repeated mapping and unmapping of IOVA regions. These changes further reduce the DMA mapping overhead.

Beyond the DMA path, the team streamlined other parts of the VM boot sequence, such as disabling the BIOS boot menu, optimizing the VFIO PCI device reset flow, and removing unnecessary steps for GPU instances. As a result, the total time spent in the virtualization layer for GPU VMs decreased by more than 90% for small‑memory, single‑GPU configurations and by over 95% for large‑memory, multi‑GPU setups.
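Of these, disabling the boot menu corresponds to a standard libvirt domain‑XML setting (shown as a sketch; the VFIO reset‑flow and GPU‑specific changes are custom modifications with no public knob):

```xml
<os>
  <type arch='x86_64' machine='pc'>hvm</type>
  <!-- Skip the interactive BIOS boot menu during guest firmware startup -->
  <bootmenu enable='no'/>
</os>
```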

After all optimizations, creating a GPU instance on DiDi Cloud is now faster than creating a comparable CPU instance, meeting the stringent latency requirements of demanding applications.

Note: DiDi Cloud offers a free enterprise trial of its GPU product; interested users can reply “GPU” to the official WeChat account to receive access.

Tags: performance optimization · cloud computing · GPU · Linux kernel · virtualization · QEMU