
Achieving 5× Faster KVM Hot‑Upgrade with Transparent PCI Passthrough

This article reviews a breakthrough from ByteDance's System Technologies & Engineering (STE) team that reduces KVM hot‑upgrade downtime from 1000 ms to 160 ms by transparently supporting PCI‑passthrough devices, detailing the underlying challenges, three key kernel/hypervisor improvements, proof‑of‑concept results, and the impact on cloud IaaS operations.

ByteDance SYS Tech

Background

As one of the most important foundations of cloud computing, KVM virtualization is widely deployed in modern data centers. Cloud service providers must keep both hardware and software serviceable, especially when handling hardware failures or applying security patches and feature updates without disrupting running virtual machines.

Achieving seamless live migration and hot‑upgrade is a complex systems problem because KVM involves many components (SR‑IOV, Linux kernel, QEMU, DPDK, KubeVirt, OpenStack, etc.) with intricate interfaces and internal states. Maintaining VM continuity while upgrading the host requires careful design or modification of these layers.

At the KVM Forum in September, ByteDance's System Technologies & Engineering (STE) team announced a novel solution that transparently supports PCI‑passthrough devices during KVM hot‑upgrade, cutting the minimum downtime from 1000 ms to 160 ms—a 6.25× improvement.

IOMMU State Preservation

PCI passthrough is common in data‑center KVM deployments, providing high‑performance I/O to VMs but adding operational complexity. Hot‑upgrade and migration compatibility is difficult because the hypervisor cannot directly see the state of passthrough devices. Existing approaches either serialize device state (requiring hardware support and adding latency) or attempt to keep the device unchanged, which can increase development cost and affect performance.

Technical Survey

PCI passthrough in KVM uses the VFIO‑PCI interface, an abstraction of IOMMU and PCI logic that allows user‑space VMMs like QEMU to map hardware resources directly into the guest. Two possible strategies for hot‑upgrade were examined:

Serialize the device state, back it up, and restore it after the upgrade. This works on some newer hardware but incurs significant downtime.

Leave the device state untouched during the upgrade, relying on kernel modifications to isolate and protect the state. Experiments showed that modest changes to the Linux kernel on Intel IOMMU can provide generic hot‑upgrade support for PCI‑passthrough devices without requiring special hardware.
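For context, under either strategy the device reaches the guest through the standard VFIO‑PCI binding on the host. A typical sequence looks like the following (the PCI address is illustrative):

```shell
# Unbind the device (example address 0000:06:00.0) from its current host driver
echo 0000:06:00.0 > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
# Mark it for the vfio-pci driver via driver_override
echo vfio-pci > /sys/bus/pci/devices/0000:06:00.0/driver_override
# Re-probe so vfio-pci claims the device
echo 0000:06:00.0 > /sys/bus/pci/drivers_probe
```

QEMU can then take ownership of the device through the corresponding VFIO group file under /dev/vfio/.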

Solution

The proposed solution consists of three improvements.

Improvement 1 – Static Page Allocation in the Hypervisor

Introduce static page allocation to preserve state during the kexec reboot. The approach uses a memmap‑based DAX allocation on the host side, creates a DevDax character device, and maps it into QEMU.

<code>ndctl create-namespace -m devdax</code>

QEMU is then started with a memory‑backend file pointing to the DevDax device:

<code>$qemu ... -object memory-backend-file,id=mem,size=2G,mem-path=/dev/dax1.0,share=on,align=2M -numa node,memdev=mem</code>

The allocated memory is later used by KVM to fill EPT page tables.
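Putting Improvement 1 together, the host‑side setup can be sketched end to end. Sizes, offsets, and device names below are illustrative assumptions, not the team's exact configuration:

```shell
# Kernel command line: carve out 4 GiB at offset 16 GiB as a pmem-style region
# (standard memmap=<size>!<offset> syntax) so its contents survive kexec:
#   memmap=4G!16G
# Expose the region as a DevDax character device (names are illustrative)
ndctl create-namespace --mode=devdax --region=region0
# Back guest RAM with the DevDax device
qemu-system-x86_64 \
  -object memory-backend-file,id=mem,size=2G,mem-path=/dev/dax0.0,share=on,align=2M \
  -numa node,memdev=mem \
  ...
```

Because the region lies outside normal kernel memory management, the pages QEMU maps from it keep their contents across the kexec reboot, which is what allows KVM to rebuild the EPT page tables over unchanged guest memory.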

Improvement 2 – Kernel‑Space Static Allocation via KRAM

A new kernel module, KRAM, provides two APIs for allocating fixed physical pages:

<code>kram_get_fixed_page(area, index)</code>

<code>kram_alloc_page()</code>

A new E820 type reserves pages for KRAM (e.g., <code>memmap=*:*</code>). The Intel IOMMU driver is patched to use KRAM for allocating root and DMAR pages, replacing the original allocation calls.

Improvement 3 – VFIO‑PCI Simplification

A new flag is added to the VFIO_GROUP_SET_FLAGS ioctl, allowing QEMU to skip VFIO‑PCI device initialization and reset during hot‑upgrade, thus preserving hardware state.

The code for improvements 2 and 3 will be open‑sourced later.

Proof‑of‑Concept Verification

Experiments were performed on Intel CPUs in a QEMU nested‑virtualization environment, using KVM's nested virtualization together with QEMU's virtual IOMMU (vIOMMU) support.

<code>$qemu -machine q35 -device intel-iommu,intremap=on -device e1000e,netdev=guestnet</code>

A virtual e1000e NIC was passed through to a second‑level VM:

<code>$qemu ... -device vfio-pci,addr=06.0,host={dev}</code>

The upgrade flow consisted of <code>cpr-save</code> → kexec → QEMU start → restore. Measured downtime from VM pause to resumed operation was 159 ms, with the NIC remaining active and no packet loss. In contrast, the mainline kernel/QEMU path using <code>savevm</code>/<code>loadvm</code> incurred more than 1000 ms of downtime and could not handle VFIO‑PCI devices at all.
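The flow above maps onto the checkpoint/restart (CPR) commands from the QEMU live‑update patch series. A hedged sketch follows; paths are illustrative and the exact monitor syntax varies across versions of that series:

```shell
# 1. In the QEMU monitor, checkpoint VM state to storage that survives kexec:
#      (qemu) cpr-save /var/run/vm.cpr restart
# 2. Load the new host kernel and jump into it
kexec -l /boot/vmlinuz-new --initrd=/boot/initrd-new --reuse-cmdline
kexec -e
# 3. After the new kernel boots, restart QEMU with the same arguments
#    (same memory-backend-file and vfio-pci options), then restore:
#      (qemu) cpr-load /var/run/vm.cpr
```

With Improvements 1–3 in place, steps 2–3 never touch the passthrough device or its IOMMU mappings, which is what keeps the NIC live across the reboot.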

Conclusion

By deploying the modified host kernel and QEMU with the three improvements, hot‑upgrading the host kernel while preserving VFIO‑PCI passthrough devices can be completed in roughly 160 ms. This technique is highly valuable for public and private‑cloud IaaS scenarios, reducing operational cost, enhancing security, and improving VM performance and user experience. The STE team will continue to optimize the Linux kernel and virtualization stack and share further advances in related areas such as Virtio standards, QEMU hot‑upgrade, Linux boot time, io_uring, and kexec.

Tags: cloud computing, virtualization, hot upgrade, KVM, PCI passthrough
Written by

ByteDance SYS Tech

Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.
