How SPDK Boosts Cloud Disk I/O Performance: Hot Upgrade & Online Migration
This article explains how UCloud used the Storage Performance Development Kit (SPDK) to optimize the virtualized I/O path for cloud disks. It covers the SPDK vhost architecture, virtio vring basics, hot upgrade, online migration, and an evaluation of io_uring. With SPDK, RSSD cloud disks reach up to 1.2 million IOPS at much lower latency.
Introduction
Demand for ultra-high concurrency and massive-scale computing has driven storage hardware forward, and storage clusters keep getting faster with lower latency. In cloud-disk scenarios, however, the path an I/O request takes from the guest that issues it to the backend storage and back is complex, and the virtualized part of that path can become the performance bottleneck. We applied SPDK to optimize this path, developed a hot-upgrade and online-migration solution on top of open-source SPDK, and achieved up to 1.2 million IOPS for RSSD cloud disks.
How SPDK vhost Works
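SPDK (Storage Performance Development Kit) provides a set of user-space libraries for building high-performance, scalable storage applications. Its core is a poll-mode, asynchronous, lock-free NVMe driver that gives user-space applications zero-copy, highly parallel access to SSDs. SPDK vhost builds on these libraries: it runs as a separate process that serves virtio block or SCSI devices to QEMU guests over the vhost-user protocol, taking the I/O data path out of QEMU.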
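Virtio vring Basics
Each virtio queue (vring) consists of three structures shared between the guest's front-end driver and the device back end: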
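descriptor (desc) table: an array of descriptors, each holding the guest-physical address, length, and flags of one data buffer. A request can span several descriptors chained through their next field, and unused descriptors are linked into a free list.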
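available ring: a circular array, written by the front-end driver, holding the head indices of descriptor chains that are ready for the back end to process.
used ring: a circular array, written by the back end, recording the head index of each completed request (and the number of bytes written) so the front-end driver can reclaim its descriptors.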
During initialization, QEMU's vhost-user front end passes SPDK the guest memory layout and the addresses of these vring structures. SPDK then continuously polls the available ring, processes new requests, posts completions to the used ring, and notifies the guest's virtio front-end driver through an eventfd.
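To make the three structures and the polling loop concrete, here is a toy model of a split virtqueue in Python. It is a simplified sketch, not SPDK or QEMU code: the class and method names (SplitVring, add_request, poll, reclaim) are hypothetical, ring indices are plain Python integers rather than the 16-bit free-running counters of the virtio spec, and memory barriers, event suppression, and indirect descriptors are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

VRING_DESC_F_NEXT = 1  # "this descriptor continues via .next" (virtio flag value)


@dataclass
class Desc:
    addr: int = 0    # guest-physical address of one data buffer
    length: int = 0  # buffer length in bytes
    flags: int = 0
    next: int = 0    # next descriptor in a chain, or in the free list


class SplitVring:
    """Toy split virtqueue: descriptor table, available ring, used ring."""

    def __init__(self, size: int):
        self.size = size
        self.desc = [Desc(next=i + 1) for i in range(size)]
        self.free_head, self.num_free = 0, size   # free list threaded through .next
        self.avail: List[int] = [0] * size        # written by the guest driver
        self.avail_idx = 0
        self.used: List[Tuple[int, int]] = [(0, 0)] * size  # written by the back end
        self.used_idx = 0
        self.last_avail_idx = 0  # back end's private cursor into the avail ring
        self.last_used_idx = 0   # driver's private cursor into the used ring

    # ---- front-end (guest driver) side ----
    def add_request(self, buffers: List[Tuple[int, int]]) -> int:
        """Chain one descriptor per buffer and publish the head on the avail ring."""
        if not buffers or len(buffers) > self.num_free:
            raise RuntimeError("not enough free descriptors")
        head = idx = self.free_head
        for i, (addr, length) in enumerate(buffers):
            d = self.desc[idx]
            d.addr, d.length = addr, length
            last = i == len(buffers) - 1
            d.flags = 0 if last else VRING_DESC_F_NEXT
            if not last:
                idx = d.next
        self.free_head = self.desc[idx].next
        self.num_free -= len(buffers)
        self.avail[self.avail_idx % self.size] = head
        self.avail_idx += 1
        return head

    def reclaim(self) -> List[Tuple[int, int]]:
        """Consume used-ring entries and return their chains to the free list."""
        done = []
        while self.last_used_idx != self.used_idx:
            head, written = self.used[self.last_used_idx % self.size]
            self.last_used_idx += 1
            idx, n = head, 1
            while self.desc[idx].flags & VRING_DESC_F_NEXT:
                idx, n = self.desc[idx].next, n + 1
            self.desc[idx].next = self.free_head
            self.free_head = head
            self.num_free += n
            done.append((head, written))
        return done

    # ---- back-end (SPDK vhost) side ----
    def poll(self, handle: Callable[[List[Tuple[int, int]]], int]) -> int:
        """One poller pass: drain the avail ring and post results to the used ring."""
        processed = 0
        while self.last_avail_idx != self.avail_idx:
            head = self.avail[self.last_avail_idx % self.size]
            self.last_avail_idx += 1
            bufs, idx = [], head
            while True:
                d = self.desc[idx]
                bufs.append((d.addr, d.length))
                if not d.flags & VRING_DESC_F_NEXT:
                    break
                idx = d.next
            self.used[self.used_idx % self.size] = (head, handle(bufs))
            self.used_idx += 1
            processed += 1
        # A real back end would now write to the call eventfd to notify the guest.
        return processed
```

When the avail ring holds nothing new, poll simply returns 0 and the back end calls it again; this busy polling is why SPDK dedicates CPU cores to its pollers instead of waiting for interrupts.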
Performance Comparison
Queue depth 1 (iodepth=1, numjobs=1)
[Figures: fio latency results for the QEMU cloud-disk driver and for SPDK vhost]
Average latency dropped from about 130 µs to 7.3 µs.
Queue depth 128 (iodepth=128, numjobs=1)
[Figures: fio latency results for the QEMU cloud-disk driver and for SPDK vhost]
Average latency fell from about 3341 µs to 1090 µs, roughly one-third of the original.
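The two comparisons can be checked with a little arithmetic. The sketch below computes the latency ratios and, as an extra estimate that is not part of the measurements above, the throughput implied by Little's law (IOPS = iodepth / average latency). That estimate assumes fio keeps every queue slot busy at steady state, and the helper names are illustrative.

```python
def latency_ratio(before_us: float, after_us: float) -> float:
    """How many times lower the new average latency is."""
    return before_us / after_us


def implied_iops(iodepth: int, avg_latency_us: float) -> float:
    """Little's law (L = lambda * W) solved for throughput: lambda = L / W.

    Assumes `iodepth` requests are outstanding at steady state, so this is an
    estimate derived from the latency figures, not a measured IOPS value.
    """
    return iodepth / (avg_latency_us * 1e-6)


# Average latencies quoted in the text, in microseconds: (QEMU driver, SPDK vhost)
RESULTS = {1: (130.0, 7.3), 128: (3341.0, 1090.0)}

for qd, (qemu_us, spdk_us) in RESULTS.items():
    print(f"iodepth={qd}: latency {latency_ratio(qemu_us, spdk_us):.1f}x lower, "
          f"implied IOPS {implied_iops(qd, qemu_us):,.0f} -> {implied_iops(qd, spdk_us):,.0f}")
```

With the quoted figures, latency drops about 17.8x at iodepth 1 and about 3.1x at iodepth 128 (1090 / 3341 is about 0.33, the "one-third" above).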
SPDK Hot Upgrade
Early SPDK releases could not be hot-upgraded: when the SPDK process crashed or was restarted, guest I/O stalled. Our solution stores the status of every request on each vring in a shared-memory region allocated by QEMU. Because QEMU owns that memory, it survives an SPDK restart, so a new SPDK process can reconnect automatically, restore the vring state, and recover the I/Os that had not completed before the crash.
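The Python sketch below models the idea: a QEMU-owned shared-memory region holds one status byte per descriptor head on each vring, the back end sets it when it fetches a request and clears it when it posts the completion, and a restarted back end scans for requests still marked in flight. It is an illustration under stated assumptions, not UCloud's implementation: the layout, the InflightTable name, and its methods are hypothetical, and a real design must also close the window between posting to the used ring and clearing the flag. The upstream vhost-user protocol defines a comparable inflight-tracking feature (VHOST_USER_GET_INFLIGHT_FD / VHOST_USER_SET_INFLIGHT_FD).

```python
from multiprocessing import shared_memory
from typing import List

INFLIGHT, FREE = 1, 0


class InflightTable:
    """Per-vring request status in shared memory (one byte per descriptor head).

    Hypothetical layout. QEMU creates the region (create=True); each SPDK
    process attaches to it by name, so the recorded state outlives any single
    SPDK process.
    """

    def __init__(self, name: str, num_vrings: int, queue_size: int, create: bool = False):
        self.num_vrings, self.queue_size = num_vrings, queue_size
        nbytes = num_vrings * queue_size
        self.shm = shared_memory.SharedMemory(name=name, create=create,
                                              size=nbytes if create else 0)
        if create:
            self.shm.buf[:nbytes] = bytes(nbytes)  # every request starts FREE

    def _slot(self, vring: int, head: int) -> int:
        return vring * self.queue_size + head

    def mark_inflight(self, vring: int, head: int) -> None:
        """Back end fetched `head` from the avail ring and started the I/O."""
        self.shm.buf[self._slot(vring, head)] = INFLIGHT

    def mark_done(self, vring: int, head: int) -> None:
        """Back end posted `head` to the used ring."""
        self.shm.buf[self._slot(vring, head)] = FREE

    def unfinished(self, vring: int) -> List[int]:
        """Heads a restarted back end must resubmit for this vring."""
        base = vring * self.queue_size
        return [h for h in range(self.queue_size)
                if self.shm.buf[base + h] == INFLIGHT]

    def detach(self) -> None:
        """Unmap this process's view; the region itself stays alive."""
        self.shm.close()
```

In use, QEMU creates the table once per device, the running SPDK process attaches and updates it on every fetch and completion, and after a crash the replacement process attaches to the same name and calls unfinished() on each vring to find the I/Os to resubmit.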
SPDK Online Migration
SPDK supports migrating device state, but not online migration of the data plane (the disk contents), which QEMU provides for its regular block devices. To close this gap, we implemented a dedicated data-flow I/O path in QEMU and placed a bitmap in shared memory to track dirty blocks, so that blocks written during migration can be sent again. The bitmap is protected by a robust pthread mutex, so the lock can still be recovered if the process holding it crashes. With this architecture, block-level live migration works with SPDK as the backend.
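A dirty bitmap is what makes iterative block migration possible: copy the whole disk once, then repeatedly re-send only the blocks written in the meantime until few remain, and finish with a short stop-and-copy. The sketch below is a generic pre-copy loop in Python built on that assumption, not the QEMU/SPDK code described above; DirtyBitmap, guest_write, and precopy_migrate are hypothetical names. A plain threading.Lock stands in for the robust mutex. With a real robust pthread mutex (PTHREAD_MUTEX_ROBUST), if the owner dies while holding the lock, the next pthread_mutex_lock call returns EOWNERDEAD and the caller must call pthread_mutex_consistent before continuing, which is how the surviving process avoids a deadlock.

```python
import threading
from typing import Callable, List, Optional


class DirtyBitmap:
    """One bit per block; set whenever a block is written during migration."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)
        # Stand-in for the robust pthread mutex guarding the shared bitmap.
        # threading.Lock is NOT robust: it has no EOWNERDEAD recovery.
        self.lock = threading.Lock()

    def mark(self, block: int) -> None:
        with self.lock:
            self.bits[block // 8] |= 1 << (block % 8)

    def take_dirty(self) -> List[int]:
        """Return the dirty block numbers and clear them in one critical section."""
        with self.lock:
            dirty = [b for b in range(self.num_blocks)
                     if (self.bits[b // 8] >> (b % 8)) & 1]
            self.bits[:] = bytes(len(self.bits))
        return dirty


def guest_write(disk: List[bytes], bitmap: DirtyBitmap, block: int, data: bytes) -> None:
    """Hypothetical backend write path: update the block, then record it as dirty."""
    disk[block] = data
    bitmap.mark(block)


def precopy_migrate(src: List[bytes], dst: List[bytes], bitmap: DirtyBitmap,
                    max_rounds: int = 5, threshold: int = 0,
                    on_round: Optional[Callable[[int], None]] = None) -> int:
    """Iterative pre-copy. Dirty tracking must be active before round 0 starts."""
    pending = list(range(len(src)))   # round 0 copies the whole disk
    rounds = 0
    while rounds < max_rounds:
        for b in pending:
            dst[b] = src[b]
        if on_round:
            on_round(rounds)          # guest I/O keeps arriving while a round runs
        rounds += 1
        pending = bitmap.take_dirty()
        if len(pending) <= threshold:
            break
    # The guest would be paused here: copy whatever is still dirty (stop-and-copy).
    for b in set(pending) | set(bitmap.take_dirty()):
        dst[b] = src[b]
    return rounds
```

Note the order in guest_write: data first, dirty bit second. If the bit were set first, a migration round could clear it and copy the block before the new data lands, and the write would never be re-sent.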
SPDK io_uring Experience
io_uring, introduced in Linux 5.1, reduces system-call overhead by sharing submission and completion queue rings between user space and the kernel. SPDK 19.04 includes io_uring support in its bdev layer, but it is not enabled by default. To evaluate it, we:
Installed the latest liburing library and enabled io_uring in SPDK’s config.
Added RPC calls for creating io_uring bdevs, mirroring other bdev implementations.
Adjusted SPDK code to use io_uring_peek_cqe and io_uring_cqe_seen instead of the removed io_uring_get_completion.
Opened files with O_SYNC to ensure data durability and added read/write modes.
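The third step reflects the liburing completion-queue contract: io_uring_peek_cqe returns the oldest completion without blocking (or -EAGAIN if there is none), and io_uring_cqe_seen marks it consumed by advancing the CQ head. The Python model below mimics that contract to show how a poll-mode consumer drains completions. It does not call liburing; CompletionRing, peek_cqe, cqe_seen, and reap_completions are hypothetical stand-ins, and real ring indices are 32-bit counters shared with the kernel and accessed with memory barriers, which the model omits.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cqe:
    user_data: int  # cookie copied from the submission, identifies the request
    res: int        # bytes transferred, or a negative errno


class CompletionRing:
    """Model of an io_uring completion queue: the kernel advances tail, the app advances head."""

    def __init__(self, entries: int):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.cqes: List[Optional[Cqe]] = [None] * entries
        self.head = 0  # consumer index (application)
        self.tail = 0  # producer index (kernel)

    def kernel_post(self, user_data: int, res: int) -> None:
        """What the kernel does when an I/O finishes."""
        if self.tail - self.head == len(self.cqes):
            raise OverflowError("completion queue full")
        self.cqes[self.tail & self.mask] = Cqe(user_data, res)
        self.tail += 1

    def peek_cqe(self) -> Optional[Cqe]:
        """Mirrors io_uring_peek_cqe: never blocks; None stands in for -EAGAIN."""
        if self.head == self.tail:
            return None
        return self.cqes[self.head & self.mask]

    def cqe_seen(self) -> None:
        """Mirrors io_uring_cqe_seen: release the entry by advancing head."""
        self.head += 1


def reap_completions(ring: CompletionRing, max_events: int,
                     complete: Callable[[int, int], None]) -> int:
    """One poller pass: drain up to max_events completions from the shared ring."""
    reaped = 0
    while reaped < max_events:
        cqe = ring.peek_cqe()
        if cqe is None:
            break
        complete(cqe.user_data, cqe.res)
        ring.cqe_seen()
        reaped += 1
    return reaped
```

Because peeking only reads the shared ring, a poller like this can reap completions without entering the kernel, which fits SPDK's poll-mode design better than a blocking wait.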
In our tests, the io_uring bdev achieved roughly 20% higher IOPS and 10% lower latency than the traditional aio bdev, although the results are bounded by the limits of the underlying hardware.
Conclusion
Applying SPDK to the virtualized I/O path removes its performance bottleneck and lets UCloud's high-performance cloud-disk product fully exploit the capabilities of its backend storage. Integration brought challenges, including hot upgrade, crash recovery, and online migration, but we resolved many of these issues and contributed fixes back to the SPDK community, paving the way for further innovation.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
