Unlocking 13× IOPS: Inside UCloud’s High‑Performance SSD Cloud Disk Architecture
UCloud's redesigned SSD cloud disk raises IOPS 13-fold, cuts average latency to one-tenth, and supports larger capacities. The gains come from a two-layer IO path, 1 MB metadata shards, a multithreaded model, overload protection, and online migration, with RDMA/SPDK-based ultra-high-performance storage planned next.
Architecture Upgrade Highlights
UCloud launched a high-performance SSD cloud disk that delivers 13× higher IOPS, 3× better stability, and 10× lower average latency than ordinary cloud disks. The redesign began in October of the previous year with three aims: reduce latency and raise IO capability, reuse parts of the new architecture to make ordinary cloud disks more stable, and gradually introduce stack/kernel bypass for next-generation ultra-high-performance storage. Since deployment, more than 3,400 cloud-disk instances (800 TB in total) serve live users, with an average cluster load of 310,000 IOPS. The upgrade targets the following goals:
Remove the limitations of the original software architecture, which could not fully exploit the hardware.
Support SSD cloud disks with QoS guarantees that fully use the IOPS and bandwidth of NVMe physical disks; a single cloud disk can reach up to 24,000 IOPS.
Enable larger capacity disks, up to 32 TB and beyond.
Significantly reduce IO traffic hotspots.
Allow concurrent creation and mounting of thousands of disks.
Support online migration from the old architecture to the new SSD cloud disk.
New Architecture Practice
Improvement 1: IO Path Optimization
In the old architecture the IO path had three layers (Client, Proxy, Chunk). The new architecture collapses it to two layers: the Client directly accesses the Chunk layer, and routing is handled by the Client. This reduces read latency by 0.2‑1 ms per request and shortens write tail latency.
Improvement 2: Metadata Sharding
The new design adopts 1 MB metadata shards (instead of the previous 1 GB) to fully utilize cluster resources and evenly distribute IO hotspots across NVMe SSDs. To avoid metadata allocation failures caused by the massive number of tiny shards, the system loads allocated metadata into memory and allocates routes on‑demand during IO.
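The sharding idea can be illustrated with a minimal sketch. It assumes a byte-addressed logical disk and a hypothetical in-memory RouteTable that assigns a shard to a chunk server round-robin on first access; the names and the placement policy are illustrative and not taken from UCloud's implementation.

```python
SHARD_SIZE = 1 << 20  # 1 MB, the shard size stated in the article


def shard_count(capacity_bytes: int, shard_size: int = SHARD_SIZE) -> int:
    """Number of shards needed to cover a disk (ceiling division)."""
    return -(-capacity_bytes // shard_size)


def split_io(offset: int, length: int, shard_size: int = SHARD_SIZE):
    """Split a byte range into (shard_index, offset_in_shard, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        shard = offset // shard_size
        inner = offset % shard_size
        n = min(shard_size - inner, end - offset)
        pieces.append((shard, inner, n))
        offset += n
    return pieces


class RouteTable:
    """Hypothetical route table: shard index -> chunk server id."""

    def __init__(self, chunk_servers):
        self.chunk_servers = list(chunk_servers)
        self.routes = {}  # only shards that have seen IO get an entry
        self._next = 0

    def lookup(self, shard: int) -> str:
        # Allocate the route lazily on first IO (round-robin is an assumption).
        if shard not in self.routes:
            server = self.chunk_servers[self._next % len(self.chunk_servers)]
            self.routes[shard] = server
            self._next += 1
        return self.routes[shard]
```

Under these assumptions a 32 TB disk (taken as 32 × 2^40 bytes) needs 33,554,432 shards of 1 MB, compared with 32,768 shards of 1 GB, which is why pre-allocating every route up front becomes expensive and on-demand allocation is attractive.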
Improvement 3: Support for High‑Performance SSD Cloud Disks
NVMe SSDs provide orders‑of‑magnitude higher performance than mechanical disks, but software must be designed to exploit this capability. SSD disks offer QoS guarantees with IOPS = min{1200 + 30 × capacity, 24000}. A multithreaded model replaces the traditional single‑threaded approach, achieving up to 60 k IOPS per thread for writes and 80 k IOPS for reads; five threads can fully utilize NVMe bandwidth.
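As a worked example of the QoS formula, the sketch below evaluates it for a few disk sizes. The article does not state the unit of capacity; the code assumes gigabytes, and the function name qos_iops is illustrative.

```python
def qos_iops(capacity_gb: int) -> int:
    """IOPS limit from the article's formula: min(1200 + 30 * capacity, 24000).

    Assumption: capacity is expressed in GB.
    """
    return min(1200 + 30 * capacity_gb, 24000)


for size in (100, 500, 760, 2000):
    print(size, qos_iops(size))
```

Under that assumption the limit grows linearly until 760 GB, where 1200 + 30 × 760 = 24000, and stays at the 24,000 cap for any larger disk.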
Improvement 4: Overload Protection
For ordinary disks, queue depth is limited (32-128). When many cloud disks target the same HDD, IO submissions are delayed and fail more often, which triggers client-side retries and can eventually overload the system. The new architecture caps the size of the concurrent submission queue and balances IO across idle threads to prevent overload.
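One way to picture this kind of overload protection is a dispatcher that caps in-flight requests per worker and always picks the least-loaded worker. This is a minimal sketch under assumed names (Dispatcher, max_inflight) and an assumed reject-when-full policy, not a description of UCloud's code.

```python
class Dispatcher:
    """Tracks in-flight IO per worker thread and bounds the submission queue."""

    def __init__(self, num_threads: int, max_inflight: int):
        self.inflight = [0] * num_threads
        self.max_inflight = max_inflight  # hypothetical per-thread cap

    def submit(self):
        """Return the index of the least-loaded worker, or None if all are full."""
        idx = min(range(len(self.inflight)), key=self.inflight.__getitem__)
        if self.inflight[idx] >= self.max_inflight:
            return None  # caller should back off instead of piling on retries
        self.inflight[idx] += 1
        return idx

    def complete(self, idx: int) -> None:
        """Mark one request on worker idx as finished."""
        self.inflight[idx] -= 1
```

Rejecting early when every worker is at its cap keeps queues short, so latency stays bounded and retries do not amplify the load.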
Improvement 5: Online Migration
The migration process adds a Trans layer that performs dual writes (to old and new architectures) while data is moved. After all data is transferred, the QEMU connection switches to the new client, completing the online migration with minimal performance impact (target < 5% degradation).
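The dual-write flow can be sketched as follows. The sketch assumes whole-block writes, dict-like block stores for the old and new back ends, and a background copier that moves one block per step; the class and method names are hypothetical.

```python
class TransLayer:
    """Illustrative dual-write migration layer between two block stores."""

    def __init__(self, old: dict, new: dict, num_blocks: int):
        self.old, self.new = old, new
        self.pending = set(range(num_blocks))  # blocks not yet present in new
        self.switched = False

    def write(self, block: int, data: bytes) -> None:
        # Dual write while migrating; only the new store after the switch.
        self.new[block] = data
        if not self.switched:
            self.old[block] = data
        self.pending.discard(block)  # new store already holds the latest data

    def read(self, block: int):
        source = self.new if self.switched else self.old
        return source.get(block)

    def copy_step(self) -> None:
        """Background copy of one not-yet-migrated block."""
        if self.pending:
            block = self.pending.pop()
            if block in self.old:
                self.new[block] = self.old[block]

    def try_switch(self) -> bool:
        """Switch reads to the new store once every block has been moved."""
        if not self.pending:
            self.switched = True
        return self.switched
```

Because a full-block write lands in both stores, the copier never needs to revisit that block, and the switch can happen as soon as the pending set is empty.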
Future Work
Unlimited capacity expansion by partitioning logical disks across multiple storage sets.
Ultra‑high‑performance storage using RDMA, VHOST, and SPDK.
Adoption of zero‑copy, user‑mode polling, and stack bypass techniques to eliminate kernel‑mode overhead.
Summary
Over the past year, UCloud completely redesigned its cloud‑disk storage architecture, solving many limitations of the legacy system and achieving massive performance gains. After four months of public testing, both SSD and the new ordinary cloud disks launched in August, delivering up to 24 k IOPS per disk with high stability. Upcoming kernel/stack‑bypass solutions aim to deliver even higher performance in a public beta expected in December.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
