Inside UCloud’s Cloud Disk Upgrade, Gray-Scale Networking, and Fast Snapshot Design
UCloud’s first Tech Talk covered three topics: a redesign of the cloud disk architecture that adds NVMe support, ServiceMesh‑based gray‑scale deployment for the virtual network (with programmable switches on the data plane), and a snapshot system that takes continuous backups at one‑second granularity and restores them quickly.
Topic 1: Cloud Disk Architecture Upgrade and Performance Boost
UCloud’s block storage team redesigned the underlying architecture of its cloud disks, improving ordinary disk performance while adding support for high‑performance NVMe storage.
IO Path Optimization
Previously, each IO request passed through three layers: the network layer, a proxy server that handled routing, caching, and replication, and finally the backend storage node. Every operation therefore needed two network hops. The new design splits up the proxy's responsibilities: the client now handles routing and caching and sends IO directly to the primary chunk. This shortens the path to two layers and cuts latency by 0.2‑1 ms on average.
Metadata Sharding
The original 1 GB shard size caused performance bottlenecks for hot‑spot workloads, because all IO to a hot region landed on the same shard. The new design reduces the shard size to 1 MB, which multiplies the number of metadata entries by roughly a thousand. To keep that growth from adding allocation overhead, the client and backend now compute routing rules locally, so metadata allocation and mounting proceed without central coordination.
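The locally computed routing described above can be illustrated with a minimal sketch. It assumes that a shard is identified by the disk ID plus the shard index and that the primary node is chosen by hashing that key over a fixed node list. The talk does not describe the real placement rule, so `route`, `split_io`, the SHA‑256 hash, and the node names are all hypothetical.

```python
import hashlib
from typing import List, Sequence, Tuple

SHARD_SIZE = 1 << 20  # 1 MB shards, as in the new design


def shard_index(offset: int) -> int:
    """Return the index of the shard that contains byte `offset`."""
    return offset // SHARD_SIZE


def route(disk_id: str, offset: int, nodes: Sequence[str]) -> str:
    """Pick a node for the shard holding `offset` with no central lookup.

    Assumption: placement is a pure function of (disk_id, shard index),
    so client and backend reach the same answer independently.
    """
    key = f"{disk_id}:{shard_index(offset)}".encode()
    digest = hashlib.sha256(key).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def split_io(offset: int, length: int) -> List[Tuple[int, int, int]]:
    """Split one IO into (shard, offset_in_shard, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        in_shard = offset % SHARD_SIZE
        n = min(SHARD_SIZE - in_shard, end - offset)
        pieces.append((shard_index(offset), in_shard, n))
        offset += n
    return pieces
```

A 1 KB write that straddles a shard boundary, for example `split_io(SHARD_SIZE - 512, 1024)`, is split into two 512‑byte pieces, each of which can be routed independently. A plain modulo over the node list remaps most shards when the list changes, so a production system would need a more stable placement scheme.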
NVMe High‑Performance Storage Support
NVMe uses the low latency and parallelism of PCIe to achieve read/write performance orders of magnitude higher than HDD. To make full use of it, UCloud switched from single‑threaded to multi‑threaded IO transmission with five threads. It also introduced memory and object pools, replaced lookups with array indexing, and reduced string comparisons to maximize throughput.
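The combination of a fixed set of IO threads, pooled buffers, and array‑indexed results can be sketched as follows. This only illustrates the pattern and is not UCloud's implementation: `BufferPool`, `run_io`, and the buffer sizes are hypothetical, and Python's global interpreter lock means the sketch shows the structure rather than the speedup a native implementation would see.

```python
import queue
import threading

NUM_IO_THREADS = 5  # the talk mentions five IO threads
BUF_SIZE = 4096     # assumed buffer size, for illustration only
POOL_SIZE = 64      # assumed number of pre-allocated buffers


class BufferPool:
    """Pre-allocated buffers that are reused instead of allocated per request."""

    def __init__(self, count, size):
        self._free = queue.Queue()
        for _ in range(count):
            self._free.put(bytearray(size))

    def acquire(self):
        return self._free.get()  # blocks if every buffer is in use

    def release(self, buf):
        self._free.put(buf)


def run_io(requests, handler):
    """Run handler(request, buffer) on NUM_IO_THREADS workers.

    Results are stored by integer index in a pre-sized list, so the
    workers never do string-keyed lookups on the hot path.
    """
    pool = BufferPool(POOL_SIZE, BUF_SIZE)
    results = [None] * len(requests)
    work = queue.Queue()
    for i, req in enumerate(requests):
        work.put((i, req))

    def worker():
        while True:
            try:
                i, req = work.get_nowait()
            except queue.Empty:
                return
            buf = pool.acquire()
            try:
                results[i] = handler(req, buf)
            finally:
                pool.release(buf)

    threads = [threading.Thread(target=worker) for _ in range(NUM_IO_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_io(list(range(100)), lambda req, buf: req * 2)` returns the doubled values in request order, because each worker writes to its own slot of the result list.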
Topic 2: Gray‑Scale Deployment in UCloud Virtual Network
Technical expert Xu Liang explained how ServiceMesh (based on Istio and Envoy) enables gray‑scale control‑plane deployment, and how programmable switches (Barefoot Tofino) provide gray‑scale capabilities on the data‑plane.
ServiceMesh for Control‑Plane Gray‑Scale
Instead of deploying two full APIGW stacks, ServiceMesh runs a local Envoy sidecar with a customized Pilot that can operate outside Kubernetes, allowing fine‑grained traffic routing based on custom headers, cookies, or account information.
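The routing decision that the sidecar makes can be sketched in a few lines. This models only the matching logic, not Envoy or Pilot configuration. The `GrayRule` type, the `x-gray` header, the `gray` cookie, and the upstream names `canary` and `stable` are hypothetical.

```python
from dataclasses import dataclass
from http.cookies import SimpleCookie


@dataclass(frozen=True)
class GrayRule:
    """Hypothetical gray-scale rule: any matching criterion selects the canary."""
    header: str = ""
    header_value: str = ""
    cookie: str = ""
    cookie_value: str = ""
    accounts: frozenset = frozenset()


def pick_upstream(headers, account_id, rule):
    """Return 'canary' if the request matches the rule, else 'stable'.

    `headers` is a dict with lower-case header names.
    """
    if rule.header and headers.get(rule.header) == rule.header_value:
        return "canary"
    if rule.cookie:
        jar = SimpleCookie()
        jar.load(headers.get("cookie", ""))
        if rule.cookie in jar and jar[rule.cookie].value == rule.cookie_value:
            return "canary"
    if account_id in rule.accounts:
        return "canary"
    return "stable"
```

With `GrayRule(header="x-gray", header_value="1", accounts=frozenset({"acct-7"}))`, a request that carries `x-gray: 1` or that belongs to `acct-7` reaches the new version, while all other traffic stays on the stable one. Widening the rollout then only requires a rule change, not a second gateway stack.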
Programmable Switch for Data‑Plane Gray‑Scale
The Barefoot Tofino‑based switch acts as a gray‑scale gateway. It offers up to 64 × 100 Gbps interfaces (6.4 Tbps of aggregate bandwidth), PPS performance of 4.4 M, and microsecond‑level latency. It supports consistent‑hash ECMP, custom hash fields (including tenant ID), and per‑TCP‑flow gray‑scale rules.
Example workflow: the programmable switch announces a VIP via BGP, hashes incoming traffic on the selected fields, and routes it to backend servers according to the gray‑scale policies. The rollout is combined with automated rollback testing and incremental VM migration.
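The hashing step in this workflow can be modeled in software as a hash ring keyed on the tenant ID and the TCP flow tuple. This is a conceptual sketch only: a Tofino switch implements the logic in its pipeline (typically in P4), not in Python. The `Ring` class, the virtual‑node count, and the backend names are hypothetical.

```python
import bisect
import hashlib


def _h(s):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, backends, vnodes=100):
        self._points = sorted(
            (_h(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._keys, _h(key)) % len(self._keys)
        return self._points[i][1]


def flow_key(tenant_id, src_ip, src_port, dst_ip, dst_port):
    """Hash fields: tenant ID plus the TCP flow tuple (an assumed choice)."""
    return f"{tenant_id}|{src_ip}:{src_port}|{dst_ip}:{dst_port}"


def select_backend(flow, gray_tenants, stable_ring, gray_ring):
    """Send flows of gray-listed tenants to the gray pool, others to stable."""
    ring = gray_ring if flow[0] in gray_tenants else stable_ring
    return ring.lookup(flow_key(*flow))
```

Because the lookup is a pure function of the flow fields, every packet of a TCP flow reaches the same backend. When a backend is removed from a ring, only the flows that were mapped to it move, which is the property that makes consistent‑hash ECMP suitable for incremental migration.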
Topic 3: Cloud Host Continuous Snapshot and Backup Design
Director Qiu Mo‑Jiong presented UCloud’s “Data Ark” snapshot solution, which replaces traditional OpenStack internal/external snapshots with an automated continuous backup system that records changes at one‑second granularity.
Limitations of OpenStack Snapshots
Internal snapshots couple the snapshot file with the original disk, are complex, and lack raw format support. External snapshots separate files but still involve heavy management and performance penalties, especially at large snapshot counts.
Data Ark Architecture
Data Ark records real‑time IO streams from VMs, stores them in SSD tiers, and periodically merges them into hourly, daily, and base layers. This layered approach decouples backup storage from the primary host, reduces intrusion, and enables rapid restores (e.g., a 1 TB disk can be recovered in ~10 minutes using parallel reads across multiple storage servers).
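The layering idea can be illustrated with a small model in which every recorded write is a `(timestamp, block, data)` tuple and a layer is a map from block number to contents. The talk does not describe the real on‑disk format or merge schedule, so `apply_writes`, `restore`, and `compact` are hypothetical names for the sketch.

```python
def apply_writes(layer, writes):
    """Fold (ts, block, data) writes into a {block: data} layer.

    Writes are applied in timestamp order, so the latest write to a
    block wins. The input layer is not modified.
    """
    merged = dict(layer)
    for _ts, block, data in sorted(writes, key=lambda w: w[0]):
        merged[block] = data
    return merged


def restore(base, log, target_ts):
    """Rebuild the disk image as it was at `target_ts`."""
    return apply_writes(base, [w for w in log if w[0] <= target_ts])


def compact(base, log, cutoff_ts):
    """Merge writes up to `cutoff_ts` into a new base layer.

    Returns (new_base, remaining_log). Points in time before the cutoff
    can no longer be restored individually from the returned pair.
    """
    old = [w for w in log if w[0] <= cutoff_ts]
    new = [w for w in log if w[0] > cutoff_ts]
    return apply_writes(base, old), new
```

Compaction trades granularity for space: the fine‑grained log is kept only for the recent window, and older history collapses into coarser layers. On the restore figure quoted above, recovering 1 TB in about 10 minutes implies roughly 1.7 GB/s of aggregate throughput (10^12 bytes over 600 seconds), which is consistent with the reads being spread across multiple storage servers.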
During restore, Data Ark streams are merged back into the target disk or cloud disk, leveraging the cloud disk’s own sharding for high‑throughput writes. For local disks, a copy‑on‑read strategy allows the VM to start before the full snapshot is written.
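The copy‑on‑read strategy for local disks can be sketched as a block device that fetches a block from the snapshot the first time it is read and fills in the rest in the background. The sketch assumes whole‑block reads and writes, and a `fetch` callable that stands in for the remote snapshot read. `CopyOnReadDisk` and its methods are hypothetical names.

```python
class CopyOnReadDisk:
    """Serve reads from the local disk, pulling missing blocks on demand."""

    def __init__(self, fetch, num_blocks):
        self._fetch = fetch        # hypothetical: block number -> bytes
        self._num_blocks = num_blocks
        self._local = {}           # blocks already present on the local disk

    def read(self, block):
        # First read of a block copies it from the snapshot to the local disk.
        if block not in self._local:
            self._local[block] = self._fetch(block)
        return self._local[block]

    def write(self, block, data):
        # A whole-block write supersedes the snapshot contents, so the
        # block never needs to be fetched afterwards.
        self._local[block] = data

    def background_fill(self, max_blocks):
        """Copy up to `max_blocks` not-yet-local blocks; return the count."""
        copied = 0
        for block in range(self._num_blocks):
            if copied >= max_blocks:
                break
            if block not in self._local:
                self._local[block] = self._fetch(block)
                copied += 1
        return copied

    def hydrated(self):
        return len(self._local) == self._num_blocks
```

The VM can start issuing IO as soon as the disk object exists. Blocks that the guest touches are fetched first, and `background_fill` can be called repeatedly until `hydrated()` returns true.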
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
