How Alibaba Cloud’s CSI Layered Storage Delivers SSD Speed with Cloud‑Disk Reliability
In the cloud‑native era, Alibaba Cloud’s CSI‑based hierarchical storage combines local NVMe SSD performance with cloud‑disk durability, offering a three‑layer design, operational simplicity, and up to 100× IOPS gains for database and AI workloads.
Background
In cloud‑native environments, database and AI workloads require both the ultra‑low latency of local NVMe SSDs and the durability of cloud block storage. A hierarchical storage solution built on the Kubernetes Container Storage Interface (CSI) combines these properties.
Design Trade‑offs
Local NVMe SSD: provides millions of IOPS and sub-millisecond latency, but data is lost if the node fails.
Cloud block storage (e.g., Alibaba Cloud ESSD/EBS): offers persistence, snapshots, and elastic scaling, but performance is limited by network bandwidth.
Architecture – Three‑Layer CSI Driver
The driver runs as a standard CSI plugin on each node and uses Linux dm‑cache to present a single virtual block device that merges three layers:
Origin layer: the remote cloud block device.
Cache layer: a fast local SSD (or a RAID-0 array of multiple NVMe disks).
Metadata layer: a small loop device that stores the dm-cache metadata. (How these three layers map onto a single dm-cache table line is sketched below.)
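To make the mapping concrete, the snippet below shows how a node plugin might assemble the dm-cache target line that ties the three layers together. It is a minimal sketch: the device paths (/dev/loop1 for metadata, /dev/loop2 for the cache file, /dev/loop0 for the cloud-disk-backed origin), the csi-cache name, the 256 KiB cache-block size, and the writeback mode are illustrative assumptions, not values taken from the driver.

```go
package main

import (
	"fmt"
	"os/exec"
)

// buildCacheTable assembles a dm-cache target line of the form
//   "<start> <length> cache <metadata dev> <cache dev> <origin dev>
//    <block size> <#features> <features...> <policy> <#policy args>"
// where start, length, and block size are counted in 512-byte sectors.
func buildCacheTable(originSectors int64, metaDev, cacheDev, originDev string) string {
	return fmt.Sprintf("0 %d cache %s %s %s 512 1 writeback default 0",
		originSectors, metaDev, cacheDev, originDev)
}

func main() {
	// Hypothetical devices: loop1 = metadata, loop2 = cache, loop0 = origin
	// (a loop device over a pre-allocated file on the cloud disk).
	// 251658240 sectors ≈ 120 GiB of origin capacity.
	table := buildCacheTable(251658240, "/dev/loop1", "/dev/loop2", "/dev/loop0")

	// "dmsetup create" activates the table; the merged device then appears as
	// /dev/mapper/csi-cache (and a /dev/dm-N node) for the container runtime.
	out, err := exec.Command("dmsetup", "create", "csi-cache", "--table", table).CombinedOutput()
	fmt.Println(string(out), err)
}
```

Writeback mode gives the largest write gains but keeps dirty blocks only on the local SSD until they are flushed; writethrough trades that risk for lower write throughput.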
Key implementation steps:
Aggregate one or more NVMe disks into a RAID‑0 array with mdadm to increase bandwidth.
Format the RAID device with XFS, raising the allocation-group count (agcount) so the filesystem sustains high-concurrency I/O.
Pre‑allocate a file on the cloud disk using fallocate, then expose it as a block device with losetup. This preserves the original cloud‑disk format.
Create three block devices (metadata, cache, origin) and combine them with dm-cache, exposing the merged device (/dev/dm-0) to containers, as sketched below.
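Read together, the four steps translate into a short sequence of privileged commands on the node. The Go sketch below (in the style of a CSI node plugin shelling out to system tools) shows one plausible ordering. All paths, sizes, and names are assumptions; the sketch places both the cache and the metadata as pre-allocated files on the RAID's XFS filesystem, exposed through loop devices, which the real driver may do differently.

```go
package main

import (
	"log"
	"os/exec"
)

// run executes one privileged setup command and aborts on failure.
func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// 1. Aggregate the local NVMe disks into a RAID-0 array for bandwidth.
	run("mdadm", "--create", "/dev/md0", "--run", "--level=0",
		"--raid-devices=2", "/dev/nvme1n1", "/dev/nvme2n1")

	// 2. Format the array with XFS, raising the allocation-group count for
	//    concurrent I/O, and mount it to hold the cache and metadata files.
	run("mkfs.xfs", "-f", "-d", "agcount=32", "/dev/md0")
	run("mount", "/dev/md0", "/mnt/cache")

	// 3. Pre-allocate backing files. The origin file lives on the cloud
	//    disk's existing filesystem (mounted at /mnt/clouddisk here), so the
	//    disk's original format is preserved.
	run("fallocate", "-l", "120G", "/mnt/clouddisk/origin.img")
	run("fallocate", "-l", "100G", "/mnt/cache/cachedata.img")
	run("fallocate", "-l", "1G", "/mnt/cache/metadata.img")

	// Expose the files as block devices with losetup.
	run("losetup", "/dev/loop0", "/mnt/clouddisk/origin.img")
	run("losetup", "/dev/loop1", "/mnt/cache/metadata.img")
	run("losetup", "/dev/loop2", "/mnt/cache/cachedata.img")

	// 4. Combine metadata, cache, and origin with dm-cache (same table format
	//    as the earlier sketch); the merged /dev/dm-N device is what pods get.
	run("dmsetup", "create", "csi-cache", "--table",
		"0 251658240 cache /dev/loop1 /dev/loop2 /dev/loop0 512 1 writeback default 0")
}
```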
Operational Benefits
Online elastic scaling: both the cache size and the cloud-disk capacity can be expanded without pod disruption (a resize sketch follows this list).
Multi-attach support: because metadata resides locally, the same cloud disk can be attached read-only to multiple nodes.
Fast cloud-backup expansion: snapshots and capacity growth are performed on the origin layer while the cache continues serving I/O.
Automatic failover: if no node with a local SSD is available, workloads are scheduled on instances that use only the cloud disk.
Zero-intrusion migration: the CSI driver presents a standard block-storage interface, so existing applications can mount it without modification.
Minimal daemon footprint: no additional user-space daemons are required, reducing CPU and memory overhead.
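The resize path referenced in the first bullet deserves a concrete illustration. The sketch below grows the origin layer online under the same assumptions as the earlier examples (a /dev/loop0 loop device over a pre-allocated file, a dm-cache device named csi-cache, and an XFS filesystem from the merged device mounted at /data); the names and the exact control flow are assumptions, not the driver's actual code path.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// run executes one privileged command and aborts on failure.
func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// Assumed new origin size after the cloud disk has been enlarged: 200 GiB.
	const newSectors = int64(200) * 1024 * 1024 * 1024 / 512

	// 1. Grow the pre-allocated origin file, then let the loop device
	//    pick up the extra capacity.
	run("fallocate", "-l", "200G", "/mnt/clouddisk/origin.img")
	run("losetup", "-c", "/dev/loop0")

	// 2. Swap in a dm-cache table with the new origin length. I/O is only
	//    suspended for the swap; the cache contents are kept.
	table := fmt.Sprintf(
		"0 %d cache /dev/loop1 /dev/loop2 /dev/loop0 512 1 writeback default 0",
		newSectors)
	run("dmsetup", "suspend", "csi-cache")
	run("dmsetup", "reload", "csi-cache", "--table", table)
	run("dmsetup", "resume", "csi-cache")

	// 3. If a filesystem sits on the merged device, grow it online as well.
	run("xfs_growfs", "/data")
}
```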
Performance Evaluation
The evaluation used a 120 GB Alibaba Cloud ESSD volume as the origin layer and a 100 GB local SSD (RAID-0) as the cache layer. Results:
Random read: 1,620 IOPS (baseline) → 224,000 IOPS with dm-cache.
Sequential write (write-back): 132 MiB/s → 3,500 MiB/s.
Note: dm‑cache may bypass the cache for pure sequential reads to protect SSD endurance, but the observed gains in random reads and sequential writes are significant for latency‑sensitive workloads.
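For readers who want to sanity-check numbers like these on their own nodes, the snippet below issues 4 KiB O_DIRECT random reads against a block device for ten seconds and reports queue-depth-1 IOPS. It is only a rough probe, not the benchmark behind the figures above (a tool such as fio with many concurrent jobs is the usual choice), and the /dev/dm-0 path is a placeholder.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"math/rand"
	"os"
	"syscall"
	"time"
	"unsafe"
)

const blockSize = 4096 // 4 KiB reads; O_DIRECT needs aligned sizes and offsets

// alignedBuf returns a blockSize slice whose backing memory is 4 KiB aligned,
// as required for O_DIRECT reads.
func alignedBuf() []byte {
	raw := make([]byte, blockSize*2)
	off := int(uintptr(unsafe.Pointer(&raw[0])) % blockSize)
	if off != 0 {
		off = blockSize - off
	}
	return raw[off : off+blockSize]
}

func main() {
	dev := "/dev/dm-0" // placeholder path to the merged dm-cache device
	f, err := os.OpenFile(dev, os.O_RDONLY|syscall.O_DIRECT, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	size, err := f.Seek(0, io.SeekEnd) // device size in bytes
	if err != nil {
		log.Fatal(err)
	}
	blocks := size / blockSize

	buf := alignedBuf()
	deadline := time.Now().Add(10 * time.Second)
	ops := 0
	for time.Now().Before(deadline) {
		// Read one random, block-aligned 4 KiB extent.
		if _, err := f.ReadAt(buf, rand.Int63n(blocks)*blockSize); err != nil {
			log.Fatal(err)
		}
		ops++
	}
	fmt.Printf("4 KiB random reads: %d ops in 10 s (~%d IOPS at queue depth 1)\n",
		ops, ops/10)
}
```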
Comparison with Alternative Approaches
LVM: embeds its metadata in the cloud-disk header, breaking portability and requiring manual cleanup during pod migration.
Self-managed Ceph or similar clusters:
Reliability: a distributed cluster has a larger fault domain and risks cascading failures.
Complexity: it requires dedicated storage operators to run OSDs and monitors, whereas the CSI driver needs no such components.
Performance: the driver's local-bus cache avoids 10 Gb/25 Gb network bottlenecks.
Functionality: native cloud-disk features such as snapshots and elastic scaling are retained.
Reference Implementation
The driver source code and deployment manifests are available at:
https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver