
How Alibaba Achieved Extreme Database Elasticity with Hybrid Cloud, Containers, and Storage‑Compute Separation

This article explains how Alibaba transformed its database infrastructure through hybrid‑cloud high‑performance ECS, container‑based multi‑instance deployment, and a user‑space storage‑compute separation architecture with RDMA, dramatically improving resource utilization, scaling speed, and cost efficiency for massive traffic spikes.

Alibaba Cloud Developer

Databases are resource-intensive software that depends heavily on CPU, memory, and disk. Their SQL workloads consume varying amounts of I/O and CPU depending on the execution plan, so instance specifications must be abstracted before diverse database instances can share the same physical machines efficiently.

To support peak traffic such as Alibaba's Double 11 events, several elasticity strategies are employed: borrowing standard public-cloud resources and returning them after the promotion; co-locating databases with existing workloads (mixing by workload class and by time slice); scaling up and down quickly to shorten the period for which resources are held; and splitting large database instances into smaller ones so they can use fragmented resources.
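The last strategy, splitting large instances so they fit into leftover capacity, can be illustrated with a minimal first-fit placement sketch. The host capacities, the core counts, and the `first_fit` helper are hypothetical and chosen only for illustration; the article does not describe Alibaba's actual scheduler.

```python
from typing import List, Optional


def first_fit(demands: List[int], free: List[int]) -> Optional[List[int]]:
    """Place each demand (in cores) on the first host with enough free cores.

    Returns the chosen host index per demand, or None if any demand
    cannot be placed. Hypothetical helper, not a real scheduler API.
    """
    remaining = list(free)  # copy so the caller's list is not mutated
    placement = []
    for demand in demands:
        for host, capacity in enumerate(remaining):
            if capacity >= demand:
                remaining[host] -= demand
                placement.append(host)
                break
        else:
            # No host had enough free capacity for this demand.
            return None
    return placement


# Hypothetical leftover capacity (free cores) on four hosts.
free_cores = [6, 5, 4, 4]

# One 16-core instance fits nowhere, although 19 cores are free in total.
whole = first_fit([16], free_cores)

# The same workload split into four 4-core instances can be placed.
sharded = first_fit([4, 4, 4, 4], free_cores)

print("single large instance:", whole)
print("four small instances:", sharded)
```

The point of the sketch is only that total free capacity can exceed the demand while no single host can serve it; smaller instances make that stranded capacity usable.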

The cost of a promotion equals held resources multiplied by holding time; thus, generic cloud resources and fast containerized deployment are key to reducing holding periods. Alibaba’s database elasticity has progressed through three stages: hybrid‑cloud elasticity, container elasticity, and storage‑compute separation elasticity.
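The cost relation above can be written out as a short sketch. The machine count, the holding periods, and the per-day unit price are hypothetical numbers chosen for illustration, not figures from Alibaba.

```python
def promotion_cost(held_units: float, holding_days: float,
                   unit_price_per_day: float = 1.0) -> float:
    """Cost = held resources x holding time (x an assumed unit price)."""
    return held_units * holding_days * unit_price_per_day


# Hypothetical scenario: the same 1000 machines held for 30 days vs. 3 days.
slow_cycle = promotion_cost(held_units=1000, holding_days=30)
fast_cycle = promotion_cost(held_units=1000, holding_days=3)

saving = 1 - fast_cycle / slow_cycle
print(f"slow cycle: {slow_cycle:.0f} machine-days")
print(f"fast cycle: {fast_cycle:.0f} machine-days")
print(f"relative saving: {saving:.0%}")
```

Because the relation is a plain product, shortening the holding period reduces cost in direct proportion, which is why faster deployment matters as much as cheaper resources.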

Hybrid‑cloud elasticity began in 2015 when the team experimented with running databases on high‑performance ECS instances. By leveraging user‑space networking (DPDK) and storage (SPDK) technologies, they achieved less than 10% performance loss compared to local disks, laying the groundwork for later storage‑compute separation breakthroughs.

Container elasticity addressed the limitations of single‑machine multi‑instance deployments (OOM, I/O contention, security, and master‑slave consistency) by adopting containers. Containers provide standardized specifications that decouple databases from hardware models, enable namespace‑based mixing of different database types and versions, and allow seamless integration with other application workloads. By 2017, Alibaba’s database fleet was nearly 100% containerized, delivering a 10‑point utilization gain and faster resource delivery.

Storage‑compute separation elasticity tackles the remaining bottlenecks of moving data to ECS and the high cost of scaling beyond public‑cloud billing cycles. By separating storage and compute, only compute resources need to be expanded during traffic spikes, while storage capacity remains pooled and reused, dramatically lowering costs. Tests showed that with SSD latency around 100‑200 µs and network latency under 10 µs, the combined path can stay within ~500 µs, and the architecture achieved 700 µs response times for 25 G TCP networks.
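The latency reasoning above can be checked with a small budget calculation. The SSD and network figures come from the paragraph; the assumption that each I/O crosses the network twice (request and response) and the `path_latency_us` helper are illustrative only.

```python
def path_latency_us(ssd_us: float, network_hop_us: float,
                    hops: int = 2, software_us: float = 0.0) -> float:
    """Estimate one remote I/O as device time plus network traversals.

    hops=2 assumes one request and one response traversal; this is an
    illustrative assumption, not a detail taken from the article.
    """
    return ssd_us + hops * network_hop_us + software_us


BUDGET_US = 500  # target from the text: combined path within ~500 microseconds

# Upper ends of the quoted ranges: 200 us SSD, 10 us per network traversal.
worst_case = path_latency_us(ssd_us=200, network_hop_us=10)
headroom = BUDGET_US - worst_case

print(f"device + network: {worst_case:.0f} us")
print(f"headroom for software stack and replication: {headroom:.0f} us")
```

Under these assumptions the device and the network together use well under half of the budget, which leaves the remainder for the software path between the database engine and the storage nodes.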

The final piece is a full user-space I/O chain: the X-DB engine calls DBFS, a user-space file system built on Alibaba's Pangu distributed storage and accessed via RDMA, bypassing the kernel and the page cache. This design delivers latency comparable to a local Ext4 file system accessed through the kernel, while supporting high throughput (≈2 GB/s per instance) and a stable 500 µs disk response time under load.

Operational challenges include incompatibility between container bridge networking and RDMA, requiring host‑network mode for containers, and integrating hybrid‑cloud VPC connectivity for RDMA‑enabled databases. Despite these hurdles, the architecture enabled rapid, ten‑minute cluster scaling and three‑day full‑promotion expansion in 2018, cutting resource waste and improving reliability.

Overall, Alibaba’s journey from high‑performance ECS to containerization and finally to a user‑space storage‑compute separation architecture demonstrates how modern database systems can achieve extreme elasticity, cost efficiency, and performance at massive scale.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
