How Alibaba’s DBFS Achieved Storage‑Compute Separation for Massive 11.11 Sales

This article details Alibaba's journey from the 2017 pilot of storage‑compute separation to the 2018 large‑scale deployment of the DBFS user‑space file system, highlighting innovations such as zero‑copy I/O, RDMA integration, adaptive page cache, asynchronous I/O, atomic writes, online resize, and hardware‑software co‑design that enabled elastic, high‑performance database operations during the Double‑11 shopping festival.

Alibaba Cloud Developer

Background

Facing a growing number of database instances and an ever larger storage footprint, Alibaba began experimenting in 2017 with a new architecture, storage‑compute separation, to reduce costs and improve scheduling efficiency for large‑scale promotions.

2017 Milestones

In 2017, leveraging Pangu and AliDFS (a Ceph fork), Alibaba achieved storage‑compute separation for 10% of transaction volume in the Zhangbei unit, establishing the foundation for full‑scale deployment in 2018.

2018 Technical Breakthroughs

2018 marked the transition from prototype to massive deployment, with a focus on performance, generality, and simplicity. The key innovation was the user‑space cluster file system DBFS, which enabled elastic scaling without data movement.

2.1 User‑Space Techniques

2.1.1 Zero‑Copy

DBFS bypasses the kernel, implementing a true zero‑copy I/O path that eliminates the double copy between user and kernel space, dramatically improving throughput and latency.

2.1.2 RDMA

By integrating RDMA with Pangu storage, DBFS achieves near‑SSD latency and high throughput across the network. The deployment behind Alibaba’s promotions is described as the largest RDMA cluster in the industry.

2.2 Page Cache

A touch‑count‑based LRU algorithm moves pages between hot and cool zones, preserving frequently accessed data during large table scans. Configurable page size aligns with database page sizes for optimal cache efficiency.

Touch‑count driven hot/cool migration

Configurable hot/cool ratio (default 2:8)

Adjustable page size to match database pages

Multi‑shard design for high concurrency
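
The hot/cool behaviour described above can be sketched in a few lines of C. This is a toy model under stated assumptions, not DBFS code: the names (`cache_t`, `cache_access`), the promotion threshold of two touches, and the ten‑slot array are all hypothetical, and a linear scan stands in for the real list structures and sharding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_PAGES 10   /* total cached pages (toy size, assumed) */
#define HOT_PAGES    2   /* 2:8 hot/cool split, mirroring the default ratio */
#define PROMOTE_AT   2   /* touches before a cool page turns hot (assumed) */

typedef struct {
    int64_t  page_no;    /* -1 marks an empty slot */
    unsigned touches;    /* touch count since load or demotion */
    uint64_t last_use;   /* logical clock of the latest access */
    int      hot;        /* 1 if the page is in the hot zone */
} slot_t;

typedef struct {
    slot_t   slots[CACHE_PAGES];
    uint64_t clock;
    int      hot_count;
} cache_t;

void cache_init(cache_t *c) {
    c->clock = 0;
    c->hot_count = 0;
    for (int i = 0; i < CACHE_PAGES; i++)
        c->slots[i] = (slot_t){ .page_no = -1, .touches = 0, .last_use = 0, .hot = 0 };
}

static slot_t *find(cache_t *c, int64_t page_no) {
    for (int i = 0; i < CACHE_PAGES; i++)
        if (c->slots[i].page_no == page_no) return &c->slots[i];
    return NULL;
}

/* Least recently used slot of one zone; an empty slot wins immediately. */
static slot_t *oldest(cache_t *c, int hot) {
    slot_t *best = NULL;
    for (int i = 0; i < CACHE_PAGES; i++) {
        slot_t *s = &c->slots[i];
        if (s->hot != hot) continue;
        if (s->page_no < 0) return s;   /* empty slots are always cool */
        if (!best || s->last_use < best->last_use) best = s;
    }
    return best;
}

/* Returns 1 on a cache hit, 0 on a miss (the page is then loaded as cool). */
int cache_access(cache_t *c, int64_t page_no) {
    assert(page_no >= 0);
    slot_t *s = find(c, page_no);
    if (!s) {
        s = oldest(c, 0);               /* evict only from the cool zone */
        s->page_no = page_no;
        s->touches = 1;
        s->last_use = ++c->clock;
        return 0;
    }
    s->touches++;
    s->last_use = ++c->clock;
    if (!s->hot && s->touches >= PROMOTE_AT) {
        if (c->hot_count == HOT_PAGES) {   /* hot zone full: demote its oldest page */
            slot_t *d = oldest(c, 1);
            d->hot = 0;
            d->touches = 0;
            c->hot_count--;
        }
        s->hot = 1;
        c->hot_count++;
    }
    return 1;
}

int cache_contains(cache_t *c, int64_t page_no) { return find(c, page_no) != NULL; }

int cache_is_hot(cache_t *c, int64_t page_no) {
    slot_t *s = find(c, page_no);
    return s != NULL && s->hot;
}
```

Because a table scan touches each page only once, scanned pages never reach the promotion threshold and only displace one another in the cool zone, while pages that were touched repeatedly stay in the hot zone.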

2.3 Asynchronous I/O

DBFS implements lock‑free queues, configurable I/O depth, and adaptive polling to reduce CPU consumption while supporting diverse database I/O patterns.
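
As an illustration of the lock‑free queue idea, here is a minimal single‑producer, single‑consumer ring buffer built on C11 atomics. It is a sketch, not the DBFS implementation: the names (`io_queue_t`, `ioq_submit`, `ioq_reap`), the request layout, and the depth of 8 are assumptions, and adaptive polling is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IO_DEPTH 8   /* queue depth (assumed value); must be a power of two */
_Static_assert((IO_DEPTH & (IO_DEPTH - 1)) == 0, "IO_DEPTH must be a power of two");

typedef struct {
    uint64_t offset;   /* byte offset of the request */
    uint32_t length;   /* request size in bytes */
} io_req_t;

typedef struct {
    io_req_t       ring[IO_DEPTH];
    _Atomic size_t head;   /* next slot to consume, advanced by the consumer */
    _Atomic size_t tail;   /* next slot to fill, advanced by the producer */
} io_queue_t;

void ioq_init(io_queue_t *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false when IO_DEPTH requests are already in flight. */
bool ioq_submit(io_queue_t *q, io_req_t req) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == IO_DEPTH) return false;
    q->ring[tail & (IO_DEPTH - 1)] = req;
    /* Release publishes the slot contents before the new tail becomes visible. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the queue is empty. */
bool ioq_reap(io_queue_t *q, io_req_t *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false;
    *out = q->ring[head & (IO_DEPTH - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Neither side takes a lock: the producer only writes `tail`, the consumer only writes `head`, and the queue depth bounds how many requests can be outstanding at once.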

2.4 Atomic Write

Atomic write guarantees prevent partial (torn) page writes, so InnoDB can safely disable its double‑write buffer. Each dirty page is then sent to storage once instead of twice, which the source describes as a 100% bandwidth saving under storage‑compute separation.
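
The arithmetic behind the bandwidth claim can be written down directly. The helper below is hypothetical and only counts page payload bytes, assuming the double‑write buffer stores one extra copy of every flushed page.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes sent to storage when flushing `pages` dirty pages of `page_size` bytes.
 * Assumption: with the double-write buffer on, every page is written twice. */
uint64_t flush_bytes(uint64_t pages, uint32_t page_size, int doublewrite) {
    assert(page_size > 0);
    uint64_t copies = doublewrite ? 2 : 1;
    return pages * page_size * copies;
}
```

For 1,000 pages of 16 KiB (InnoDB's default page size), this gives 32,768,000 bytes with the double‑write buffer and 16,384,000 bytes without it, so the same network bandwidth carries twice as many page flushes.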

2.5 Online Resize

DBFS uses a lock‑free bitmap allocator to resize volumes online without data migration, eliminating the need for reserved spare capacity.
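
A lock‑free bitmap allocator with online growth might look roughly like the sketch below. It is not DBFS's on‑disk or in‑memory layout: `volume_t`, `vol_alloc`, `vol_grow`, and the fixed `MAX_BLOCKS` bound are assumptions made for illustration, and shrinking is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_BLOCKS 1024            /* largest size the bitmap is laid out for (assumed) */
#define WORDS (MAX_BLOCKS / 64)

typedef struct {
    _Atomic uint64_t bits[WORDS];  /* one bit per block, 1 = in use */
    _Atomic uint32_t nblocks;      /* current volume size in blocks */
} volume_t;

void vol_init(volume_t *v, uint32_t nblocks) {
    assert(nblocks <= MAX_BLOCKS);
    for (int i = 0; i < WORDS; i++) atomic_init(&v->bits[i], 0);
    atomic_init(&v->nblocks, nblocks);
}

/* Returns the index of a newly claimed block, or -1 if the volume is full. */
int64_t vol_alloc(volume_t *v) {
    uint32_t limit = atomic_load(&v->nblocks);
    for (uint32_t b = 0; b < limit; b++) {
        _Atomic uint64_t *w = &v->bits[b / 64];
        uint64_t mask = UINT64_C(1) << (b % 64);
        if (atomic_load(w) & mask) continue;   /* already taken */
        /* fetch_or returns the old word: if the bit was clear, this caller won it. */
        if (!(atomic_fetch_or(w, mask) & mask)) return (int64_t)b;
    }
    return -1;
}

void vol_free(volume_t *v, uint32_t b) {
    assert(b < atomic_load(&v->nblocks));
    atomic_fetch_and(&v->bits[b / 64], ~(UINT64_C(1) << (b % 64)));
}

/* Online grow: publish a larger block count. Existing blocks keep their
 * indexes, so no data moves. Returns 0 on success, -1 if the request is too large. */
int vol_grow(volume_t *v, uint32_t new_nblocks) {
    if (new_nblocks > MAX_BLOCKS) return -1;
    uint32_t cur = atomic_load(&v->nblocks);
    while (cur < new_nblocks) {
        /* On failure, cur is reloaded with the latest published size. */
        if (atomic_compare_exchange_weak(&v->nblocks, &cur, new_nblocks)) break;
    }
    return 0;
}
```

Growing the volume only raises the published block count, which is why allocations can continue while the resize is in progress.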

2.6 TCP↔RDMA Switching

DBFS and Pangu support seamless switching between RDMA and TCP, so traffic can fall back to TCP when the RDMA path is unavailable. Extensive capacity stress testing helped ensure a stable large‑scale RDMA deployment.
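
One simple way to structure such switching is a channel that prefers RDMA and moves to TCP after a failed send. The sketch below is purely illustrative: `channel_t`, the function‑pointer interface, and the stub transports are hypothetical and do not reflect the actual DBFS or Pangu interfaces.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*send_fn)(const void *buf, size_t len);   /* returns 0 on success */

typedef struct {
    send_fn rdma_send;   /* preferred path */
    send_fn tcp_send;    /* fallback path */
    int     using_tcp;   /* 1 after a fallback, until RDMA is restored */
} channel_t;

/* Tries RDMA first; on failure, switches the channel to TCP and resends.
 * Resending assumes the request is safe to repeat. */
int channel_send(channel_t *ch, const void *buf, size_t len) {
    assert(ch->rdma_send != NULL && ch->tcp_send != NULL);
    if (!ch->using_tcp) {
        if (ch->rdma_send(buf, len) == 0) return 0;
        ch->using_tcp = 1;
    }
    return ch->tcp_send(buf, len);
}

/* Called once the RDMA path is known to be healthy again. */
void channel_restore_rdma(channel_t *ch) { ch->using_tcp = 0; }

/* Stub transports for illustration only: the RDMA stub fails while rdma_up is 0. */
int rdma_up = 1;
int stub_rdma(const void *buf, size_t len) { (void)buf; (void)len; return rdma_up ? 0 : -1; }
int stub_tcp(const void *buf, size_t len)  { (void)buf; (void)len; return 0; }
```

The caller sees a single `channel_send` entry point and does not need to know which transport carried the request.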

2.7 2018 Promotion Deployment

DBFS successfully passed the end‑to‑end tests of the Double‑11 promotion, validating the practicality of storage‑compute separation.

DBFS as a Storage‑Midplane Solution

3.1 Technical Consolidation and Enablement

All innovations are packaged into DBFS, offering user‑space access to various storage media and empowering databases to achieve storage‑compute separation.

3.1.1 POSIX Compatibility

DBFS supports most POSIX file interfaces and glibc calls, simplifying database integration.

FILE *fopen(const char* path, const char* mode);
FILE *fdopen(int fd, const char* mode);
size_t fread(void* ptr, size_t size, size_t nmemb, FILE *stream);
size_t fwrite(const void* ptr, size_t size, size_t nmemb, FILE *stream);
int fflush(FILE *stream);
int fclose(FILE *stream);
int fileno(FILE *stream);
int feof(FILE *stream);
int ferror(FILE *stream);
void clearerr(FILE *stream);
int fseeko(FILE *stream, off_t offset, int whence);
int fseek(FILE *stream, long offset, int whence);
off_t ftello(FILE *stream);
long ftell(FILE *stream);
void rewind(FILE *stream);
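
To show what this compatibility means for database code, the sketch below writes one page at an offset and reads it back using only calls from the list above. It runs against ordinary glibc. How DBFS routes these calls to its own implementation is not described here, and the helper name `page_round_trip` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* exposes fseeko/ftello under strict ISO C modes */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Writes `page` at byte `offset` of `path`, then reads it back into `out`.
 * Returns 0 on success, -1 on any I/O error. */
int page_round_trip(const char *path, off_t offset,
                    const void *page, void *out, size_t page_size) {
    assert(page != NULL && out != NULL);
    FILE *fp = fopen(path, "w+b");
    if (!fp) return -1;

    int rc = -1;
    if (fseeko(fp, offset, SEEK_SET) == 0 &&
        fwrite(page, 1, page_size, fp) == page_size &&
        fflush(fp) == 0 &&                       /* required before switching to reads */
        fseeko(fp, offset, SEEK_SET) == 0 &&
        fread(out, 1, page_size, fp) == page_size &&
        ftello(fp) == offset + (off_t)page_size)
        rc = 0;

    if (fclose(fp) != 0) rc = -1;
    return rc;
}
```

Code written this way depends only on the standard stream interface, which is what allows a compatible file system to be swapped in underneath it.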

3.1.2 FUSE Integration

FUSE lets DBFS interoperate with the Linux VFS, so databases can adopt DBFS without code changes.

3.1.3 Service‑Oriented Capability

DBFS implements a lock‑free shared‑memory IPC (shmQ) to support both PostgreSQL (process‑based) and MySQL (thread‑based) workloads, delivering sub‑microsecond latency for large pages.

3.1.4 Cluster File System

DBFS provides a shared‑disk cluster mode with one‑write‑multiple‑read capability, supporting both shared‑disk and shared‑nothing architectures, and offering configurable master/slave roles for high availability.

3.2 Hardware‑Software Co‑Design

3.2.1 Persistent File Cache

Using Intel Optane, DBFS implements a durable local cache that boosts read/write performance under storage‑compute separation, with features such as fault handling, dynamic enable/disable, load balancing, metrics collection, and data scrubbing.

3.2.2 Open‑Channel SSD

Collaboration with X‑Engine and Fusion Engine leverages object‑SSD and SPDK user‑space techniques to reduce SSD wear, improve throughput, and minimize read/write interference.

Conclusion and Outlook

By 2018, DBFS powered large‑scale Double‑11 promotions, enabled one‑write‑multiple‑read for ADS and Tair, and achieved full compatibility with PostgreSQL, MySQL, and Linux VFS, establishing itself as a true storage‑midplane product. Future work will integrate more hardware innovations, tiered storage, and NVMe‑oF to further empower databases.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
