How Alibaba’s DBFS Achieved Storage‑Compute Separation for Massive 11.11 Sales
This article details Alibaba's journey from the 2017 pilot of storage‑compute separation to the 2018 large‑scale deployment of the DBFS user‑space file system, highlighting innovations such as zero‑copy I/O, RDMA integration, adaptive page cache, asynchronous I/O, atomic writes, online resize, and hardware‑software co‑design that enabled elastic, high‑performance database operations during the Double‑11 shopping festival.
Background
Facing a growing number of database instances and an ever larger storage footprint, Alibaba began experimenting in 2017 with a new architecture, storage‑compute separation, to reduce costs and improve scheduling efficiency for large‑scale promotions.
2017 Milestones
In 2017, leveraging Pangu and AliDFS (a Ceph fork), Alibaba achieved storage‑compute separation for 10% of transaction volume in the Zhangbei unit, establishing the foundation for full‑scale deployment in 2018.
2018 Technical Breakthroughs
2018 marked the transition from prototype to massive deployment, focusing on performance, universality, and simplicity. The key innovation was the user‑space cluster file system DBFS, which powered elastic scaling without data movement.
2.1 User‑Space Techniques
2.1.1 Zero‑Copy
DBFS bypasses the kernel, implementing a true zero‑copy I/O path that eliminates the double copy between user and kernel space, dramatically improving throughput and latency.
2.1.2 RDMA
By integrating RDMA with Pangu storage, DBFS delivers latency close to that of a local SSD and high throughput across the network. This work supports what the article describes as the industry's largest RDMA clusters, built for Alibaba's promotions.
2.2 Page Cache
A touch‑count‑based LRU algorithm moves pages between hot and cool zones, preserving frequently accessed data during large table scans. Configurable page size aligns with database page sizes for optimal cache efficiency.
Touch‑count driven hot/cool migration
Configurable hot‑cold ratio (default 2:8)
Adjustable page size to match database pages
Multi‑shard design for high concurrency
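The hot/cool migration described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not DBFS code: the slot count, the promotion threshold `PROMOTE_TOUCHES`, and all type and function names are hypothetical, and a linear scan stands in for the sharded structures a real cache would use. New pages enter the cool zone; a page is promoted only after repeated touches, so a one‑pass table scan evicts other cool pages but leaves the hot zone alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS     10  /* total cached pages (illustrative size) */
#define HOT_SLOTS        2  /* 2:8 hot:cool split, matching the default ratio */
#define PROMOTE_TOUCHES  2  /* assumed threshold for moving a page to the hot zone */

typedef struct {
    bool     used;
    bool     hot;
    uint64_t page_id;
    uint32_t touches;
    uint64_t last_used;  /* logical clock value of the last access */
} slot_t;

typedef struct {
    slot_t   slots[CACHE_SLOTS];
    uint64_t clock;
    size_t   hot_count;
} page_cache_t;

static void cache_init(page_cache_t *c) {
    assert(c != NULL);
    *c = (page_cache_t){0};
}

/* Least recently used page in the requested zone, or NULL if the zone is empty. */
static slot_t *oldest_in_zone(page_cache_t *c, bool hot) {
    slot_t *oldest = NULL;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        slot_t *s = &c->slots[i];
        if (s->used && s->hot == hot &&
            (oldest == NULL || s->last_used < oldest->last_used))
            oldest = s;
    }
    return oldest;
}

/* Access one page; returns true on a cache hit, false on a miss. */
static bool cache_access(page_cache_t *c, uint64_t page_id) {
    slot_t *free_slot = NULL;
    c->clock++;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        slot_t *s = &c->slots[i];
        if (!s->used) {
            if (free_slot == NULL) free_slot = s;
            continue;
        }
        if (s->page_id != page_id) continue;
        s->last_used = c->clock;
        s->touches++;
        if (!s->hot && s->touches >= PROMOTE_TOUCHES) {
            if (c->hot_count == HOT_SLOTS) {
                /* Hot zone full: demote its least recently used page. */
                slot_t *demoted = oldest_in_zone(c, true);
                demoted->hot = false;
                demoted->touches = 0;
                c->hot_count--;
            }
            s->hot = true;
            c->hot_count++;
        }
        return true;
    }
    /* Miss: the new page enters the cool zone, evicting the oldest cool page
     * when no slot is free. Hot pages are never evicted directly. */
    slot_t *victim = free_slot ? free_slot : oldest_in_zone(c, false);
    *victim = (slot_t){ .used = true, .hot = false, .page_id = page_id,
                        .touches = 1, .last_used = c->clock };
    return false;
}
```

With these sizes, two pages that have each been touched twice survive a scan of twenty single‑touch pages, while the scanned pages push each other out of the cool zone.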
2.3 Asynchronous I/O
DBFS implements lock‑free queues, configurable I/O depth, and adaptive polling to reduce CPU consumption while supporting diverse database I/O patterns.
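To make the lock‑free queue and configurable I/O depth more concrete, here is a minimal single‑producer, single‑consumer ring buffer built on C11 atomics. It is a generic sketch of the technique, not the DBFS implementation: `QUEUE_DEPTH`, `io_request_t`, and the function names are assumptions made for illustration, and adaptive polling is not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8  /* illustrative I/O depth */

typedef struct {
    uint64_t offset;
    uint32_t length;
    bool     is_write;
} io_request_t;

typedef struct {
    io_request_t  ring[QUEUE_DEPTH];
    atomic_size_t head;  /* next entry to consume; written only by the consumer */
    atomic_size_t tail;  /* next entry to fill; written only by the producer */
} io_queue_t;

static void io_queue_init(io_queue_t *q) {
    assert(q != NULL);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false when the queue already holds QUEUE_DEPTH requests. */
static bool io_submit(io_queue_t *q, io_request_t req) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_DEPTH) return false;
    q->ring[tail % QUEUE_DEPTH] = req;
    /* Release: the request contents become visible before the new tail. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: non-blocking poll; returns false when the queue is empty. */
static bool io_poll(io_queue_t *q, io_request_t *out) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return false;
    *out = q->ring[head % QUEUE_DEPTH];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither side needs a lock or a compare‑and‑swap loop; a queue with several producers or consumers would need a different design.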
2.4 Atomic Write
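Atomic writes guarantee that a page is never partially written, so InnoDB can safely disable its double‑write buffer. Doing so removes the second copy of every flushed page, a 100% overhead on page‑write bandwidth that is especially costly when every write crosses the network under storage‑compute separation.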
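A rough worked example of the bandwidth effect follows. It is a back‑of‑the‑envelope sketch only: the function name and page counts are illustrative, and the 16 KiB page size is InnoDB's default rather than a figure taken from this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INNODB_PAGE_BYTES 16384u  /* InnoDB's default page size (16 KiB) */

/* Bytes sent to storage when flushing `dirty_pages` data pages.
 * With the double-write buffer enabled, every page is written twice:
 * once to the double-write area and once to its final location. */
static uint64_t flush_bytes(uint64_t dirty_pages, bool doublewrite_on) {
    /* Guard against overflow for unrealistically large inputs. */
    assert(dirty_pages <= UINT64_MAX / (2u * INNODB_PAGE_BYTES));
    uint64_t copies = doublewrite_on ? 2 : 1;
    return dirty_pages * INNODB_PAGE_BYTES * copies;
}
```

For 1,000 dirty pages this gives 32,768,000 bytes with the double‑write buffer and 16,384,000 bytes without it, so the buffer doubles page‑write traffic and disabling it halves that traffic.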
2.5 Online Resize
DBFS uses a lock‑free bitmap allocator to resize volumes online without data migration, eliminating the need for reserved spare capacity.
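One way a lock‑free bitmap allocator can support growth without moving data is sketched below. This is an assumption about the general technique, not the DBFS on‑disk layout: `block_bitmap_t`, `MAX_BLOCKS`, and the function names are hypothetical. Each bit marks one block as allocated, allocation uses compare‑and‑swap on a 64‑bit word, and growing the volume only raises the usable block count, so existing allocations stay where they are.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BITMAP_WORDS 4
#define MAX_BLOCKS   (BITMAP_WORDS * 64)  /* illustrative upper bound */

typedef struct {
    _Atomic uint64_t words[BITMAP_WORDS];  /* bit set = block allocated */
    atomic_size_t    nblocks;              /* blocks currently usable */
} block_bitmap_t;

static void bitmap_init(block_bitmap_t *b, size_t nblocks) {
    assert(b != NULL && nblocks <= MAX_BLOCKS);
    for (size_t w = 0; w < BITMAP_WORDS; w++) atomic_init(&b->words[w], 0);
    atomic_init(&b->nblocks, nblocks);
}

/* Returns the allocated block index, or -1 if no block is free. */
static long bitmap_alloc(block_bitmap_t *b) {
    size_t limit = atomic_load(&b->nblocks);
    for (size_t w = 0; w * 64 < limit; w++) {
        uint64_t old = atomic_load(&b->words[w]);
        while (old != UINT64_MAX) {
            unsigned bit = 0;
            while (old & (UINT64_C(1) << bit)) bit++;  /* first clear bit */
            size_t block = w * 64 + bit;
            if (block >= limit) break;  /* beyond the current volume size */
            if (atomic_compare_exchange_weak(&b->words[w], &old,
                                             old | (UINT64_C(1) << bit)))
                return (long)block;
            /* CAS failed: `old` now holds the fresh word, so retry. */
        }
    }
    return -1;
}

static void bitmap_free(block_bitmap_t *b, size_t block) {
    assert(block < atomic_load(&b->nblocks));
    atomic_fetch_and(&b->words[block / 64], ~(UINT64_C(1) << (block % 64)));
}

/* Online grow: only the usable block count changes. Returns false if the
 * request is not a grow or exceeds MAX_BLOCKS. */
static bool bitmap_grow(block_bitmap_t *b, size_t new_nblocks) {
    if (new_nblocks > MAX_BLOCKS) return false;
    size_t cur = atomic_load(&b->nblocks);
    while (new_nblocks > cur) {
        if (atomic_compare_exchange_weak(&b->nblocks, &cur, new_nblocks))
            return true;
    }
    return false;
}
```

Shrinking is deliberately left out of the sketch, since it would have to deal with blocks already allocated above the new limit.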
2.6 TCP↔RDMA Switching
DBFS and Pangu provide seamless TCP‑to‑RDMA fallback, with extensive capacity stress testing ensuring stable large‑scale RDMA deployment.
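The fallback behavior can be pictured with a small state machine. Everything in this sketch is hypothetical: the transports are simulated stand‑ins (no real RDMA or socket calls), and `channel_t`, `channel_send`, and `channel_probe` are names invented for illustration. The article does not describe how DBFS and Pangu actually detect failures or switch back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TRANSPORT_RDMA, TRANSPORT_TCP } transport_kind_t;

typedef struct {
    transport_kind_t active;
    bool             rdma_healthy;  /* stand-in for a real link health signal */
    unsigned         fallbacks;     /* number of RDMA-to-TCP switches so far */
} channel_t;

/* Simulated transports: return 0 on success, -1 on failure. */
static int rdma_send(const channel_t *ch, const void *buf, size_t len) {
    (void)buf; (void)len;
    return ch->rdma_healthy ? 0 : -1;
}

static int tcp_send(const channel_t *ch, const void *buf, size_t len) {
    (void)ch; (void)buf; (void)len;
    return 0;
}

/* Send over RDMA when possible; on failure switch to TCP and retry there,
 * so the caller never sees the transport change. */
static int channel_send(channel_t *ch, const void *buf, size_t len) {
    assert(ch != NULL);
    if (ch->active == TRANSPORT_RDMA) {
        if (rdma_send(ch, buf, len) == 0) return 0;
        ch->active = TRANSPORT_TCP;
        ch->fallbacks++;
    }
    return tcp_send(ch, buf, len);
}

/* Periodic probe: return to RDMA once the link looks healthy again. */
static void channel_probe(channel_t *ch) {
    assert(ch != NULL);
    if (ch->active == TRANSPORT_TCP && ch->rdma_healthy)
        ch->active = TRANSPORT_RDMA;
}
```

The point of the design is that the switch happens below the send interface, which is what lets the database layer stay unaware of it.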
2.7 2018 Promotion Deployment
DBFS successfully passed the end‑to‑end tests of the Double‑11 promotion, validating the practicality of storage‑compute separation.
DBFS as a Storage‑Midplane Solution
3.1 Technical Consolidation and Enablement
All innovations are packaged into DBFS, offering user‑space access to various storage media and empowering databases to achieve storage‑compute separation.
3.1.1 POSIX Compatibility
DBFS supports most POSIX file interfaces and glibc calls, simplifying database integration.
FILE *fopen(const char* path, const char* mode);
FILE *fdopen(int fd, const char* mode);
size_t fread(void* ptr, size_t size, size_t nmemb, FILE *stream);
size_t fwrite(const void* ptr, size_t size, size_t nmemb, FILE *stream);
int fflush(FILE *stream);
int fclose(FILE *stream);
int fileno(FILE *stream);
int feof(FILE *stream);
int ferror(FILE *stream);
void clearerr(FILE *stream);
int fseeko(FILE *stream, off_t offset, int whence);
int fseek(FILE *stream, long offset, int whence);
off_t ftello(FILE *stream);
long ftell(FILE *stream);
void rewind(FILE *stream);
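As an illustration of what this compatibility means for callers, the sketch below uses only calls from the list above to write a small record at a page offset and read it back. It is ordinary standard C and POSIX code, shown here against a regular file; the function name, the record contents, and the 16 KiB offset are made up for the example, and nothing in it is specific to DBFS.

```c
#define _POSIX_C_SOURCE 200809L  /* for fseeko/ftello */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Writes a 16-byte record at a fixed offset, reads it back, and verifies it.
 * Returns 0 on success, -1 on any I/O error or mismatch. */
static int stdio_roundtrip(const char *path) {
    assert(path != NULL);
    const char  record[16] = "page-0001";  /* zero-padded to 16 bytes */
    char        readback[sizeof record] = {0};
    const off_t offset = 16384;            /* illustrative page offset */
    int         rc = -1;

    FILE *f = fopen(path, "w+b");
    if (f == NULL) return -1;

    if (fseeko(f, offset, SEEK_SET) != 0) goto out;
    if (fwrite(record, 1, sizeof record, f) != sizeof record) goto out;
    if (fflush(f) != 0) goto out;
    if (ftello(f) != offset + (off_t)sizeof record) goto out;

    if (fseeko(f, offset, SEEK_SET) != 0) goto out;
    if (fread(readback, 1, sizeof readback, f) != sizeof readback) goto out;
    if (memcmp(record, readback, sizeof record) == 0) rc = 0;

out:
    if (fclose(f) != 0) rc = -1;
    return rc;
}
```

Code written in this style needs no new API to learn, which is the integration benefit the section describes.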
3.1.2 FUSE Integration
FUSE (Filesystem in Userspace) lets DBFS plug into the Linux VFS, so databases can adopt DBFS without code changes.
3.1.3 Service‑Oriented Capability
DBFS implements a lock‑free shared‑memory IPC (shmQ) to support both PostgreSQL (process‑based) and MySQL (thread‑based) workloads, delivering sub‑microsecond latency for large pages.
3.1.4 Cluster File System
DBFS offers a cluster mode in which one node writes and multiple nodes read the same data (one‑write‑multiple‑read). It supports both shared‑disk and shared‑nothing architectures, and nodes can be configured as master or slave for high availability.
3.2 Hardware‑Software Co‑Design
3.2.1 Persistent File Cache
Using Intel Optane, DBFS implements a durable local cache that boosts read/write performance under storage‑compute separation, with features such as fault handling, dynamic enable/disable, load balancing, metrics collection, and data scrubbing.
3.2.2 Open‑Channel SSD
Collaboration with X‑Engine and Fusion Engine leverages object‑SSD and SPDK user‑space techniques to reduce SSD wear, improve throughput, and minimize read/write interference.
Conclusion and Outlook
By 2018, DBFS powered large‑scale Double‑11 promotions, enabled one‑write‑multiple‑read for ADS and Tair, and achieved full compatibility with PostgreSQL, MySQL, and Linux VFS, establishing itself as a true storage‑midplane product. Future work will integrate more hardware innovations, tiered storage, and NVMe‑oF to further empower databases.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
