How Baidu’s CFS Achieved Billion‑File Scale with a Lock‑Free Metadata Service
This article explains the design and evolution of Baidu Cloud File System's (CFS) metadata service, detailing how a novel lock‑free architecture and strategic data layout enable POSIX‑compatible, highly scalable storage that can handle billions of files while maintaining high performance and consistency.
1. Introduction
This article interprets Baidu's EuroSys 2023 paper "CFS: Scaling Metadata Service for Distributed File System via Pruned Scope of Critical Sections," walking through the design decisions behind it so readers can see where the breakthrough comes from.
2. Background
2.1 File System Concepts
A file system stores data and metadata in a hierarchical namespace where directories form a tree structure. Metadata includes attributes such as size, timestamps, and link counts, while data contains the file contents.
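The metadata/data split can be pictured with a minimal sketch. The field names below are illustrative, not CFS's actual schema; the point is only that attributes and contents live in separate stores keyed by inode ID.

```python
from dataclasses import dataclass, field
import time

# Hypothetical sketch of the metadata/data split; field names are
# illustrative, not CFS's on-disk schema.
@dataclass
class InodeAttrs:
    size: int = 0                                 # bytes of file content
    mtime: float = field(default_factory=time.time)
    nlink: int = 1                                # hard-link count
    mode: int = 0o644                             # permission bits

# Attributes and contents are kept in separate stores, keyed by inode ID.
metadata_store: dict[int, InodeAttrs] = {}
data_store: dict[int, bytes] = {}

metadata_store[42] = InodeAttrs(size=5)
data_store[42] = b"hello"
```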
Two major implementation styles exist: POSIX‑compatible and HDFS‑like. POSIX defines a comprehensive interface; HDFS simplifies some features for large‑scale workloads.
2.2 Abstracting Metadata Problems
The core abstraction requires correct implementation of POSIX semantics: operations that atomically modify a set of linked records (for example, creating or removing a directory entry must also adjust the parent directory's attributes) alongside fast point reads of inode attributes.
2.3 Distributed Metadata Service
Modern systems separate metadata and data services. While data services scale easily, metadata services face challenges due to hierarchical dependencies.
Key evaluation metrics are scalability (both size and performance), latency, and load balancing.
3. Evolution of CFS Metadata Architecture
3.1 Namespace 1.0 (Early Separation)
The initial design ruled out single-node architectures, adopting a split-layer approach: a database layer (TafDB) for namespace metadata and a separate file store for attributes.
3.2 Namespace 1.X (Iterative Optimizations)
Optimizations included simplifying write paths, caching, and pre‑processing transaction conflicts, which doubled write throughput but still lagged behind single‑node performance.
3.3 Namespace 2.0 (Lock‑Free Design)
By refining data layout, CFS reduced conflict domains to single shards and introduced atomic primitives that allow lock‑free updates, achieving both high scalability and low latency.
4. Implementation Details
4.1 Metadata Organization and Sharding
Each directory entry is split into an inode-ID record and an attributes record, both stored in TafDB under the primary key <kID, kStr>. Because sharding is based on kID alone, all records of a directory are guaranteed to reside in the same shard.
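A small sketch of this key layout, assuming kID identifies the parent directory and kStr the entry name (a reading of the article's description, not code from the paper): because the shard function ignores kStr, every record of one directory maps to the same shard.

```python
# Sketch of the <kID, kStr> layout: kID is assumed to be the parent
# directory's inode ID, kStr the child entry's name. Sharding looks only
# at kID, so a directory's records are always co-located on one shard.
NUM_SHARDS = 8

def shard_of(kID: int, kStr: str) -> int:
    # kStr is deliberately ignored by the shard choice.
    return kID % NUM_SHARDS

parent_dir = 7
shards = {shard_of(parent_dir, name) for name in ["a.txt", "b.txt", "subdir"]}
assert len(shards) == 1   # all entries of directory 7 land on one shard
```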
4.2 Reducing Distributed‑Lock Overhead
Operations are ordered to avoid cross-component conflicts: a create writes the file's attribute record first, then its inode-ID record; a delete reverses this order. This preserves consistency without global locks.
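The ordering discipline can be sketched as follows (function and store names are hypothetical). Since lookups go through the inode-ID record, writing attributes first means any visible entry already has complete metadata, and removing the entry first means attributes are never read after removal.

```python
# Sketch of the create/delete ordering described above; names are
# hypothetical. Readers resolve a path via the entry record, so the
# ordering guarantees no entry is ever visible without its attributes.
attr_store: dict[int, dict] = {}       # inode ID -> attributes
entry_store: dict[tuple, int] = {}     # (parent ID, name) -> inode ID

def create(parent: int, name: str, ino: int) -> None:
    attr_store[ino] = {"nlink": 1}       # step 1: write attributes
    entry_store[(parent, name)] = ino    # step 2: make the entry visible

def delete(parent: int, name: str) -> None:
    ino = entry_store.pop((parent, name))  # step 1: hide the entry
    del attr_store[ino]                    # step 2: reclaim attributes

create(1, "a.txt", 42)
delete(1, "a.txt")
```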
4.3 Single‑Shard Atomic Primitives
Three primitives (insert‑with‑update, delete‑with‑update, insert‑and‑delete‑with‑update) combine condition checks, reads, and writes into a single atomic transaction, reducing round‑trips to one.
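As a sketch of the first primitive, insert-with-update (the class and method below are illustrative assumptions, not CFS's API): within one shard, the condition check, the insert, and the parent-counter update execute under a single local transaction, so the client needs one round-trip and no distributed lock.

```python
import threading

# Sketch of an "insert-with-update" primitive. A per-shard lock stands in
# for the shard's local transaction: check, insert, and counter update
# happen atomically, in one round-trip from the client's perspective.
class Shard:
    def __init__(self):
        self._txn = threading.Lock()
        self.rows: dict = {}

    def insert_with_update(self, new_key, new_val, upd_key, counter) -> bool:
        with self._txn:
            if new_key in self.rows:              # condition check
                return False                      # entry already exists
            self.rows[new_key] = new_val          # insert child record
            self.rows[upd_key][counter] += 1      # bump parent counter
            return True

shard = Shard()
shard.rows[("attrs", 7)] = {"nchildren": 0}
ok = shard.insert_with_update(("entry", 7, "a.txt"), 42, ("attrs", 7), "nchildren")
```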
4.4 Conflict Merging
Numeric fields (links, children, size) use delta‑apply (additive merging), while timestamp and permission fields use last‑writer‑wins semantics, further shrinking conflict scope.
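The two merge rules can be sketched like this (the field classification follows the article; the merge function itself is an illustrative assumption). Delta-applied counters commute, so concurrent updates need no ordering; last-writer-wins fields simply overwrite.

```python
# Sketch of the two conflict-merge rules described above.
DELTA_FIELDS = {"nlink", "nchildren", "size"}   # additive merge
LWW_FIELDS = {"mtime", "mode"}                  # last-writer-wins

def merge(base: dict, update: dict) -> dict:
    out = dict(base)
    for k, v in update.items():
        if k in DELTA_FIELDS:
            out[k] = out.get(k, 0) + v   # apply delta: order-independent
        elif k in LWW_FIELDS:
            out[k] = v                   # newest writer wins
    return out

# Two +1 deltas accumulate regardless of order; mtime just overwrites.
a = merge({"nchildren": 5, "mtime": 100}, {"nchildren": 1})
b = merge(a, {"nchildren": 1, "mtime": 200})
```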
4.5 Removing the Metadata Proxy Layer
All operations except complex rename are handled by the client library, eliminating the separate proxy service.
4.6 Strong‑Consistency Rename
Rename is split into fast‑path (single‑shard) and normal‑path (Raft‑based) handling, ensuring correctness even under concurrent modifications.
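The path split can be sketched as a dispatch on shard placement (the dispatch logic below is an illustrative assumption, not CFS's code): if both parents' records live on one shard, a single shard-local transaction moves the entry atomically; otherwise the cross-shard move goes through the coordinated path.

```python
NUM_SHARDS = 8

def shard_of(kID: int) -> int:
    return kID % NUM_SHARDS

# Sketch of the fast-path/normal-path split for rename.
def rename(src_parent: int, dst_parent: int) -> str:
    if shard_of(src_parent) == shard_of(dst_parent):
        # Fast path: both directories' records live on one shard, so one
        # shard-local transaction moves the entry atomically.
        return "fast-path"
    # Normal path: cross-shard move; the article describes a Raft-based
    # coordinator that serializes it for strong consistency.
    return "normal-path"
```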
4.7 Garbage Collection
Periodic reconciliation and on‑demand cleanup reclaim orphaned records caused by failures.
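A reconciliation pass can be sketched as a scan for unreferenced records (names hypothetical): an attribute record whose inode ID no directory entry points to was orphaned by a failure mid-operation, and gets reclaimed.

```python
# Sketch of periodic reconciliation: attribute records referenced by no
# directory entry are orphans left by failures, and are reclaimed.
def collect_orphans(attr_store: dict, entry_store: dict) -> list:
    live = set(entry_store.values())
    orphans = [ino for ino in attr_store if ino not in live]
    for ino in orphans:
        del attr_store[ino]   # reclaim the orphaned record
    return orphans

attrs = {42: {"nlink": 1}, 99: {"nlink": 1}}
entries = {(1, "a.txt"): 42}            # inode 99 has no referencing entry
orphans = collect_orphans(attrs, entries)
```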
5. Evaluation
Experiments on a 50‑node cluster show CFS achieving 1.22‑75.82× higher throughput and up to 91.71% lower latency compared to HopsFS and InfiniFS, especially under high contention and large directories.
6. Conclusion
CFS demonstrates that a carefully designed, lock‑free metadata service can overcome the scalability and latency limitations of traditional distributed file system architectures, supporting billions of files with strong POSIX compatibility in production for over three years.
Baidu Intelligent Cloud Tech Hub
