CFS: Scaling Metadata Service for Distributed File System via Pruned Scope of Critical Sections - Baidu's Implementation Journey
Baidu’s CFS metadata service scales to billions of files by shrinking critical sections through a lock‑free Namespace 2.0 design that confines conflicts to single shards, uses field‑level atomic primitives, and integrates the proxy into the client, delivering up to 76× throughput gains and significant latency reductions in production.
This article provides an in-depth technical analysis of the metadata system of Baidu Cloud File Storage (CFS), based on the team's EuroSys 2023 paper. The authors explain how they addressed a long-standing challenge in file system metadata: balancing POSIX compatibility with high scalability. Write scalability in particular is critical if a distributed file system is to scale to hundreds of billions of files while maintaining high performance.
The article begins by establishing the abstraction model for file system metadata, explaining the requirements for write operations (associated changes, rename) and read operations (point reads for lookup and getattr, range reads for readdir). It then traces the evolution of metadata service architectures through three phases: single-point metadata architecture, coupled distributed metadata architecture, and decoupled metadata architecture.
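The read side of this abstraction can be illustrated with a toy in-memory table. This is a minimal sketch, not the CFS data model: the `Namespace` class, the `(parent_inode, name)` key layout, and the method names are assumptions made for illustration. It shows why lookup and getattr are point reads on one key while readdir is a range read over all keys sharing a parent.

```python
import bisect


class Namespace:
    """Toy namespace table keyed by (parent_inode, name). Hypothetical layout."""

    def __init__(self):
        self._keys = []   # sorted list of (parent_inode, name) keys
        self._rows = {}   # (parent_inode, name) -> attribute dict

    def create(self, parent_inode, name, inode, mode=0o644):
        key = (parent_inode, name)
        if key in self._rows:
            raise FileExistsError(name)
        bisect.insort(self._keys, key)
        self._rows[key] = {"inode": inode, "mode": mode}

    def lookup(self, parent_inode, name):
        # Point read: exactly one key is touched.
        return self._rows[(parent_inode, name)]["inode"]

    def getattr(self, parent_inode, name):
        # Point read returning a copy of the attribute record.
        return dict(self._rows[(parent_inode, name)])

    def readdir(self, parent_inode):
        # Range read: every key whose first component is parent_inode.
        lo = bisect.bisect_left(self._keys, (parent_inode, ""))
        hi = bisect.bisect_left(self._keys, (parent_inode + 1, ""))
        return [name for _, name in self._keys[lo:hi]]


ns = Namespace()
ns.create(1, "b.txt", inode=10)
ns.create(1, "a.txt", inode=11)
ns.create(2, "c.txt", inode=12)
print(ns.lookup(1, "a.txt"), ns.readdir(1))
```

Because the keys are sorted by parent first, a directory listing is a single contiguous scan, which is the property a range read relies on.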
The core innovation lies in Namespace 2.0, which achieves lock-free operations by progressively shrinking the scope of the critical section. The approach involves three key steps: (1) using an appropriate data layout to confine conflicts to a single shard, (2) implementing single-shard atomic primitives that reduce row-level conflicts to field-level atomic operations with automatic conflict merging, and (3) streamlining the metadata proxy layer by integrating it into the client.
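Step (1) can be made concrete with a small sketch. Everything here is an assumption for illustration: the modulo partitioning, the shard count, the reserved `"/_ATTR"` row name, and the helper names are hypothetical rather than taken from the CFS implementation. The idea shown is only that if a directory's attribute row and its child entries share the same partition key, a file create touches rows on one shard.

```python
NUM_SHARDS = 8          # hypothetical shard count
ATTR_NAME = "/_ATTR"    # hypothetical reserved name for a directory's own attributes


def shard_of(dir_inode: int) -> int:
    """Partition by the first key component only (assumed modulo scheme)."""
    return dir_inode % NUM_SHARDS


def rows_touched_by_create(parent_inode: int, name: str):
    """A create inserts the child entry and updates the parent's attributes."""
    attr_row = (parent_inode, ATTR_NAME)   # children count, mtime, ...
    entry_row = (parent_inode, name)       # the new directory entry
    return [attr_row, entry_row]


def shards_touched(rows):
    return {shard_of(parent) for parent, _ in rows}


rows = rows_touched_by_create(42, "report.csv")
print(shards_touched(rows))  # a single shard id
```

With both rows on one shard, the update needs no cross-shard coordination, which is what allows the conflict to be handled by a single-shard primitive in step (2).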
The system architecture consists of four components: the Namespace storage layer (TafDB), the file storage layer (FileStore), the rename service (Renamer), and the client library (ClientLib). The implementation introduces two merge rules for concurrent updates: delta apply for numeric attributes (links, children, size) and last-writer-wins for overwritten attributes (permissions, mtime, ctime).
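The two merge rules can be sketched as follows. This is a minimal illustration under stated assumptions, not TafDB's interface: `apply_update`, the dictionary record, and the integer timestamps are hypothetical. Numeric fields receive a signed delta, so concurrent updates commute; overwrite fields keep the value with the newest timestamp, so an older write that arrives late is ignored.

```python
NUMERIC_FIELDS = {"links", "children", "size"}     # merged by delta apply
OVERWRITE_FIELDS = {"mode", "mtime", "ctime"}      # merged by last-writer-wins


def new_dir_attrs():
    """Hypothetical directory attribute record."""
    return {"links": 2, "children": 0, "size": 0,
            "mode": 0o755, "mtime": 0, "ctime": 0,
            "_stamps": {}}  # per-field timestamp of the last accepted overwrite


def apply_update(attrs, ts, deltas=None, overwrites=None):
    """Apply one field-level update to a record held by a single shard."""
    for name, delta in (deltas or {}).items():
        if name not in NUMERIC_FIELDS:
            raise ValueError(f"{name} is not a numeric field")
        attrs[name] += delta                       # delta apply: order-independent
    for name, value in (overwrites or {}).items():
        if name not in OVERWRITE_FIELDS:
            raise ValueError(f"{name} is not an overwrite field")
        if ts > attrs["_stamps"].get(name, -1):    # last-writer-wins by timestamp
            attrs[name] = value
            attrs["_stamps"][name] = ts
    return attrs


# Two creates in the same directory, arriving in either order.
a = new_dir_attrs()
apply_update(a, ts=10, deltas={"children": 1}, overwrites={"mtime": 10})
apply_update(a, ts=20, deltas={"children": 1}, overwrites={"mtime": 20})

b = new_dir_attrs()
apply_update(b, ts=20, deltas={"children": 1}, overwrites={"mtime": 20})
apply_update(b, ts=10, deltas={"children": 1}, overwrites={"mtime": 10})

print(a["children"], a["mtime"], b["children"], b["mtime"])
```

Both orders converge to the same record, so neither create has to hold a row lock while the other runs. The sketch assumes timestamps are unique per field; with equal timestamps the first arrival wins here, and a real system would need a deterministic tie-break.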
Experimental results show that CFS achieves 1.76-75.82x throughput improvement over HopsFS and 1.22-4.10x over InfiniFS, with up to 91.71% and 54.54% latency reduction respectively. The system has been running stably in production for over three years, supporting big data, AI, container, and life sciences workloads.
Baidu Geek Talk