
CFS: Scaling Metadata Service for Distributed File System via Pruned Scope of Critical Sections - Baidu's Implementation Journey

Baidu’s CFS metadata service scales to billions of files by shrinking critical sections through a lock‑free Namespace 2.0 design that confines conflicts to single shards, uses field‑level atomic primitives, and integrates the proxy into the client, delivering up to 76× throughput gains and significant latency reductions in production.

Baidu Geek Talk

This article gives an in-depth technical analysis of the metadata system of Baidu Cloud File Storage (CFS), based on the team's EuroSys 2023 paper. The authors explain how they addressed a long-standing challenge in file system metadata: balancing POSIX compatibility with high scalability, particularly write scalability, which is critical for a distributed file system that must scale to hundreds of billions of files while maintaining high performance.

The article begins by establishing the abstraction model for file system metadata, explaining the requirements for write operations (associated changes, rename) and read operations (point reads for lookup and getattr, range reads for readdir). It then traces the evolution of metadata service architectures through three phases: single-point metadata architecture, coupled distributed metadata architecture, and decoupled metadata architecture.
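The two read patterns named above can be illustrated with a toy in-memory table. This is a minimal sketch, not the CFS implementation: the class `MetadataTable`, its method names, and the choice of `(parent_inode, name)` as the sort key are assumptions made for illustration.

```python
import bisect


class MetadataTable:
    """Toy sorted table keyed by (parent_inode, name). Hypothetical, for illustration."""

    def __init__(self):
        self._keys = []   # sorted list of (parent_inode, name) keys
        self._rows = {}   # key -> attribute dict

    def put(self, parent, name, attrs):
        key = (parent, name)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = dict(attrs)

    def lookup(self, parent, name):
        """Point read: resolve one path component under a parent directory."""
        return self._rows.get((parent, name))

    def readdir(self, parent):
        """Range read: scan all keys that share the same parent, in name order."""
        start = bisect.bisect_left(self._keys, (parent, ""))
        names = []
        for p, name in self._keys[start:]:
            if p != parent:
                break
            names.append(name)
        return names


table = MetadataTable()
table.put(1, "etc", {"inode": 2, "type": "dir"})
table.put(1, "bin", {"inode": 3, "type": "dir"})
table.put(2, "hosts", {"inode": 4, "type": "file"})
```

Because all children of one directory are adjacent in key order, `readdir` becomes a single contiguous scan, while `lookup` and `getattr`-style reads are single-key fetches.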

The core innovation lies in Namespace 2.0, which achieves lock-free operations by progressively shrinking the critical section scope. The approach involves three key steps: (1) using an appropriate data layout to confine conflicts to a single shard, (2) implementing single-shard atomic primitives that reduce row-level conflicts to field-level atomic operations with automatic conflict merging, and (3) streamlining the metadata proxy layer by integrating it into the client.
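Step (1) can be sketched as a routing rule. The example below is a simplified illustration under stated assumptions, not the actual TafDB layout: `NUM_SHARDS`, the reserved name `ATTR`, and every function name are hypothetical. It assumes rows are routed by directory inode ID alone, and that a directory's own attribute row is keyed under that same ID.

```python
NUM_SHARDS = 4    # hypothetical shard count
ATTR = "/_ATTR"   # hypothetical reserved name for a directory's own attribute row


def shard_of(key):
    """Route a row by the first key component only (the directory inode ID)."""
    dir_id, _name = key
    return dir_id % NUM_SHARDS


def rows_for_create_colocated(parent_id, name):
    """Rows changed by a create: the new entry plus the parent's attribute row,
    both keyed under parent_id."""
    return [(parent_id, name), (parent_id, ATTR)]


def rows_for_create_naive(grandparent_id, parent_name, parent_id, name):
    """Contrast case: the parent's attributes live in its own directory entry,
    which is keyed under the grandparent."""
    return [(parent_id, name), (grandparent_id, parent_name)]


def shards_touched(keys):
    return {shard_of(k) for k in keys}


colocated = shards_touched(rows_for_create_colocated(2, "new.txt"))
naive = shards_touched(rows_for_create_naive(1, "dir", 2, "new.txt"))
```

In this sketch the co-located layout always yields one shard per create, so the conflict can be resolved locally, whereas the contrast layout can span two shards and would need cross-shard coordination.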

The system architecture consists of four components: the namespace storage layer (TafDB), the file storage layer (FileStore), the rename service (Renamer), and the client library (ClientLib). The implementation introduces concepts such as delta apply for numeric attributes (links, children, size) and last-writer-win for overwritten attributes (permissions, mtime, ctime).
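The two merge rules can be sketched as follows. This is an illustrative model only: the `Update` type, the `apply_updates` function, and the logical timestamp `ts` are assumptions, as is the simplification that every update is newer than the stored row and that timestamps on the same field are unique.

```python
from dataclasses import dataclass

DELTA_FIELDS = {"links", "children", "size"}      # merged by adding deltas
LWW_FIELDS = {"permissions", "mtime", "ctime"}    # merged by last-writer-win


@dataclass
class Update:
    field: str
    value: int
    ts: int = 0   # hypothetical logical timestamp; only used for LWW fields


def apply_updates(row, updates):
    """Merge field-level updates into a copy of row, in any arrival order."""
    merged = dict(row)
    last_ts = {}  # field -> timestamp of the write currently kept
    for u in updates:
        if u.field in DELTA_FIELDS:
            # Delta apply: addition commutes, so concurrent writers never conflict.
            merged[u.field] = merged.get(u.field, 0) + u.value
        elif u.field in LWW_FIELDS:
            # Last-writer-win: keep only the value with the newest timestamp.
            if u.ts > last_ts.get(u.field, -1):
                merged[u.field] = u.value
                last_ts[u.field] = u.ts
        else:
            raise ValueError(f"unknown field: {u.field}")
    return merged


row = {"children": 2, "mtime": 100}
updates = [
    Update("children", +1),
    Update("mtime", 205, ts=5),
    Update("children", +1),
    Update("mtime", 203, ts=3),
]
result = apply_updates(row, updates)
result_reversed = apply_updates(row, list(reversed(updates)))
```

Both rules are order-independent under these assumptions, which is why two concurrent creates in the same directory can each bump `children` and `mtime` without serializing on the whole row.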

Experimental results show that CFS achieves 1.76-75.82x throughput improvement over HopsFS and 1.22-4.10x over InfiniFS, with up to 91.71% and 54.54% latency reduction respectively. The system has been running stably in production for over three years, supporting big data, AI, container, and life sciences workloads.

scalability, distributed file system, metadata service, Baidu CFS, EuroSys 2023, lock-free design, POSIX compatibility