How Baidu’s CFS Achieved Billion‑File Scale with a Lock‑Free Metadata Service
This article explains the design and evolution of Baidu Cloud File System's (CFS) metadata service, detailing how a novel lock‑free architecture and strategic data layout enable POSIX‑compatible, highly scalable storage that can handle billions of files while maintaining high performance and consistency.
1. Introduction
This article interprets Baidu's EuroSys 2023 paper "CFS: Scaling Metadata Service for Distributed File System via Pruned Scope of Critical Sections," walking through the design decisions behind it so readers can see where the breakthrough comes from.
2. Background
2.1 File System Concepts
A file system stores data and metadata in a hierarchical namespace where directories form a tree structure. Metadata includes attributes such as size, timestamps, and link counts, while data contains the file contents.
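The metadata/data split can be pictured with a minimal sketch. The field names below are illustrative, not CFS's actual schema; the point is only that attributes and contents live in separate stores keyed by inode ID.

```python
from dataclasses import dataclass, field
import time

# Hypothetical sketch of the metadata/data split; field names are
# illustrative, not CFS's on-disk schema.
@dataclass
class InodeAttrs:
    size: int = 0                                 # bytes of file content
    mtime: float = field(default_factory=time.time)
    nlink: int = 1                                # hard-link count
    mode: int = 0o644                             # permission bits

# Attributes and contents are kept in separate stores, keyed by inode ID.
metadata_store: dict[int, InodeAttrs] = {}
data_store: dict[int, bytes] = {}

metadata_store[42] = InodeAttrs(size=5)
data_store[42] = b"hello"
```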
Two major implementation styles exist: POSIX‑compatible and HDFS‑like. POSIX defines a comprehensive interface; HDFS simplifies some features for large‑scale workloads.
2.2 Abstracting Metadata Problems
The core abstraction requires correct implementation of POSIX semantics: operations that atomically modify a set of linked records (for example, creating or removing a directory entry must also adjust the parent directory's attributes) alongside fast point reads of inode attributes.
2.3 Distributed Metadata Service
Modern systems separate metadata and data services. While data services scale easily, metadata services face challenges due to hierarchical dependencies.
Key evaluation metrics are scalability (both size and performance), latency, and load balancing.
3. Evolution of CFS Metadata Architecture
3.1 Namespace 1.0 (Early Separation)
The initial design ruled out single-node architectures, adopting a split-layer approach: a database layer (TafDB) for namespace metadata and a separate file store for attributes.
3.2 Namespace 1.X (Iterative Optimizations)
Optimizations included simplifying write paths, caching, and pre‑processing transaction conflicts, which doubled write throughput but still lagged behind single‑node performance.
3.3 Namespace 2.0 (Lock‑Free Design)
By refining data layout, CFS reduced conflict domains to single shards and introduced atomic primitives that allow lock‑free updates, achieving both high scalability and low latency.
4. Implementation Details
4.1 Metadata Organization and Sharding
Each directory entry is split into an inode-ID record and an attributes record, both stored in TafDB under the primary key <kID, kStr>. Because sharding is based on kID alone, all records of a directory are guaranteed to reside in the same shard.
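A small sketch of this key layout, assuming kID identifies the parent directory and kStr the entry name (a reading of the article's description, not code from the paper): because the shard function ignores kStr, every record of one directory maps to the same shard.

```python
# Sketch of the <kID, kStr> layout: kID is assumed to be the parent
# directory's inode ID, kStr the child entry's name. Sharding looks only
# at kID, so a directory's records are always co-located on one shard.
NUM_SHARDS = 8

def shard_of(kID: int, kStr: str) -> int:
    # kStr is deliberately ignored by the shard choice.
    return kID % NUM_SHARDS

parent_dir = 7
shards = {shard_of(parent_dir, name) for name in ["a.txt", "b.txt", "subdir"]}
assert len(shards) == 1   # all entries of directory 7 land on one shard
```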
4.2 Reducing Distributed‑Lock Overhead
Operations are ordered to avoid cross-component conflicts: a create writes the file's attribute record first, then its inode-ID record; a delete reverses this order. This preserves consistency without global locks.
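The ordering discipline can be sketched as follows (function and store names are hypothetical). Since lookups go through the inode-ID record, writing attributes first means any visible entry already has complete metadata, and removing the entry first means attributes are never read after removal.

```python
# Sketch of the create/delete ordering described above; names are
# hypothetical. Readers resolve a path via the entry record, so the
# ordering guarantees no entry is ever visible without its attributes.
attr_store: dict[int, dict] = {}       # inode ID -> attributes
entry_store: dict[tuple, int] = {}     # (parent ID, name) -> inode ID

def create(parent: int, name: str, ino: int) -> None:
    attr_store[ino] = {"nlink": 1}       # step 1: write attributes
    entry_store[(parent, name)] = ino    # step 2: make the entry visible

def delete(parent: int, name: str) -> None:
    ino = entry_store.pop((parent, name))  # step 1: hide the entry
    del attr_store[ino]                    # step 2: reclaim attributes

create(1, "a.txt", 42)
delete(1, "a.txt")
```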
4.3 Single‑Shard Atomic Primitives
Three primitives (insert‑with‑update, delete‑with‑update, insert‑and‑delete‑with‑update) combine condition checks, reads, and writes into a single atomic transaction, reducing round‑trips to one.
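As a sketch of the first primitive, insert-with-update (the class and method below are illustrative assumptions, not CFS's API): within one shard, the condition check, the insert, and the parent-counter update execute under a single local transaction, so the client needs one round-trip and no distributed lock.

```python
import threading

# Sketch of an "insert-with-update" primitive. A per-shard lock stands in
# for the shard's local transaction: check, insert, and counter update
# happen atomically, in one round-trip from the client's perspective.
class Shard:
    def __init__(self):
        self._txn = threading.Lock()
        self.rows: dict = {}

    def insert_with_update(self, new_key, new_val, upd_key, counter) -> bool:
        with self._txn:
            if new_key in self.rows:              # condition check
                return False                      # entry already exists
            self.rows[new_key] = new_val          # insert child record
            self.rows[upd_key][counter] += 1      # bump parent counter
            return True

shard = Shard()
shard.rows[("attrs", 7)] = {"nchildren": 0}
ok = shard.insert_with_update(("entry", 7, "a.txt"), 42, ("attrs", 7), "nchildren")
```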
4.4 Conflict Merging
Numeric fields (links, children, size) use delta‑apply (additive merging), while timestamp and permission fields use last‑writer‑wins semantics, further shrinking conflict scope.
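The two merge rules can be sketched like this (the field classification follows the article; the merge function itself is an illustrative assumption). Delta-applied counters commute, so concurrent updates need no ordering; last-writer-wins fields simply overwrite.

```python
# Sketch of the two conflict-merge rules described above.
DELTA_FIELDS = {"nlink", "nchildren", "size"}   # additive merge
LWW_FIELDS = {"mtime", "mode"}                  # last-writer-wins

def merge(base: dict, update: dict) -> dict:
    out = dict(base)
    for k, v in update.items():
        if k in DELTA_FIELDS:
            out[k] = out.get(k, 0) + v   # apply delta: order-independent
        elif k in LWW_FIELDS:
            out[k] = v                   # newest writer wins
    return out

# Two +1 deltas accumulate regardless of order; mtime just overwrites.
a = merge({"nchildren": 5, "mtime": 100}, {"nchildren": 1})
b = merge(a, {"nchildren": 1, "mtime": 200})
```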
4.5 Removing the Metadata Proxy Layer
All operations except complex rename are handled by the client library, eliminating the separate proxy service.
4.6 Strong‑Consistency Rename
Rename is split into fast‑path (single‑shard) and normal‑path (Raft‑based) handling, ensuring correctness even under concurrent modifications.
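The path split can be sketched as a dispatch on shard placement (the dispatch logic below is an illustrative assumption, not CFS's code): if both parents' records live on one shard, a single shard-local transaction moves the entry atomically; otherwise the cross-shard move goes through the coordinated path.

```python
NUM_SHARDS = 8

def shard_of(kID: int) -> int:
    return kID % NUM_SHARDS

# Sketch of the fast-path/normal-path split for rename.
def rename(src_parent: int, dst_parent: int) -> str:
    if shard_of(src_parent) == shard_of(dst_parent):
        # Fast path: both directories' records live on one shard, so one
        # shard-local transaction moves the entry atomically.
        return "fast-path"
    # Normal path: cross-shard move; the article describes a Raft-based
    # coordinator that serializes it for strong consistency.
    return "normal-path"
```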
4.7 Garbage Collection
Periodic reconciliation and on‑demand cleanup reclaim orphaned records caused by failures.
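A reconciliation pass can be sketched as a scan for unreferenced records (names hypothetical): an attribute record whose inode ID no directory entry points to was orphaned by a failure mid-operation, and gets reclaimed.

```python
# Sketch of periodic reconciliation: attribute records referenced by no
# directory entry are orphans left by failures, and are reclaimed.
def collect_orphans(attr_store: dict, entry_store: dict) -> list:
    live = set(entry_store.values())
    orphans = [ino for ino in attr_store if ino not in live]
    for ino in orphans:
        del attr_store[ino]   # reclaim the orphaned record
    return orphans

attrs = {42: {"nlink": 1}, 99: {"nlink": 1}}
entries = {(1, "a.txt"): 42}            # inode 99 has no referencing entry
orphans = collect_orphans(attrs, entries)
```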
5. Evaluation
Experiments on a 50‑node cluster show CFS achieving 1.22‑75.82× higher throughput and up to 91.71% lower latency compared to HopsFS and InfiniFS, especially under high contention and large directories.
6. Conclusion
CFS demonstrates that a carefully designed, lock‑free metadata service can overcome the scalability and latency limitations of traditional distributed file system architectures, supporting billions of files with strong POSIX compatibility in production for over three years.
Baidu Intelligent Cloud Tech Hub
