
How Baidu CFS Scales to Billions of Files with a Lock‑Free Metadata Service

This article explains Baidu's CFS architecture for building a billion‑file‑scale distributed file system, covering basic file system concepts, POSIX limitations, metadata service modeling, performance metrics, evolution of metadata architectures, and CFS's lock‑free design that achieves high scalability, low latency, and balanced load in cloud storage.


Background

This material is compiled from the DataFunSummit 2023 "Data Infrastructure Summit – Large‑Scale Storage Architecture Forum" and presents the talk titled "Baidu CangHai·Storage: Building a Distributed File System with Billions of Files".

The talk is divided into three parts: basic file‑system concepts, metadata‑service modeling and analysis, and the CFS metadata‑service architecture.

1. Basic Concepts of File Systems

1.1 What Is a File System?

File storage is a ubiquitous form of storage on every phone and computer. When we browse directories or manipulate files through a file explorer, we are interacting with a file system, which is characterized by a hierarchical directory‑tree structure.

The authoritative standard for file systems is POSIX, which defines required interfaces but does not constrain implementations, leading to diverse designs. POSIX was created before large‑scale distributed file systems existed, so it lacks specifications for core distributed challenges such as multi‑machine consistency.

Because full compliance is costly in both performance and implementation effort, no file system achieves 100 % POSIX compatibility. HDFS, for example, diverges from POSIX significantly, yet the analysis and solutions presented here apply to HDFS as well as to other distributed file systems.

1.2 POSIX Specifications for File Systems

POSIX defines three categories of interfaces:

File operations (open, close, read, write, etc.)

Directory‑tree operations (create, delete, list children, etc.)

Attribute operations (get/set permissions, timestamps, ownership, etc.). Directory‑tree and attribute operations are collectively called metadata operations.

POSIX also specifies detailed behavior for many calls; two notable points are the handling of close (POSIX does not require data to be persisted on close, which can break consistency in distributed systems) and directory traversal (POSIX does not require sorted output, though tools like ls sort the results).
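Both behaviors are easy to observe from user code. Below is a minimal Python sketch (the file name is arbitrary) showing why an application that needs durability must call fsync explicitly before close, and why directory listings come back unsorted:

```python
import os

# close() does not guarantee durability: POSIX allows data to still sit in
# OS (or, in a distributed FS, client-side) buffers after close returns.
fd = os.open("example.txt", os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"hello\n")
os.fsync(fd)   # explicitly flush to stable storage before relying on the data
os.close(fd)

# Directory traversal output is not required to be sorted; tools like `ls`
# sort the entries themselves before printing.
entries = os.listdir(".")
print(entries)          # arbitrary order, as returned by the file system
print(sorted(entries))  # what `ls` effectively shows
```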

1.3 History of File‑System Development

Single‑machine era – e.g., EXT4, XFS.

Dedicated‑hardware era – external disks (iSCSI/NVMe-oF), shared‑disk file systems (OCFS2), and the emergence of NAS and parallel file systems for HPC/AI.

Software‑defined era – commodity servers running distributed file systems, with consistency handled by protocols such as Paxos or Raft; this is the dominant approach today.

1.4 Components of Modern Distributed File Systems

Modern systems consist of three parts: client, metadata service, and data service.

Client: provides the entry point (SDK for HDFS, NFS/SMB for POSIX, or system‑call interception).

Metadata service: maintains the hierarchical structure and attributes, supporting POSIX or HDFS protocols.

Data service: stores file data, handles layout, redundancy (replication or erasure coding), and I/O patterns.
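The split of responsibilities can be summarized with a few illustrative interfaces. The method names below are assumptions made for exposition only, not CFS's actual APIs:

```python
from typing import Protocol

class MetadataService(Protocol):
    """Owns the directory tree and attributes (POSIX or HDFS semantics)."""
    def lookup(self, parent_inode: int, name: str) -> int: ...
    def create(self, parent_inode: int, name: str, mode: int) -> int: ...
    def getattr(self, inode: int) -> dict: ...

class DataService(Protocol):
    """Stores file contents; decides layout and redundancy (replication or EC)."""
    def read(self, inode: int, offset: int, length: int) -> bytes: ...
    def write(self, inode: int, offset: int, data: bytes) -> int: ...

class Client:
    """Entry point: translates an SDK call, an NFS/SMB request, or an
    intercepted system call into metadata and data requests."""
    def __init__(self, meta: MetadataService, data: DataService) -> None:
        self.meta = meta
        self.data = data

    def open_and_read(self, parent_inode: int, name: str, size: int) -> bytes:
        inode = self.meta.lookup(parent_inode, name)   # metadata round trip
        return self.data.read(inode, 0, size)          # then data round trip
```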

1.5 Metadata Service as the Bottleneck for Scale

Metadata operations dominate many workloads, especially for small files, and are harder to scale than data operations because of parent‑child dependencies in the directory tree.

2. Metadata‑Service Modeling and Analysis

2.1 Abstract Model

Operations are divided into reads and writes. All writes must perform an "associated change" – an atomic update of the parent directory’s attributes when a child is created or deleted. Example: updating the parent’s modification time enables efficient client caching.

Rename is the most complex write because it may involve two parent directories and additional POSIX constraints.
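A rough sketch of this model, using a toy in-memory transaction (names and fields are illustrative, not CFS's schema), shows how a single write bundles the new record with the parent's associated change:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Txn:
    """Toy transaction: buffers writes, then applies them all at once."""
    store: dict
    writes: dict = field(default_factory=dict)

    def put(self, key, value) -> None:
        self.writes[key] = value

    def commit(self) -> None:
        self.store.update(self.writes)   # all-or-nothing in this toy model

def create_entry(store: dict, parent_inode: int, parent_attr: dict,
                 name: str, entry: dict) -> None:
    """A metadata write bundles the new record with the parent's
    'associated change' so readers never see one without the other."""
    txn = Txn(store)
    txn.put((parent_inode, name), entry)              # the new child record
    txn.put((parent_inode, "ATTR"),
            dict(parent_attr,
                 mtime=time.time(),                   # associated change:
                 child_count=parent_attr["child_count"] + 1))
    txn.commit()

# rename is the hardest write: it may bundle associated changes for *two*
# parent directories into one transaction, on top of extra POSIX rules
# (e.g. a directory must not be moved beneath one of its own descendants).
```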

2.2 Evaluation Metrics

Scalability – size scalability (number of directory entries) and performance scalability (QPS growth with added nodes).

Latency – per‑request response time for reads and writes.

Balance – ability to distribute hot spots evenly across nodes.

2.3 Evolution of Metadata Architectures

Single‑point (HDFS, GFS) – no scalability or balance, low latency at small load.

Coupled distributed (CephFS, HDFS Federation) – sharding by hash or subtree (contrasted in the sketch after this list) scales to hundreds of billions of files, but directories cannot be migrated online and hot spots cause load imbalance.

Separated architecture (using distributed KV/NewSQL) – leverages mature transaction systems (Paxos/Raft) to achieve linear scalability and balance; exemplified by Facebook’s Tectonic.
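The two coupled-sharding strategies can be contrasted with toy placement functions; the shard count, hash choice, and paths below are arbitrary:

```python
import hashlib

NUM_SHARDS = 16

def shard_by_hash(parent_inode: int, name: str) -> int:
    """Hash placement: spreads entries evenly, but a directory's children
    scatter across shards, so listing or renaming a directory touches many
    nodes."""
    h = hashlib.md5(f"{parent_inode}/{name}".encode()).hexdigest()
    return int(h, 16) % NUM_SHARDS

def shard_by_subtree(path: str, subtree_owners: dict) -> int:
    """Subtree placement: the longest matching prefix owns the entry, which
    keeps a directory local but concentrates hot directories on one node."""
    best = max((p for p in subtree_owners if path.startswith(p)), key=len)
    return subtree_owners[best]

print(shard_by_hash(7, "a.txt"))
print(shard_by_subtree("/logs/2023/app.log", {"/": 0, "/logs": 3, "/home": 5}))
```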

3. CFS Metadata‑Service Architecture

3.1 Key Challenges

In cloud environments, large critical sections (locks) severely degrade performance. CFS aims to shrink lock granularity to the level of a metadata shard, approaching a lock‑free design.

3.2 Example: Creating a File

Read and lock the parent directory.

Insert the new file’s record.

Update the parent directory’s attributes (associated change).

Unlock the parent and commit.

Without the lock, concurrent creations could corrupt the child‑count, leading to orphan directories.
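Written out as a toy Python sketch with a per-directory lock (all names are illustrative), the four steps make the size of the critical section obvious:

```python
import threading
import time

dir_locks = {}   # inode -> lock guarding that directory's metadata
table = {}       # (parent_inode, name) -> child record; (inode, "ATTR") -> attrs

def create_file_locked(parent_inode: int, name: str) -> None:
    """Naive create: the critical section covers the whole operation,
    which is exactly the cost CFS wants to eliminate."""
    lock = dir_locks.setdefault(parent_inode, threading.Lock())
    with lock:                                            # 1. read and lock parent
        parent = table[(parent_inode, "ATTR")]
        table[(parent_inode, name)] = {"type": "file"}    # 2. insert child record
        parent["child_count"] += 1                        # 3. associated change
        parent["mtime"] = time.time()
    # 4. lock released, operation committed

table[(1, "ATTR")] = {"child_count": 0, "mtime": 0.0}     # seed a root directory
create_file_locked(1, "a.txt")
```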

3.3 Lock‑Free Redesign

CFS separates each directory entry into an id record (for path lookup) and an attr record (for attributes). By colocating a parent’s attr with its children’s id in the same shard, the conflict range is reduced to a single shard.

File attributes are moved from the metadata service to the data service, relieving pressure on the metadata tier.

3.4 Optimizing Data Layout

Both id and attr records are stored in a single table. The primary key is <parent_inode, name> for an id record and <inode, /ATTR> for an attr record. Because the sentinel name /ATTR can never collide with a real file name, a parent directory's attr record and its children's id records sort contiguously and are sharded together.
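A minimal sketch of this key layout (the exact encoding used by CFS may differ) shows why the two record kinds end up adjacent in sort order:

```python
ATTR_SENTINEL = "/ATTR"   # '/' cannot appear in a file name, so no collision

def id_key(parent_inode: int, name: str) -> tuple:
    """Path-lookup record: child of `parent_inode` named `name`."""
    return (parent_inode, name)

def attr_key(inode: int) -> tuple:
    """Attribute record of directory `inode` itself."""
    return (inode, ATTR_SENTINEL)

# A directory's own attr row and all of its children's id rows share the
# same first key component, so they sort next to each other and fall into
# the same range shard:
rows = sorted([attr_key(7), id_key(7, "a.txt"), id_key(7, "b.txt"), id_key(8, "x")])
print(rows)
# [(7, '/ATTR'), (7, 'a.txt'), (7, 'b.txt'), (8, 'x')]
```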

3.5 Single‑Shard Primitive

When a transaction involves only one participant, it can be reduced from two‑phase commit (2PC) to one‑phase commit (1PC). CFS further optimizes by applying two field‑level mechanisms, illustrated in the sketch after this list:

Delta Apply – merges concurrent additive updates (e.g., incrementing child count).

Last‑Writer‑Win – overwrites with the most recent value for assignment‑type fields (e.g., timestamps).
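Assuming an illustrative attribute schema, the two merge rules can be sketched as a field-level apply function:

```python
def apply_field_updates(current: dict, updates: list) -> dict:
    """Field-level merge of concurrent single-shard updates.
    - delta fields (e.g. child_count) accumulate, so two '+1's give +2;
    - last-writer-win fields (e.g. mtime) keep only the newest value.
    Field names here are illustrative, not CFS's actual schema."""
    merged = dict(current)
    for kind, fname, value in updates:
        if kind == "delta":
            merged[fname] = merged.get(fname, 0) + value
        elif kind == "lww":
            merged[fname] = value        # a later update simply overwrites
    return merged

# Two concurrent creates under the same directory:
attrs = {"child_count": 10, "mtime": 1000.0}
updates = [("delta", "child_count", 1), ("lww", "mtime", 1001.0),
           ("delta", "child_count", 1), ("lww", "mtime", 1002.0)]
print(apply_field_updates(attrs, updates))
# {'child_count': 12, 'mtime': 1002.0}
```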

3.6 Removing the Metadata Proxy Layer

After layout and primitive optimizations, the metadata proxy becomes unnecessary. Except for a slow‑path rename service, all other requests are sent directly from the client to the underlying distributed KV store (TafDB).

3.7 Overall CFS Architecture

The client library (ClientLib) splits operations into four categories:

File‑data semantics – handled by the FileStore data service.

File‑attribute semantics – also handled by FileStore.

Namespace semantics – sent directly to TafDB.

Rename semantics – processed by a dedicated rename service for complex cases.

TafDB is Baidu’s self‑developed distributed KV system used by both CFS and the object store BOS.
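A toy routing sketch of ClientLib's dispatch (method and service names are assumptions, not the real CFS API) summarizes the four-way split:

```python
class ClientLib:
    """Illustrative routing only; the real ClientLib interface may differ."""
    def __init__(self, file_store, tafdb, rename_service):
        self.file_store = file_store          # data service
        self.tafdb = tafdb                    # distributed KV holding the namespace
        self.rename_service = rename_service  # slow path for complex renames

    def dispatch(self, op: str, request: dict):
        if op in ("read", "write"):                      # file-data semantics
            return self.file_store.handle_data(request)
        if op in ("getattr", "setattr"):                 # file-attribute semantics
            return self.file_store.handle_attr(request)  # file attrs live in FileStore
        if op == "rename":                               # complex cases -> rename service
            return self.rename_service.handle(request)
        # create/delete/lookup/readdir/... are namespace semantics and go
        # straight to TafDB, with no metadata proxy in between
        return self.tafdb.execute(request)
```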

3.8 Test Results

Benchmarks show that both the read and write paths scale to millions of QPS, demonstrating that the lock‑free design and single‑shard primitives carry file‑system workloads to a scale earlier metadata architectures could not reach.

Q&A Highlights

Dynamic directory migration is supported; TafDB automatically splits, merges, and balances shards without service interruption.

Recursive delete is not part of POSIX; clients must enumerate and delete entries individually.

Metadata sharding rules collocate parent attr with child id using the /ATTR sentinel, ensuring conflicts stay within a single shard.

NVMe SSD optimizations exist at the storage‑engine layer (TafDB) but are not visible in the metadata design.

For pre‑generated billions of small files, CephFS is a common open‑source choice, though its dynamic balancing is limited.

Deletion is a soft delete in TafDB; a background GC reclaims space, but no user‑visible recycle‑bin feature is currently exposed.

— End of presentation —

Tags: scalability, cloud storage, distributed file system, metadata service, lock-free design
Written by

Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.
