Design and Architecture of Baidu CFS Large‑Scale Distributed File System and Metadata Service
This DataFun Summit 2023 talk explains how Baidu's CFS builds a trillion‑file‑scale distributed file system: it revisits file system fundamentals, POSIX limitations, and historical storage architectures, then introduces a lock‑free metadata service built on single‑shard primitives, data‑layout optimizations, and a simplified client‑centric architecture that achieves high scalability and performance.
This article is compiled from a talk given at the DataFun Summit 2023 "Large‑Scale Storage Architecture Forum" and presents the design of Baidu's CFS (Cloud File System).
1. Basic Concepts of File Systems
1.1 What Is a File System
File storage is the everyday way we store data on phones and computers; a file system provides a hierarchical directory tree that users interact with via file browsers. The POSIX standard defines the required interfaces but was created before large‑scale distributed file systems existed, leaving gaps such as multi‑machine consistency.
In practice, no implementation achieves 100% POSIX compatibility, so systems make different trade‑offs: HDFS heavily modifies POSIX semantics, while local file systems and other distributed systems deviate to varying degrees. The analysis in this talk nonetheless applies to all distributed file systems.
1.2 POSIX Specifications
POSIX defines three categories of interfaces: file operations (open, read, write, close), directory tree operations (create, delete, list), and attribute operations (chmod, chown, timestamps). These are collectively called metadata operations.
Two notable POSIX points are: (1) close does not guarantee data persistence, which can cause visibility issues in distributed systems; the industry often adopts a "close‑to‑open" (CTO) consistency model. (2) POSIX does not require directory listings to be sorted, though tools like ls sort results internally.
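The close‑to‑open model mentioned above can be illustrated with a minimal sketch. The `Server` and `CTOClient` classes below are hypothetical toy types, not part of any real NFS implementation; they only show the visibility rule: a writer's changes become visible to another client only after the writer closes the file and the reader re‑opens it.

```python
class Server:
    """Authoritative copy of each file's contents."""
    def __init__(self):
        self.files = {}

class CTOClient:
    """Toy client enforcing close-to-open (CTO) consistency:
    data is fetched from the server on open() and pushed back
    on close(); reads and writes in between hit a local cache."""
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, path):
        # Revalidate on open: pull the latest server copy.
        self.cache[path] = self.server.files.get(path, b"")

    def write(self, path, data):
        # Buffered locally; not yet visible to other clients.
        self.cache[path] = data

    def read(self, path):
        return self.cache[path]

    def close(self, path):
        # Flush on close: other clients see it after their next open.
        self.server.files[path] = self.cache.pop(path)

server = Server()
writer, reader = CTOClient(server), CTOClient(server)

writer.open("/a")
reader.open("/a")
writer.write("/a", b"new")
stale = reader.read("/a")   # reader opened before the close: old data
writer.close("/a")
reader.open("/a")           # re-open after the writer's close
fresh = reader.read("/a")
```

Under CTO the reader is never promised intermediate writes, which is exactly the relaxation that lets distributed implementations avoid per‑write synchronization.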
1.3 History of File Systems
Three development stages are described: the single‑machine era (e.g., EXT4, XFS), the dedicated‑hardware era (remote disks, iSCSI/NVMeoF, NAS, parallel file systems), and the software‑defined era where commodity servers and distributed transaction systems (Paxos, Raft) enable massive scale.
1.4 Modern Distributed File System Components
A modern system consists of three parts: client, metadata service, and data service. Clients provide entry points (SDK, NFS/SMB, or system‑call interception). Metadata services maintain hierarchy and attributes, focusing on scalability (both size and performance). Data services store file contents and handle layout, redundancy, and performance trade‑offs.
2. Modeling and Analyzing Metadata Service
2.1 Classification of Metadata Operations
Metadata operations are divided into reads (lookup, getattr, directory traversal) and writes (create, delete, rename). Writes must perform "associated changes" atomically, such as updating a parent directory's attributes when a child is added.
Rename is the most complex write because it may involve two parent directories and must avoid cycles and orphan nodes.
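A small in‑memory sketch makes the two hazards concrete. The `Node` tree and helper names below are illustrative assumptions: the rename touches two parent directories (two sets of associated changes that must commit atomically) and must reject moving a directory under its own subtree, which would create a cycle and orphan the subtree.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = {}   # name -> Node
        self.nlink = 0       # child count (an "associated" attribute)

def is_ancestor(a, b):
    """True if a is b or an ancestor of b (moving a under b would cycle)."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False

def rename(node, new_parent, new_name):
    if is_ancestor(node, new_parent):
        raise ValueError("rename would create a cycle")
    old_parent = node.parent
    # The four changes below touch two parent directories and must
    # commit atomically in a real metadata service.
    del old_parent.children[node.name]
    old_parent.nlink -= 1
    new_parent.children[new_name] = node
    new_parent.nlink += 1
    node.name, node.parent = new_name, new_parent

root = Node("/")
a = Node("a", root); root.children["a"] = a; root.nlink += 1
b = Node("b", a);    a.children["b"] = b;    a.nlink += 1

rename(b, root, "b")            # move /a/b -> /b: two parents updated
try:
    rename(root, b, "loop")     # moving / under /b must fail
except ValueError as e:
    print(e)
```

The cycle check is cheap on a single machine but requires care in a sharded service, which is part of why rename stays on a slow path in designs like CFS.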
2.2 Metrics for Metadata Service
Key metrics include scalability (number of directory entries the system can store), performance scalability (QPS growth with added nodes), latency per request, and load‑balancing (ability to spread hot spots across nodes).
2.3 Evolution of Metadata Architectures
Three stages: (1) single‑point metadata (HDFS, GFS) – low latency but no scalability; (2) coupled distributed architecture (CephFS, HDFS Federation) – improves scale but ties data and metadata, making online migration hard; (3) separated architecture using distributed KV/NewSQL (e.g., Facebook Tectonic) – achieves trillion‑file scale but write performance can suffer.
3. CFS Metadata Service Architecture
The CFS design builds on the separated architecture and introduces lock‑free techniques.
3.1 Key Problems
Locking an entire directory for the duration of each operation creates large critical sections that degrade performance; shrinking the lock scope is essential.
3.2 Example of File Creation
1. Read and lock the target directory.
2. Insert the new file record.
3. Update the parent directory's attributes (the associated change).
4. Unlock and commit.
Without locking, concurrent creations can corrupt the "children" count, leading to orphan directories.
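The lost update can be sketched directly. The classes below are toy illustrations, not CFS code: the locked path follows the four steps above and keeps the child count consistent under concurrency, while the unlocked path deterministically interleaves two creations so that both read the same count and one increment is lost.

```python
import threading

class Directory:
    def __init__(self):
        self.children = {}
        self.nlink = 0
        self.lock = threading.Lock()

def create_locked(d, name):
    # Steps 1-4 from the text: lock, insert, update parent, unlock.
    with d.lock:
        d.children[name] = object()   # insert the new file record
        d.nlink += 1                  # associated change on the parent

def lost_update_demo(d, names):
    # Deterministic interleaving of two unlocked creations:
    # both read the same nlink, then both write back read + 1.
    read1 = d.nlink
    read2 = d.nlink
    d.children[names[0]] = object(); d.nlink = read1 + 1
    d.children[names[1]] = object(); d.nlink = read2 + 1

locked = Directory()
threads = [threading.Thread(target=create_locked, args=(locked, f"f{i}"))
           for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()

racy = Directory()
lost_update_demo(racy, ["x", "y"])
# racy now has 2 children but nlink == 1: the corrupted count the
# text warns about, which can later manifest as orphan directories.
```
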
3.3 CFS Core Idea – Lock‑Free Transformation of Coupled Architecture
CFS shrinks the conflict range to a single metadata shard by (1) co‑locating a directory's attribute record with its children's ID records, (2) storing both record types in the same table, with keys of the form <parent_inode, name> for child‑ID records and <inode, /_ATTR> for attribute records, and (3) splitting shards on inode‑ID ranges so related records stay together.
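A minimal sketch of this key layout, with an assumed encoding (the tuple keys, reserved `/_ATTR` name, and range table are illustrative, not the actual TafDB schema): because every record for a directory shares the same leading inode ID, they sort adjacently and fall inside one ID‑range shard, so a create touches exactly one shard.

```python
ATTR_NAME = "/_ATTR"   # reserved name; "/" sorts before regular names

def attr_key(inode):
    """Key of the attribute record of directory `inode`."""
    return (inode, ATTR_NAME)

def entry_key(parent_inode, child_name):
    """Key of a child-ID record: maps (parent, name) -> child inode."""
    return (parent_inode, child_name)

def shard_of(key, shard_ranges):
    """Shards split on inode-ID ranges, so every key with the same
    leading inode lands on the same shard."""
    inode = key[0]
    for lo, hi, shard in shard_ranges:
        if lo <= inode < hi:
            return shard
    raise KeyError(inode)

shards = [(0, 1000, "shard-A"), (1000, 2000, "shard-B")]
dir_inode = 42
keys = [attr_key(dir_inode),
        entry_key(dir_inode, "a.txt"),
        entry_key(dir_inode, "b.txt")]

# The attribute record sorts first among the directory's records,
# and all of them map to a single shard.
owners = {shard_of(k, shards) for k in keys}
```

With all associated records on one shard, the "associated change" of a create no longer needs a cross‑shard transaction.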
3.4 Optimizing Data Layout
File attributes are moved from the metadata service to the data service, reducing metadata load and leveraging the larger data‑service scale.
3.5 Single‑Shard Primitives
When a transaction involves a single participant, it can be reduced from 2‑phase commit (2PC) to 1‑phase commit (1PC). CFS further classifies modifications into additive (e.g., incrementing child count) and overwrite (e.g., timestamp) and implements two mechanisms: Delta Apply (merge additive updates) and Last‑Writer‑Win (overwrite with latest value).
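The two primitives can be sketched as follows. The `Record` type and function names are assumptions for illustration: additive fields are merged by applying commutative deltas, so concurrent increments need no read‑modify‑write coordination, while overwrite fields keep only the value with the newest timestamp.

```python
class Record:
    def __init__(self):
        self.nlink = 0          # additive field (e.g., child count)
        self.mtime = (0, None)  # (timestamp, value): overwrite field

def apply_delta(rec, delta):
    """Delta Apply: additive updates commute, so concurrent
    increments can be merged in any order."""
    rec.nlink += delta

def apply_lww(rec, ts, value):
    """Last-Writer-Win: keep only the newest-timestamped value."""
    if ts >= rec.mtime[0]:
        rec.mtime = (ts, value)

rec = Record()
# Two concurrent creates under the same directory: order is irrelevant.
apply_delta(rec, +1)
apply_delta(rec, +1)
# Two concurrent timestamp updates: the later one wins regardless of
# arrival order; the stale one is discarded.
apply_lww(rec, ts=5, value="12:00:05")
apply_lww(rec, ts=3, value="12:00:03")
```

Because both primitives resolve conflicts without holding locks across requests, a single‑shard modification can commit in one phase.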
3.6 Removing the Metadata Proxy Layer
After layout and primitive optimizations, the proxy layer becomes unnecessary; most requests are sent directly from the client to the underlying distributed KV store (TafDB), except for slow‑path rename operations.
3.7 Overall CFS Architecture
The client library handles most POSIX semantics and forwards four types of operations: file data, file attributes (both to the FileStore service), slow‑path rename (to a dedicated rename service), and remaining namespace operations (directly to TafDB). TafDB is Baidu’s self‑developed distributed KV system used by both CFS and BOS.
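A sketch of the client‑side routing described above. The operation names and the routing table are illustrative assumptions that mirror the four paths in the text; only the service names (FileStore, rename service, TafDB) come from the source.

```python
# Hypothetical request router: file data and file attributes go to
# FileStore, slow-path rename to a dedicated service, and remaining
# namespace operations directly to TafDB.
ROUTES = {
    "read":    "FileStore",      # file data
    "write":   "FileStore",
    "getattr": "FileStore",      # file attributes live with the data
    "setattr": "FileStore",
    "rename":  "RenameService",  # slow path
    "create":  "TafDB",          # namespace ops hit the KV store directly
    "unlink":  "TafDB",
    "lookup":  "TafDB",
    "readdir": "TafDB",
}

def dispatch(op):
    """Return the backend service handling a POSIX-level operation."""
    return ROUTES[op]
```

The absence of a proxy tier in this table is the point of section 3.6: after the layout and primitive optimizations, only rename still needs an intermediary.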
3.8 Test Results
Benchmarks show that both read and write workloads scale to millions of QPS, demonstrating the effectiveness of the lock‑free design.
Q&A Highlights
• CFS supports dynamic directory migration via TafDB's automatic shard splitting and merging.
• Recursive delete is not a POSIX operation; clients must enumerate and delete entries individually.
• The metadata layout uses <id, name> keys with a reserved /_ATTR name for attribute records, ensuring a parent's attributes and its children's ID records stay in the same shard.
• NVMe SSD optimizations exist at the engine layer (TafDB runs on NVMe).
• For pre‑generated billions of small files, CephFS is a common open‑source choice, though its dynamic balancing is limited.
For more details, refer to Baidu Smart Cloud’s official blog and the EuroSys 2023 paper.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.