Design and Analysis of 3FS: An AI‑Optimized Distributed File System
The article provides a comprehensive overview of 3FS, an AI‑focused distributed file system that leverages FoundationDB for metadata, CRAQ for chunk replication, and a hybrid FUSE/native client architecture, detailing its design, components, fault handling, and performance characteristics for large‑scale training workloads.
3FS Design Document
DeepSeek AI released the 3FS project in February 2025, attracting strong industry interest due to its AI‑centric storage capabilities. Ant Group’s Data Intelligence team conducted an early investigation, analyzing design documents and source code to explain how 3FS supports AI workloads such as training data preprocessing, dataset loading, checkpointing, KVCache for inference, and embedding vector search.
System Architecture
Overall Architecture
Core Components
Cluster Manager – manages cluster state and fail‑over (uses ZooKeeper or etcd for leader election).
Metadata Service – stores file metadata in an external FoundationDB cluster, relying on its strong consistency.
Storage Service – stores data chunks; replication follows the CRAQ protocol to guarantee high availability.
Client – a FUSE client serves latency‑insensitive workloads, while a native client serves performance‑critical applications.
External Dependencies
ClickHouse – stores generated metrics.
FoundationDB – provides transactional metadata storage.
ZooKeeper/etcd – support the Cluster Manager’s multi‑replica leader election.
MetaService (Metadata Service)
Design
Similar to ADLS, 3FS stores metadata in FoundationDB using SSI (serializable snapshot isolation) transactions, which provide serializable isolation. All directory‑tree operations execute as transactions, eliminating consistency anomalies such as rename cycles.
Implementation
Metadata operations use FoundationDB transactions:
Read‑only transactions for queries: fstat , lookup , listdir , etc.
Read‑write transactions for updates: create , link , unlink , rename , etc.
Conflicts automatically trigger transaction retries, simplifying error handling.
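The retry-on-conflict pattern above can be sketched with a toy optimistic-concurrency KV store. This is an illustration of the pattern only, not FoundationDB's actual client API; the key names and the `rename` helper are hypothetical.

```python
import itertools

class ConflictError(Exception):
    """Raised when a transaction's read set changed before commit."""

class VersionedKV:
    """Toy optimistic-concurrency KV store illustrating the retry-on-conflict
    pattern 3FS relies on; FoundationDB provides the real implementation."""

    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> commit counter
        self._clock = itertools.count(1)

    def run(self, txn_fn, max_retries=5):
        """Run txn_fn(read, write) and retry automatically on conflict."""
        for _ in range(max_retries):
            read_versions = {}
            writes = {}

            def read(key):
                read_versions[key] = self.versions.get(key, 0)
                return writes.get(key, self.data.get(key))

            def write(key, value):
                writes[key] = value

            txn_fn(read, write)
            # Validate: abort and retry if any key we read was committed since.
            if any(self.versions.get(k, 0) != v for k, v in read_versions.items()):
                continue  # conflict -> transparent retry
            tick = next(self._clock)
            for k, v in writes.items():
                self.data[k] = v
                self.versions[k] = tick
            return
        raise ConflictError("too many retries")

# Hypothetical rename: move a dentry under a new parent in one transaction.
kv = VersionedKV()
kv.data["DENT/1/old"] = "ino=42"

def rename(read, write):
    entry = read("DENT/1/old")
    write("DENT/1/old", None)   # tombstone the old entry
    write("DENT/2/new", entry)  # create it under the new parent

kv.run(rename)
```

Because validation and retry live in one place, the metadata operations themselves stay simple straight-line code, which is the error-handling simplification the design claims.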
Analysis
The approach mirrors systems such as HopsFS, Tectonic, and Baidu CFS, moving complexity to the underlying distributed KV store and enabling stateless metadata services.
Chunk Storage Service
Data is striped similarly to Ceph. When a file is created, a configurable number of chains (CRAQ replication groups) is allocated. Each chunk’s ID and location are deterministic: chunk_id = {inode}{chunk_index} , and the chain is selected via a shuffle seed and a chain table stored in MetaService.
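The deterministic placement can be sketched as follows. The ID format and the round-robin-over-a-shuffled-table scheme are illustrative assumptions that match the description above, not 3FS's exact production algorithm.

```python
import random

def chunk_id(inode: int, chunk_index: int) -> str:
    # Deterministic chunk ID: inode plus chunk index, as in the design.
    return f"{inode:016x}-{chunk_index:08x}"

def chain_for_chunk(chain_table, shuffle_seed, chunk_index):
    """Pick the replication chain for a chunk deterministically.

    chain_table and shuffle_seed come from the file's inode metadata,
    so any client can recompute the same placement without asking
    MetaService per chunk.
    """
    shuffled = list(chain_table)
    random.Random(shuffle_seed).shuffle(shuffled)  # same seed -> same order
    return shuffled[chunk_index % len(shuffled)]

chains = ["chain-0", "chain-1", "chain-2", "chain-3"]
cid = chunk_id(0x2A, 3)
chain = chain_for_chunk(chains, shuffle_seed=7, chunk_index=3)
```

The design point is that placement is a pure function of inode metadata: once a client holds the inode, no further metadata round-trips are needed to locate any chunk.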
Data Structures
Metadata

dentry
Key: DENT{parent_ino}{name}
Value: {parent_ino}{ino}{chain_table}{chunk_size}{stripe_size}{owner}{permission}{atime}{mtime}{ctime}
Description: the DENT prefix groups a directory’s entries into one contiguous key range in FoundationDB.

inode
Key: INOD{ino}
Value: {file_length}{chunk_size}{selected_range_in_chain_table}{shuffle_seed}{owner}{permission}{atime}{mtime}{ctime}{target_path (for symlinks)}
Description: little‑endian encoding of the inode ID spreads keys across shards.
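A minimal sketch of the two key encodings, assuming 8-byte inode IDs; the exact field widths are assumptions, but the two properties the layout buys are visible: prefix-grouped dentries and shard-scattered inodes.

```python
import struct

DENT_PREFIX = b"DENT"
INOD_PREFIX = b"INOD"

def dentry_key(parent_ino: int, name: str) -> bytes:
    # Directory entries share the parent's prefix, so listdir becomes a
    # single contiguous range scan in FoundationDB.
    return DENT_PREFIX + struct.pack("<Q", parent_ino) + name.encode()

def inode_key(ino: int) -> bytes:
    # Little-endian packing puts the fast-changing low bits of sequential
    # inode IDs first, scattering neighbouring inodes across shards
    # instead of piling them onto one hot range.
    return INOD_PREFIX + struct.pack("<Q", ino)
```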
Dynamic File Attributes
Delayed reclamation of files deleted while write‑opened, using a delay_unlink prefix to track active clients.
Lazy file‑length updates: clients report the maximum write offset every few seconds; MetaService adopts the new length if no concurrent truncate occurs.
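The lazy length update reduces to a compare-then-adopt step, sketched below. The `truncate_version` field name and dict schema are illustrative assumptions standing in for the "no concurrent truncate" condition described above.

```python
def apply_length_report(meta, reported_max_offset, truncate_version_seen):
    """Adopt a client-reported max write offset as the new file length.

    meta holds 'length' and 'truncate_version'; the version check is a
    sketch of the 'no concurrent truncate' condition, not 3FS's actual
    schema.
    """
    if meta["truncate_version"] != truncate_version_seen:
        return False  # a truncate raced with this report; discard it
    if reported_max_offset > meta["length"]:
        meta["length"] = reported_max_offset
    return True

meta = {"length": 4096, "truncate_version": 1}
apply_length_report(meta, 8192, truncate_version_seen=1)   # adopted
apply_length_report(meta, 99999, truncate_version_seen=0)  # stale, ignored
```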
Fault Handling
Each chain maintains a version number visible to clients. When a node fails or a lease expires, the Cluster Manager increments the chain version; clients with stale versions are rejected, providing a form of fencing.
Chunk and chain versions are managed separately: chunk versions are internal to CRAQ, while chain versions are exposed for client‑side consistency.
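The fencing behaviour of the exposed chain version can be sketched in a few lines; class and exception names are illustrative, not 3FS identifiers.

```python
class StaleChainVersion(Exception):
    pass

class ChainState:
    """Server-side view of one replication chain (illustrative sketch)."""

    def __init__(self, version=1):
        self.version = version

    def bump(self):
        # The Cluster Manager increments the version on node failure or
        # lease expiry, fencing out clients with the old chain layout.
        self.version += 1

    def check(self, client_version):
        if client_version != self.version:
            raise StaleChainVersion(
                f"client has v{client_version}, chain is at v{self.version}"
            )

chain = ChainState()
chain.check(1)          # up-to-date client: accepted
chain.bump()            # a node failed; chain version is now 2
try:
    chain.check(1)      # stale client is rejected, must refresh topology
    fenced = False
except StaleChainVersion:
    fenced = True
```

Rejecting on any mismatch (rather than on `client_version < version`) forces every client through a topology refresh after failover, which is what makes the version an effective fence.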
MetaService Lifecycle
MetaService is stateless; if its lease expires or it crashes, the Cluster Manager removes it and clients retry against another MetaService instance, updating topology as needed.
Data Node Recovery
When a data node leaves, its chains are marked as failed, a new target is added, and the system syncs missing chunks before the node rejoins the cluster.
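The catch-up step amounts to a set difference over per-chunk versions, sketched below; the version-map representation is an assumption, not 3FS's actual recovery protocol.

```python
def chunks_to_sync(healthy_replica, rejoining_node):
    """Compute which chunks a rejoining node must fetch before serving.

    Each argument maps chunk_id -> chunk_version; a chunk needs syncing
    when the rejoining node lacks it or holds an older version.
    """
    return sorted(
        cid for cid, ver in healthy_replica.items()
        if rejoining_node.get(cid, -1) < ver
    )

healthy = {"c1": 3, "c2": 1, "c3": 2}
stale   = {"c1": 3, "c2": 0}              # c2 is outdated, c3 is missing
missing = chunks_to_sync(healthy, stale)  # -> ["c2", "c3"]
```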
Fuse and Native Client
3FS uses a user‑space libfuse daemon for control operations (open, close, stat), while data paths use shared‑memory rings over InfiniBand (the “iov” buffers and “ior” I/O rings of its USRBIO interface) for zero‑copy transfers. This design avoids kernel‑space copies and enables high‑throughput RDMA communication.
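The ring mechanism can be pictured as a submission queue that the application fills and the daemon polls. Below is a minimal single-producer/single-consumer ring; the slot layout and dict-shaped request are illustrative assumptions, since the real rings live in shared memory and reference RDMA-registered buffers.

```python
class IoRing:
    """Minimal SPSC ring sketching the shared-memory submission ring
    polled by the application and the user-space daemon."""

    def __init__(self, slots=8):
        self.buf = [None] * slots
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer fills

    def submit(self, entry):
        if self.tail - self.head == len(self.buf):
            return False  # ring full; caller retries or backs off
        self.buf[self.tail % len(self.buf)] = entry
        self.tail += 1
        return True

    def poll(self):
        if self.head == self.tail:
            return None  # ring empty
        entry = self.buf[self.head % len(self.buf)]
        self.head += 1
        return entry

ring = IoRing(slots=4)
ring.submit({"op": "read", "chunk": "c1", "off": 0, "len": 4096})
req = ring.poll()
```

Because requests and completions travel through shared memory rather than the FUSE kernel path, bulk data never crosses the kernel boundary, which is where the zero-copy claim comes from.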
Training‑Related Workloads
3FS accelerates checkpointing by splitting model and optimizer files into small chunks, asynchronously moving data from GPU to CPU, and overlapping metadata processing with I/O. Reported single‑node throughput reaches 10 GB/s, with checkpoint intervals as low as five minutes.
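The chunk-and-overlap idea can be sketched with a background writer thread: while one chunk is being written to the file, the next is already being produced. This is a toy pipeline under stated assumptions; real 3FS checkpointing moves data GPU-to-CPU asynchronously and issues RDMA writes, which plain file I/O stands in for here.

```python
import queue
import tempfile
import threading

def checkpoint(tensor_bytes: bytes, path: str, chunk_size: int = 1 << 20):
    """Write a checkpoint as fixed-size chunks, overlapping chunk
    production (standing in for the device-to-host copy) with file I/O
    via a background writer thread."""
    q = queue.Queue(maxsize=4)  # bounded: producer can't run far ahead

    def writer():
        with open(path, "wb") as f:
            while (chunk := q.get()) is not None:
                f.write(chunk)  # overlaps with the producer slicing chunks

    t = threading.Thread(target=writer)
    t.start()
    for off in range(0, len(tensor_bytes), chunk_size):
        q.put(tensor_bytes[off:off + chunk_size])
    q.put(None)  # sentinel: no more chunks
    t.join()

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    ckpt_path = tmp.name
checkpoint(b"x" * (3 * 1024 * 1024 + 17), ckpt_path, chunk_size=1 << 20)
```

Splitting into many small chunks also lets the chunks land on different chains, so a single node's checkpoint write fans out across the cluster instead of saturating one storage target.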
Summary
3FS is an AI‑focused, read‑optimized storage system that co‑designs file system, network, and training framework layers. Its strengths include zero‑copy RDMA data paths, metadata scalability via FoundationDB, and support for concurrent writes and overwrites, making it well‑suited for large‑scale model training.
AntData
Ant Data leverages Ant Group's leading technological innovation in big data, databases, and multimedia, with years of industry practice. Through long-term technology planning and continuous innovation, we strive to build world-class data technology and products.