Inside 3FS: How DeepSeek’s Parallel File System Powers AI Training
This article dives deep into DeepSeek's 3FS parallel file system, detailing its four-component architecture, RDMA‑based high‑speed networking, client options, metadata and storage services, replication protocols, dynamic stripe sizing, and recovery mechanisms that enable efficient AI model training and inference.
3FS Overall Architecture
Similar to many distributed file systems, 3FS consists of four parts: Cluster Manager, Client, Meta Service, and Storage Service, all interconnected via RDMA (InfiniBand) for high‑speed communication.
Cluster Manager
Acts as the central controller, handling node management with multi‑node hot‑standby for high availability, using FoundationDB‑based leader election, and coordinating heartbeats and client state recovery.
Client
Provides two access methods: a user‑friendly FUSE client (hf3fs_fuse) supporting POSIX interfaces with lower performance, and a native USRBIO client offering SDK‑style integration and 3‑5× higher throughput by eliminating kernel‑user context switches and using zero‑copy RDMA buffers.
Meta Service
Offers metadata services with a compute‑storage separation design, persisting metadata in FoundationDB transactions to support POSIX directory semantics; the service is stateless and horizontally scalable.
Storage Service
Delivers data storage using an integrated compute‑storage model: each node manages local SSDs, stores three replicas via the CRAQ chain‑replication protocol (write‑all‑read‑any), and distributes data chunks across nodes for load balancing.
Cluster Management Details
A 3FS cluster runs one or more manager nodes (mgmtd), with a single elected leader and the rest on hot standby. Nodes report heartbeats and lease information to the leader, which persists node metadata in FoundationDB so that it survives leader switches.
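To make the lease mechanics concrete, here is a minimal Python sketch of lease-based liveness tracking. The MgmtdLeader class, the 60-second TTL, and the node names are illustrative assumptions, and a plain dict stands in for the FoundationDB-backed node table.

```python
import time

LEASE_TTL = 60.0  # hypothetical lease length in seconds

class MgmtdLeader:
    """Toy model of the mgmtd leader's liveness tracking. Real 3FS
    persists node metadata in FoundationDB so a newly elected leader
    can rebuild this table after failover; a dict stands in here."""

    def __init__(self):
        self.leases = {}  # node_id -> lease expiry timestamp

    def on_heartbeat(self, node_id: str) -> None:
        # Each heartbeat extends the sender's lease.
        self.leases[node_id] = time.monotonic() + LEASE_TTL

    def offline_nodes(self) -> list:
        # Nodes whose lease has lapsed are treated as offline and
        # become candidates for data recovery.
        now = time.monotonic()
        return [n for n, expiry in self.leases.items() if expiry < now]

leader = MgmtdLeader()
leader.on_heartbeat("storage-node-17")
print(leader.offline_nodes())  # [] while the lease is fresh
```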
Client Architecture
The FUSE client relies on the libfuse low‑level API with C++20 coroutines, while USRBIO uses shared‑memory ring buffers for zero‑copy, asynchronous communication, similar in spirit to DPDK or io_uring designs.
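The ring mechanics can be sketched in a few lines. The IoRing and IoEntry names below are hypothetical; a real implementation would keep the slots and head/tail counters in mapped shared memory and signal the daemon with semaphores rather than polling.

```python
from dataclasses import dataclass

@dataclass
class IoEntry:
    op: str         # "read" or "write"
    offset: int     # offset within the target file
    length: int
    buf_index: int  # index into the shared Iov data buffer

class IoRing:
    """Single-producer/single-consumer ring standing in for the
    shared-memory Ior mapped by both the app and the FUSE daemon.
    A real implementation keeps head/tail in shared memory and uses
    semaphores for wakeups; this shows only the index arithmetic."""

    def __init__(self, entries: int = 1024):
        self.slots = [None] * entries
        self.head = 0  # next slot the consumer (daemon) reads
        self.tail = 0  # next slot the producer (app) fills

    def submit(self, entry: IoEntry) -> bool:
        if self.tail - self.head == len(self.slots):
            return False  # ring is full, caller must wait
        self.slots[self.tail % len(self.slots)] = entry
        self.tail += 1    # in shared memory this would be a release-store
        return True

    def pop(self):
        if self.head == self.tail:
            return None   # ring is empty
        entry = self.slots[self.head % len(self.slots)]
        self.head += 1
        return entry

ring = IoRing(entries=4)
ring.submit(IoEntry("read", offset=0, length=1 << 20, buf_index=0))
print(ring.pop())
```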
USRBIO Implementation
Each USRBIO instance is backed by two shared‑memory files: an Iov file holding the data buffers and an Ior file holding the I/O rings. The shared‑memory files live in /dev/shm and are exposed through symlinks under /3fs-virt/iovs/. Three submit semaphore files, one per I/O priority, wake the FUSE daemon when new requests are enqueued.
Symlink “Black Magic”
Non‑standard operations are implemented via symlink handling, enabling actions like recursive delete, Iov/Ior creation, and config setting without custom ioctl tools.
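From the client side the pattern looks roughly like the sketch below. The mount point, virtual-path names, and command encoding here are illustrative guesses, not 3FS's actual control paths; the point is that an ordinary symlink() call carries the request and the FUSE daemon intercepts it instead of creating a real link.

```python
import os

# Hypothetical mount point and control-path names; 3FS's actual
# virtual-path layout is internal, so treat these as illustrative.
MOUNT = "/3fs/stage"

def register_iov(shm_name: str) -> None:
    # Registering a shared-memory buffer: the client symlinks its
    # /dev/shm file into the virtual iovs directory, and the FUSE
    # daemon handles the symlink() call instead of linking on disk.
    os.symlink(f"/dev/shm/{shm_name}",
               os.path.join(MOUNT, "3fs-virt", "iovs", shm_name))

def recursive_delete(target: str) -> None:
    # One plausible encoding of a command: a symlink whose target
    # names the directory to remove, created under a magic path the
    # daemon watches. The real command vocabulary differs.
    os.symlink(target, os.path.join(MOUNT, "3fs-virt", "rm-rf", "req-1"))
```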
FFRecord File Format
To mitigate small‑file performance issues, 3FS defines the FFRecord format, which merges many small files, supports random batch reads, and includes CRC32 checksums for data integrity.
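To show why an up-front index makes random batch reads cheap, here is a simplified FFRecord-style container in Python. The real on-disk layout differs, and the field order below is an assumption; only the shape of the idea (count, then offsets and CRC32 checksums, then raw record bytes) is what matters.

```python
import struct, zlib

def write_records(path, records):
    # Layout (simplified): count, then per-record (offset, crc32)
    # index entries, then the concatenated record bytes.
    header = struct.pack("<Q", len(records))
    base = len(header) + len(records) * 12  # 8-byte offset + 4-byte crc
    index, body = [], b""
    for rec in records:
        index.append((base + len(body), zlib.crc32(rec)))
        body += rec
    with open(path, "wb") as f:
        f.write(header)
        for off, crc in index:
            f.write(struct.pack("<QI", off, crc))
        f.write(body)

def read_batch(path, indices):
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        index = [struct.unpack("<QI", f.read(12)) for _ in range(n)]
        end = f.seek(0, 2)  # file size bounds the last record
        out = []
        for i in indices:
            off, crc = index[i]
            nxt = index[i + 1][0] if i + 1 < n else end
            f.seek(off)
            data = f.read(nxt - off)
            assert zlib.crc32(data) == crc, "corrupt record"
            out.append(data)
        return out

write_records("/tmp/demo.ffr", [b"sample-0", b"sample-1", b"sample-2"])
print(read_batch("/tmp/demo.ffr", [2, 0]))  # random-order batch read
```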
Storage Service Architecture
The storage service is designed for high throughput and scales linearly with aggregate SSD and network bandwidth. CRAQ chain replication provides reliability, while a dynamic stripe‑size mechanism spares small files from contacting storage nodes they do not need.
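A stripe-width policy of this kind might look like the following sketch. CHUNK_SIZE, MAX_STRIPE, and the sizing formula are assumptions rather than 3FS's actual parameters.

```python
CHUNK_SIZE = 512 << 10   # hypothetical 512 KiB chunk size
MAX_STRIPE = 16          # hypothetical cap on chains per file

def stripe_width(file_size: int) -> int:
    """Pick how many replication chains a file spreads across.

    A small file occupies only a few chunks, so touching every
    chain would be pure communication overhead; the width grows
    with the file until it reaches the cap. The real 3FS policy
    differs in detail, but this shows the shape of the idea.
    """
    chunks = max(1, -(-file_size // CHUNK_SIZE))  # ceiling division
    return min(chunks, MAX_STRIPE)

print(stripe_width(100 << 10))  # 1: a 100 KiB file touches one chain
print(stripe_width(100 << 20))  # 16: a 100 MiB file uses the full stripe
```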
Write and Read Workflows
Writes propagate from the client through the chain head to tail, with acknowledgments flowing back; reads can be served by any node in the chain after version checks, improving read parallelism.
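The flow can be traced with a toy chain. The Replica and Chain classes below are illustrative, and CRAQ's dirty/clean version bookkeeping is reduced to a comment.

```python
class Replica:
    """One storage target in a CRAQ replication chain."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk_id -> (version, data), committed state

class Chain:
    def __init__(self, names):
        self.nodes = [Replica(n) for n in names]

    def write(self, chunk_id, version, data):
        # Writes enter at the head and propagate node by node to the
        # tail; the acknowledgment then flows tail -> head, at which
        # point every replica holds the committed version.
        for node in self.nodes:  # head -> tail
            node.chunks[chunk_id] = (version, data)

    def read(self, chunk_id, replica_index):
        # Write-all-read-any: any replica can answer. If the chunk
        # were mid-write (dirty), CRAQ would first confirm the
        # committed version with the tail; that check is elided here.
        return self.nodes[replica_index].chunks[chunk_id]

chain = Chain(["head", "middle", "tail"])
chain.write("chunk-42", version=1, data=b"weights-shard")
print(chain.read("chunk-42", replica_index=1))  # read from any node
```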
Chunk Engine
Manages chunk files, space allocation, and chunk metadata (persisted in LevelDB/RocksDB), using copy‑on‑write updates and append‑only writes to avoid write amplification.
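A toy copy-on-write engine shows why readers are never blocked: new bytes land in freshly allocated space and the metadata pointer flips last. Class and field names are hypothetical; a dict stands in for the LevelDB/RocksDB table.

```python
class ChunkEngine:
    """Toy copy-on-write chunk store. A dict stands in for the
    LevelDB/RocksDB metadata table; a list of immutable blobs
    stands in for the on-disk chunk files."""

    def __init__(self):
        self.meta = {}   # chunk_id -> (version, blob_index)
        self.blobs = []  # append-only data area

    def write(self, chunk_id, data: bytes):
        # Copy-on-write: the new bytes land in freshly allocated
        # space; only then does the metadata pointer flip, so a
        # concurrent reader keeps seeing the old version.
        self.blobs.append(data)
        version = self.meta.get(chunk_id, (0, None))[0] + 1
        self.meta[chunk_id] = (version, len(self.blobs) - 1)

    def read(self, chunk_id):
        version, idx = self.meta[chunk_id]
        return version, self.blobs[idx]

engine = ChunkEngine()
engine.write("c1", b"v1")
engine.write("c1", b"v2")  # the old blob is never touched in place
print(engine.read("c1"))   # (2, b'v2')
```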
Data Recovery
When a storage node fails, it is marked offline; the recovery process fetches remote metadata, synchronizes missing chunks, and writes them to the recovered node using full‑chunk replacement, allowing concurrent writes and recovery.
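The catch-up pass can be sketched as a version comparison between the recovering node and a healthy replica. The function names and the shape of the version maps below are assumptions.

```python
def sync_missing_chunks(local_meta, remote_meta, fetch_chunk, store_chunk):
    """Sketch of the catch-up pass after a node rejoins its chains.

    local_meta / remote_meta map chunk_id -> version. Any chunk the
    recovering node lacks, or holds at a stale version, is fetched
    from a healthy replica and written back whole (full-chunk
    replacement), which keeps the copy logic simple even when the
    original writes were small.
    """
    for chunk_id, remote_version in remote_meta.items():
        if local_meta.get(chunk_id, -1) < remote_version:
            data = fetch_chunk(chunk_id)  # an RDMA read in real 3FS
            store_chunk(chunk_id, data)
            local_meta[chunk_id] = remote_version

# Toy usage: the node rejoined with chunk "b" missing and "a" stale.
remote = {"a": 3, "b": 1}
local = {"a": 2}
store = {}
sync_missing_chunks(local, remote,
                    fetch_chunk=lambda cid: f"payload-{cid}".encode(),
                    store_chunk=store.__setitem__)
print(local, store)
```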
Metadata Service
Built on FoundationDB, providing transactional KV storage with strong ACID guarantees; Meta Service translates POSIX operations into KV transactions, ensuring consistency and scalability.
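Using FoundationDB's real Python binding, a file-create might look like the sketch below. The key schema is hypothetical (3FS's actual encoding is internal), but the transactional shape is the point: the directory entry and the inode record are written atomically or not at all.

```python
import fdb

fdb.api_version(710)
db = fdb.open()  # requires a reachable FoundationDB cluster

@fdb.transactional
def create_file(tr, parent_ino: int, name: str, new_ino: int):
    # Hypothetical key layout: "dent:<parent>:<name>" -> child inode,
    # "inode:<ino>" -> attributes. Real 3FS uses its own encoding.
    dirent_key = b"dent:%d:%s" % (parent_ino, name.encode())
    if tr[dirent_key].present():
        raise FileExistsError(name)
    tr[dirent_key] = b"%d" % new_ino                          # directory entry
    tr[b"inode:%d" % new_ino] = b'{"type":"file","size":0}'   # inode record

# Both keys commit together; a conflicting create aborts and retries.
create_file(db, parent_ino=1, name="model.ckpt", new_ino=42)
```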