Inside DeepSeek 3FS: Architecture of a High‑Performance Parallel File System
This article dissects DeepSeek's 3FS parallel file system, detailing its four‑component architecture, high‑throughput RDMA networking, metadata handling with FoundationDB, client access methods, chain replication (CRAQ), custom FFRecord format, and recovery mechanisms, offering a deep technical perspective for storage engineers.
DeepSeek 3FS is a parallel file system designed to power all DeepSeek data access, leveraging modern SSDs and RDMA networks for high throughput.
1. Overall Architecture
3FS consists of four main components: Cluster Manager, Client, Meta Service, and Storage Service. All components communicate over an RDMA network (InfiniBand in DeepSeek).
Cluster Manager
The Cluster Manager acts as the control plane, handling node management, leader election, and heartbeats.
Implements multi‑node hot standby using FoundationDB for leader election.
Monitors Meta Service and Storage Service nodes via periodic heartbeats and notifies the cluster of status changes.
Reclaims file write handles from disconnected clients.
Client
Provides two access methods:
FUSE client (hf3fs_fuse): Easy to use and supports common POSIX interfaces, but not optimal for performance.
Native client (USRBIO): SDK‑based and requires application code changes, but delivers 3‑5× higher performance than FUSE.
Meta Service
Provides the metadata service with a design that separates compute from storage:
Metadata is persisted in FoundationDB, which supplies transactional semantics for directory tree operations.
Stateless and horizontally scalable; translates POSIX directory operations into FoundationDB read/write transactions.
Storage Service
Provides data storage with a design that co‑locates compute and storage:
Each storage node manages local SSD resources and offers read/write capabilities.
Data is stored with three‑way replication using the CRAQ (Chain Replication with Apportioned Queries) protocol, which gives write‑all‑read‑any semantics.
Data is chunked and distributed across multiple SSDs for load balancing.
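To make the chunking idea concrete, here is a minimal sketch of mapping a file's chunks onto replication chains. The chunk size, the hash‑based assignment, and all names are illustrative assumptions for this sketch; 3FS's actual placement is driven by the cluster's routing tables, not a bare hash.

```python
import zlib

CHUNK_SIZE = 512 * 1024  # illustrative chunk size, not the 3FS default


def place_chunks(file_id: int, file_size: int, chains: list[str]) -> list[tuple[int, str]]:
    """Map each chunk index of a file to a replication chain.

    Hash-based assignment is a stand-in for 3FS's real placement logic;
    the point is that consecutive chunks land on different chains, so
    reads and writes of one file spread across many SSDs.
    """
    n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    layout = []
    for idx in range(n_chunks):
        key = f"{file_id}:{idx}".encode()
        layout.append((idx, chains[zlib.crc32(key) % len(chains)]))
    return layout
```

A 3‑chunk file therefore touches up to three chains in parallel instead of serializing on a single disk.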
2. Detailed Architecture – Cluster Management
A 3FS cluster can have one or more management service nodes (mgmtd). Only one mgmtd is the leader; others are standby and respond to queries. All nodes report heartbeats to the leader, which maintains routing information.
Leader Election
Leader election relies on leases stored in FoundationDB. Each mgmtd checks lease validity every 10 seconds; the node that successfully acquires or renews the lease becomes (or remains) the leader. Because the check‑and‑write runs inside a FoundationDB transaction, competing nodes are serialized, preventing split‑brain scenarios.
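The lease check can be sketched as a single read‑check‑write round. In 3FS this round runs inside one FoundationDB transaction, which is what serializes competing mgmtd nodes; the in‑memory dict, the TTL value, and the node names below are illustrative stand‑ins.

```python
LEASE_TTL = 30.0  # illustrative TTL; 3FS checks lease validity every 10 s


def try_acquire_lease(store: dict, node_id: str, now: float) -> str:
    """One election round; returns the current leader.

    In 3FS this whole function body would be a single FoundationDB
    transaction, so two nodes can never both see an expired lease
    and both take over.
    """
    lease = store.get("mgmtd_lease")
    if lease is None or now - lease["granted_at"] > LEASE_TTL:
        # Lease missing or expired: this node takes over as leader.
        store["mgmtd_lease"] = {"holder": node_id, "granted_at": now}
    elif lease["holder"] == node_id:
        # The current leader renews its own lease.
        lease["granted_at"] = now
    return store["mgmtd_lease"]["holder"]
```

A standby node that polls while the lease is valid simply learns who the leader is; it only wins after the lease expires.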
3. Detailed Architecture – Client
FUSE Client
Implemented with libfuse low‑level API (requires libfuse ≥ 3.16.1) and uses C++20 coroutines. Each request passes through kernel VFS, FUSE forwarding, and user‑space daemon, incurring four context switches and 1‑2 copies, which limits performance.
USRBIO Native Client
Uses a zero‑copy, asynchronous API based on shared‑memory ring buffers (Iov and Ior files). It eliminates kernel‑user transitions, achieving 3‑5× higher throughput than the FUSE client.
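The ring‑buffer mechanism can be illustrated with a toy single‑producer/single‑consumer queue: the application publishes requests by bumping a tail index, and the daemon consumes them by bumping a head index, with no syscall on the hot path. This is only a sketch of the idea; the real Iov/Ior structures live in shared memory and carry zero‑copy buffer descriptors, and the class below is not the USRBIO API.

```python
class IoRing:
    """Toy submission ring loosely modeled on USRBIO's shared-memory Ior.

    Illustrative only: real USRBIO rings live in shared memory mapped by
    both the application and the 3FS daemon, so publishing an entry needs
    no kernel transition at all.
    """

    def __init__(self, entries: int):
        self.entries = entries
        self.buf = [None] * entries
        self.head = 0  # next slot the consumer (daemon side) reads
        self.tail = 0  # next slot the producer (application side) writes

    def submit(self, req) -> bool:
        if self.tail - self.head == self.entries:
            return False  # ring full; caller retries or waits
        self.buf[self.tail % self.entries] = req
        self.tail += 1    # publish: the daemon polls for tail advancing
        return True

    def poll(self):
        if self.head == self.tail:
            return None   # nothing pending
        req = self.buf[self.head % self.entries]
        self.head += 1
        return req
```

Contrast this with the FUSE path above: there, every request costs context switches and copies; here, submission is a couple of memory writes.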
USRBIO details can be found in the API reference: https://github.com/deepseek-ai/3FS/blob/main/src/lib/api/UsrbIo.md
4. Detailed Architecture – Storage Service
CRAQ Replication
Data nodes form chains; writes propagate from the chain head to the tail, and the tail confirms writes. CRAQ’s write‑all‑read‑any property improves read performance, and the system supports dynamic stripe sizing to limit the number of chains involved for small files.
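The head‑to‑tail write flow and read‑any property can be sketched in a few lines. This toy model collapses the asynchronous propagation and acknowledgement into one synchronous call and keeps every version in memory; it is a simplification of CRAQ, not the 3FS implementation.

```python
class Chain:
    """Toy CRAQ chain: writes flow head -> tail, any replica serves clean reads."""

    def __init__(self, n_replicas: int):
        # Each replica keeps the versions it has seen: key -> {version: value}.
        self.replicas = [dict() for _ in range(n_replicas)]
        # The tail's view of the last committed (clean) version per key.
        self.committed = {}

    def write(self, key, value):
        ver = self.committed.get(key, 0) + 1
        for rep in self.replicas:                  # propagate head -> tail (dirty)
            rep.setdefault(key, {})[ver] = value
        self.committed[key] = ver                  # tail commits; ack flows back up

    def read(self, key, replica_idx):
        # In CRAQ a replica holding only clean versions answers directly;
        # one holding a newer dirty version asks the tail which version is
        # committed. This toy model always consults the committed table.
        ver = self.committed[key]
        return ver, self.replicas[replica_idx][key][ver]
```

The payoff is visible in `read`: any of the three replicas can serve a committed value, so read throughput scales with chain length instead of bottlenecking on one node.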
FFRecord File Format
FFRecord merges many small files into larger records to reduce open‑file overhead. It stores per‑sample offsets and CRC32 checksums for random access and data integrity.
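A simplified writer/reader shows the core idea: a small index of per‑sample offsets and CRC32 checksums in front of the concatenated payloads, so any sample can be read with one seek and verified independently. The exact byte layout below (count, then fixed‑width index entries, then data) is illustrative and not the real FFRecord on‑disk spec.

```python
import io
import struct
import zlib


def pack_records(samples: list[bytes]) -> bytes:
    """Pack samples behind an offset + CRC32 index (simplified FFRecord-like layout)."""
    body, offsets, crcs = io.BytesIO(), [], []
    for s in samples:
        offsets.append(body.tell())
        crcs.append(zlib.crc32(s))
        body.write(s)
    out = io.BytesIO()
    out.write(struct.pack("<I", len(samples)))          # sample count
    for off, crc in zip(offsets, crcs):
        out.write(struct.pack("<QI", off, crc))         # 12-byte index entry
    out.write(body.getvalue())
    return out.getvalue()


def read_sample(blob: bytes, i: int) -> bytes:
    """Random access to sample i: one index lookup, one slice, one CRC check."""
    (n,) = struct.unpack_from("<I", blob, 0)
    header = 4 + 12 * n
    off, crc = struct.unpack_from("<QI", blob, 4 + 12 * i)
    end = struct.unpack_from("<QI", blob, 4 + 12 * (i + 1))[0] if i + 1 < n else len(blob) - header
    data = blob[header + off : header + end]
    if zlib.crc32(data) != crc:
        raise ValueError("checksum mismatch: sample corrupted")
    return data
```

Packing thousands of training samples this way replaces thousands of `open()` calls with one, which is exactly the small‑file overhead FFRecord targets.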
Chunk Engine
Each storage node runs a Chunk Engine that manages chunk allocation (64 KiB–64 MiB) using bitmap‑based resource pools, persisting metadata in LevelDB/RocksDB. Writes use copy‑on‑write for modifications and atomic appends for tail writes.
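The bitmap pool plus copy‑on‑write pattern can be sketched as follows. One pool per chunk size class tracks free slots with a bitmap; a modification allocates a fresh slot, writes the new data there, and only then releases the old one, so a crash mid‑write never corrupts the existing chunk. All names here are illustrative, not the Chunk Engine's actual interfaces.

```python
class ChunkPool:
    """Toy bitmap allocator for one fixed chunk-size class (illustrative)."""

    def __init__(self, n_slots: int):
        self.free = [True] * n_slots   # one bit per physical chunk slot
        self.data = [None] * n_slots

    def allocate(self) -> int:
        for i, is_free in enumerate(self.free):
            if is_free:
                self.free[i] = False
                return i
        raise MemoryError("chunk pool exhausted")

    def release(self, slot: int) -> None:
        self.free[slot] = True
        self.data[slot] = None

    def cow_write(self, old_slot: int, new_bytes) -> int:
        """Copy-on-write modification: new slot first, release old slot last."""
        new_slot = self.allocate()
        self.data[new_slot] = new_bytes   # old chunk stays valid until this succeeds
        self.release(old_slot)
        return new_slot
```

A real engine would persist the slot‑to‑chunk mapping in LevelDB/RocksDB as the article notes; the linear bitmap scan here is the simplest correct stand‑in.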
Data Recovery
When a node fails, its targets are marked offline. Recovery proceeds by fetching remote metadata, comparing versions, and synchronizing missing chunks via full‑chunk replacement, ensuring consistency without service interruption.
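The version‑comparison step reduces to a simple diff: for each chunk the healthy replica advertises, fetch it whenever the local version is missing or older. The per‑chunk version numbers and function below are an illustrative sketch of that comparison, not 3FS's actual recovery RPCs.

```python
def plan_recovery(local: dict[str, int], remote: dict[str, int]) -> list[str]:
    """Chunks to re-fetch by full-chunk replacement.

    `local` / `remote` map chunk id -> version. Any chunk the remote
    replica has at a newer version (or that we lack entirely) must be
    pulled; chunks already up to date are skipped, so recovery traffic
    is proportional to what was missed while offline.
    """
    return sorted(
        chunk_id
        for chunk_id, remote_ver in remote.items()
        if local.get(chunk_id, -1) < remote_ver
    )
```

Because replacement is whole‑chunk, a recovering node never has to reconcile partial writes inside a chunk; it either has the committed version or fetches it.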
5. Detailed Architecture – Metadata Service
The Meta Service stores inode and dentry information in FoundationDB, encoding keys with prefixes to simulate logical tables. It provides POSIX‑compatible operations, handling directory inheritance, garbage collection, and transaction‑based conflict resolution.
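The key‑prefix trick can be made concrete with two tiny encoders. Fixed prefixes carve one sorted FoundationDB keyspace into logical "tables", and a big‑endian parent‑inode id keeps all of a directory's entries contiguous, so a `readdir` becomes a single range scan. The prefixes and widths below are illustrative assumptions, not 3FS's actual encoding.

```python
def dentry_key(parent_inode: int, name: str) -> bytes:
    """Key for a directory entry: prefix + big-endian parent id + child name.

    Big-endian ids sort numerically in FDB's byte-ordered keyspace, so
    every entry of one directory is a contiguous key range.
    """
    return b"DENT" + parent_inode.to_bytes(8, "big") + name.encode()


def inode_key(inode: int) -> bytes:
    """Key for an inode record, in a separate logical 'table' via its prefix."""
    return b"INOD" + inode.to_bytes(8, "big")
```

A `rename` then becomes one FoundationDB transaction that deletes the old dentry key and writes the new one; FDB's conflict detection rejects a concurrent transaction touching the same keys, which is the transaction‑based conflict resolution the article refers to.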
Conclusion
The article offers an in‑depth look at 3FS’s design choices, highlighting its high‑throughput architecture, RDMA‑based communication, CRAQ replication, and sophisticated client implementations, while noting differences from mainstream distributed file systems and setting the stage for future comparative analysis.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.