
Inside DeepSeek 3FS: Architecture of a High‑Performance Parallel File System

This article dissects DeepSeek's 3FS parallel file system, detailing its four‑component architecture, high‑throughput RDMA networking, metadata handling with FoundationDB, client access methods, chain replication (CRAQ), custom FFRecord format, and recovery mechanisms, offering a deep technical perspective for storage engineers.

ByteDance Cloud Native

DeepSeek 3FS is a parallel file system designed to power all DeepSeek data access, leveraging modern SSDs and RDMA networks for high throughput.

3FS architecture diagram

1. Overall Architecture

3FS consists of four main components: Cluster Manager, Client, Meta Service, and Storage Service. All components communicate over an RDMA network (InfiniBand in DeepSeek).

Cluster Manager

The Cluster Manager acts as the control plane, handling node management, leader election, and heartbeats.

Implements multi‑node hot standby using FoundationDB for leader election.

Monitors Meta Service and Storage Service nodes via periodic heartbeats and notifies the cluster of status changes.

Reclaims file write handles from disconnected clients.

Client

Provides two access methods:

FUSE client (hf3fs_fuse): Easy to use, supports common POSIX interfaces, but not optimal for performance.

Native client (USRBIO): SDK‑based, requires application code changes, delivers 3‑5× higher performance than FUSE.

Meta Service

Offers metadata services with a compute‑separate design:

Metadata is persisted in FoundationDB, which supplies transactional semantics for directory tree operations.

Stateless and horizontally scalable; translates POSIX directory operations into FoundationDB read/write transactions.

Storage Service

Provides data storage with a compute‑integrated design:

Each storage node manages local SSD resources and offers read/write capabilities.

Data is stored with three‑way replication using the CRAQ (Chain Replication with Apportioned Queries) protocol, which gives write‑all‑read‑any semantics.

Data is chunked and distributed across multiple SSDs for load balancing.

2. Detailed Architecture – Cluster Management

A 3FS cluster can have one or more management service nodes (mgmtd). Only one mgmtd is the leader; others are standby and respond to queries. All nodes report heartbeats to the leader, which maintains routing information.

Cluster management diagram

Leader Election

Leader election relies on leases stored in FoundationDB. Each mgmtd checks the lease state every 10 seconds: the current leader renews its lease, while a standby can claim leadership only after the lease has expired. Because all lease reads and writes are serialized by FoundationDB transactions, split‑brain scenarios are prevented.
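The election logic can be sketched as a single conditional write, here against an in‑memory dictionary standing in for FoundationDB (the key name and TTL are illustrative; in 3FS the read‑check‑write would run inside one serializable FDB transaction):

```python
class LeaseStore:
    """In-memory stand-in for FoundationDB; the real system runs the
    read-check-write below as one serializable FDB transaction, so only
    one claimant can succeed per lease period."""
    def __init__(self):
        self.kv = {}

    def try_acquire(self, node_id, now, ttl=60):
        # Hypothetical lease record: (holder, expiry timestamp).
        lease = self.kv.get("mgmtd_lease")
        if lease is None or lease[1] <= now or lease[0] == node_id:
            self.kv["mgmtd_lease"] = (node_id, now + ttl)
            return True
        return False  # another mgmtd still holds a valid lease

store = LeaseStore()
assert store.try_acquire("mgmtd-1", now=0)        # first claim wins
assert not store.try_acquire("mgmtd-2", now=10)   # lease still valid
assert store.try_acquire("mgmtd-1", now=10)       # holder renews freely
assert store.try_acquire("mgmtd-2", now=100)      # expired: takeover
```

Serializing the claim through the database is what makes the periodic 10‑second check safe: two standbys racing after an expiry cannot both observe the stale lease and both write.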

3. Detailed Architecture – Client

FUSE Client

Implemented with libfuse low‑level API (requires libfuse ≥ 3.16.1) and uses C++20 coroutines. Each request passes through kernel VFS, FUSE forwarding, and user‑space daemon, incurring four context switches and 1‑2 copies, which limits performance.

USRBIO Native Client

Uses a zero‑copy, asynchronous API based on shared‑memory ring buffers (Iov and Ior files). It eliminates kernel‑user transitions, achieving 3‑5× higher throughput than the FUSE client.
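The submission/completion ring idea can be illustrated with a toy model (class and method names here are invented for illustration and are not the USRBIO API; the real Iov/Ior buffers live in shared memory and are drained by a daemon issuing RDMA reads):

```python
from collections import deque

class IoRing:
    """Toy model of a USRBIO-style I/O ring: the application enqueues
    requests into a submission queue and later reaps completions,
    avoiding a kernel transition per I/O."""
    def __init__(self, depth):
        self.depth = depth
        self.sq = deque()   # submission entries
        self.cq = deque()   # completion entries

    def submit_read(self, fd, offset, length):
        if len(self.sq) >= self.depth:
            raise BufferError("ring full")
        self.sq.append((fd, offset, length))

    def process(self, storage):
        # Stands in for the daemon thread; here we just slice bytes
        # out of a dict of in-memory "files".
        while self.sq:
            fd, off, ln = self.sq.popleft()
            self.cq.append(storage[fd][off:off + ln])

    def reap(self):
        return self.cq.popleft() if self.cq else None

ring = IoRing(depth=4)
ring.submit_read(fd=3, offset=4, length=5)
ring.process({3: b"0123456789"})
assert ring.reap() == b"45678"
```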

USRBIO details can be found in the API reference: https://github.com/deepseek-ai/3FS/blob/main/src/lib/api/UsrbIo.md

4. Detailed Architecture – Storage Service

CRAQ Replication

Data nodes form chains; writes propagate from the chain head to the tail, and the tail confirms writes. CRAQ’s write‑all‑read‑any property improves read performance, and the system supports dynamic stripe sizing to limit the number of chains involved for small files.
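A minimal sketch of the CRAQ write and read paths, with per‑replica clean/dirty versions (simplified: real CRAQ tracks version numbers and pipelines writes rather than committing in one pass):

```python
class CraqNode:
    """Minimal CRAQ replica: a committed 'clean' version plus an
    optional 'dirty' version still propagating down the chain."""
    def __init__(self):
        self.clean = None
        self.dirty = None

def craq_write(chain, value):
    # Writes enter at the head and flow toward the tail as "dirty".
    for node in chain:
        node.dirty = value
    # The tail commits; the ack flowing back marks replicas clean.
    for node in reversed(chain):
        node.clean, node.dirty = value, None

def craq_read(node, tail):
    # Read-any: a clean replica answers locally; a dirty replica must
    # ask the tail which version is committed (the apportioned query).
    return node.clean if node.dirty is None else tail.clean

chain = [CraqNode() for _ in range(3)]
craq_write(chain, b"v1")
assert all(craq_read(n, chain[-1]) == b"v1" for n in chain)
```

This is what "write‑all‑read‑any" buys: steady‑state reads fan out across all three replicas instead of funneling through the tail.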

FFRecord File Format

FFRecord merges many small files into larger records to reduce open‑file overhead. It stores per‑sample offsets and CRC32 checksums for random access and data integrity.
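A simplified FFRecord‑style layout (illustrative, not byte‑exact to the real format): a sample count, an offset table, a CRC32 table, then the concatenated sample bytes, so any sample can be fetched with one ranged read and verified:

```python
import struct
import zlib

def ffrecord_write(samples):
    """Pack samples as [count][offset table][crc32 table][data...]."""
    count = len(samples)
    header = 8 + 8 * count + 4 * count      # bytes before first sample
    offsets, crcs, pos = [], [], header
    for s in samples:
        offsets.append(pos)
        crcs.append(zlib.crc32(s))
        pos += len(s)
    blob = struct.pack("<Q", count)
    blob += struct.pack(f"<{count}Q", *offsets)
    blob += struct.pack(f"<{count}I", *crcs)
    return blob + b"".join(samples)

def ffrecord_read(blob, i):
    """Random-access one sample and verify its checksum."""
    count, = struct.unpack_from("<Q", blob, 0)
    offsets = struct.unpack_from(f"<{count}Q", blob, 8)
    crcs = struct.unpack_from(f"<{count}I", blob, 8 + 8 * count)
    end = offsets[i + 1] if i + 1 < count else len(blob)
    data = blob[offsets[i]:end]
    assert zlib.crc32(data) == crcs[i], "corrupt sample"
    return data

blob = ffrecord_write([b"img0", b"label-one", b"img2"])
assert ffrecord_read(blob, 1) == b"label-one"
```

Because the offset table is read once per record file, a training loader touching millions of samples pays one open per merged file instead of one per sample.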

Chunk Engine

Each storage node runs a Chunk Engine that manages chunk allocation (64 KiB–64 MiB) using bitmap‑based resource pools, persisting metadata in LevelDB/RocksDB. Writes use copy‑on‑write for modifications and atomic appends for tail writes.
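The bitmap‑based pool can be sketched for a single size class (in‑memory only; the real engine keeps one pool per size from 64 KiB to 64 MiB and persists the allocation state in LevelDB/RocksDB):

```python
class ChunkPool:
    """Bitmap-backed allocator for one chunk size class.
    Bit i set means chunk i is in use."""
    def __init__(self, chunk_size, nchunks):
        self.chunk_size = chunk_size
        self.nchunks = nchunks
        self.bitmap = 0

    def allocate(self):
        for i in range(self.nchunks):
            if not (self.bitmap >> i) & 1:
                self.bitmap |= 1 << i
                return i * self.chunk_size   # byte offset on the SSD
        raise MemoryError("pool exhausted")

    def free(self, offset):
        self.bitmap &= ~(1 << (offset // self.chunk_size))

pool = ChunkPool(chunk_size=64 * 1024, nchunks=4)
a, b = pool.allocate(), pool.allocate()
pool.free(a)
assert pool.allocate() == a    # freed slot is reused first
```

Copy‑on‑write then falls out naturally: an overwrite allocates a fresh chunk from the pool, writes the new data, flips the metadata pointer, and frees the old bit.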

Data Recovery

When a node fails, its targets are marked offline. Recovery proceeds by fetching remote metadata, comparing versions, and synchronizing missing chunks via full‑chunk replacement, ensuring consistency without service interruption.
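The version comparison step can be sketched as a planning function (field names are illustrative): any chunk that the recovering target is missing, or holds at a lower version than a healthy chain member, is scheduled for full‑chunk replacement.

```python
def plan_recovery(local_meta, remote_meta):
    """Compare per-chunk versions between a restarted target (local)
    and a healthy chain member (remote); return chunks to re-fetch."""
    to_fetch = []
    for chunk_id, remote_ver in remote_meta.items():
        if local_meta.get(chunk_id, -1) < remote_ver:
            to_fetch.append(chunk_id)   # missing or stale locally
    return sorted(to_fetch)

local = {"c1": 3, "c2": 5}
remote = {"c1": 4, "c2": 5, "c3": 1}   # c1 stale, c3 missing locally
assert plan_recovery(local, remote) == ["c1", "c3"]
```

Replacing whole chunks rather than diffing byte ranges keeps recovery simple and idempotent: a sync that is interrupted can safely be re‑planned and re‑run.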

5. Detailed Architecture – Metadata Service

The Meta Service stores inode and dentry information in FoundationDB, encoding keys with prefixes to simulate logical tables. It provides POSIX‑compatible operations, handling directory inheritance, garbage collection, and transaction‑based conflict resolution.
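The prefix‑encoding idea can be sketched as follows (the prefixes, field widths, and helper names are hypothetical, not the actual 3FS key schema): each "logical table" is a distinct key prefix inside FoundationDB's single ordered keyspace, and ordering does the rest.

```python
import struct

def inode_key(inode_id):
    # "INOD" prefix simulates an inode table.
    return b"INOD" + struct.pack(">Q", inode_id)

def dentry_key(parent_id, name):
    # Big-endian parent id keeps all entries of one directory adjacent
    # in key order, so a readdir becomes a single FDB range scan.
    return b"DENT" + struct.pack(">Q", parent_id) + name.encode()

k1 = dentry_key(42, "a.txt")
k2 = dentry_key(42, "b.txt")
k3 = dentry_key(43, "a.txt")
assert k1 < k2 < k3   # same-directory entries sort together
```

A rename or unlink then becomes a handful of key reads and writes in one FDB transaction, which is where the POSIX‑compatible atomicity comes from.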

Metadata service architecture

Conclusion

The article offers an in‑depth look at 3FS’s design choices, highlighting its high‑throughput architecture, RDMA‑based communication, CRAQ replication, and sophisticated client implementations, while noting differences from mainstream distributed file systems and setting the stage for future comparative analysis.

Tags: distributed file system, RDMA, metadata service, chain replication, high-performance storage
Written by ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.