
Inside 3FS: How DeepSeek’s Parallel File System Powers AI Training

This article dives deep into DeepSeek's 3FS parallel file system, detailing its four-component architecture, RDMA‑based high‑speed networking, client options, metadata and storage services, replication protocols, dynamic stripe sizing, and recovery mechanisms that enable efficient AI model training and inference.

Volcano Engine Developer Services

3FS Overall Architecture

Similar to many distributed file systems, 3FS consists of four parts: Cluster Manager, Client, Meta Service, and Storage Service, all interconnected via RDMA (InfiniBand) for high‑speed communication.

Cluster Manager

Acts as the central controller, handling node management with multi‑node hot‑standby for high availability, using FoundationDB‑based leader election, and coordinating heartbeats and client state recovery.
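The article doesn't show how the FoundationDB-based election works internally; a minimal sketch is a compare-and-swap on a single leader key with a lease expiry, so a backup manager can take over when the leader's lease lapses. The `KV` class, the `mgmtd/leader` key name, and the lease length are all illustrative, not 3FS's actual schema:

```python
import time

class KV:
    """Toy stand-in for a transactional KV store such as FoundationDB."""
    def __init__(self):
        self.data = {}

    def compare_and_set(self, key, expected, new):
        # Atomic inside one transaction in a real store; single-threaded here.
        if self.data.get(key) == expected:
            self.data[key] = new
            return True
        return False

def try_become_leader(kv, node_id, lease_secs=10, now=None):
    """A node wins if no leader key exists or the old lease has expired."""
    now = time.time() if now is None else now
    current = kv.data.get("mgmtd/leader")
    if current is None:
        return kv.compare_and_set("mgmtd/leader", None, (node_id, now + lease_secs))
    holder, expiry = current
    if now > expiry:  # stale lease: attempt takeover
        return kv.compare_and_set("mgmtd/leader", current, (node_id, now + lease_secs))
    return holder == node_id  # already leader, or lose the election
```

Because the decision rides on one atomic compare-and-swap, two backups racing for a stale lease cannot both win.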

Client

Provides two access methods: a user‑friendly FUSE client (hf3fs_fuse) supporting POSIX interfaces with lower performance, and a native USRBIO client offering SDK‑style integration and 3‑5× higher throughput by eliminating kernel‑user context switches and using zero‑copy RDMA buffers.

Meta Service

Offers metadata services with a compute‑storage separation design, persisting metadata in FoundationDB transactions to support POSIX directory semantics; the service is stateless and horizontally scalable.

Storage Service

Delivers data storage using an integrated compute‑storage model: each node manages local SSDs, stores three replicas via the CRAQ chain‑replication protocol (write‑all‑read‑any), and distributes data chunks across nodes for load balancing.

Cluster Management Details

A 3FS cluster can have one or multiple manager nodes (mgmtd), with a single active leader and backups. Nodes report heartbeats and lease information to the leader, which persists node metadata in FoundationDB to survive leader switches.
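The heartbeat-and-lease bookkeeping the leader performs can be sketched as follows; the lease length and node names are made-up values, and a real manager would also persist this state to FoundationDB as described above:

```python
class HeartbeatTable:
    """Illustrative lease tracking: the leader records each node's last
    heartbeat and treats a node as offline once its lease lapses."""
    def __init__(self, lease_secs=30):
        self.lease_secs = lease_secs
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def live_nodes(self, now):
        return {n for n, t in self.last_seen.items()
                if now - t <= self.lease_secs}
```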

Client Architecture

The FUSE client relies on libfuse low‑level API with C++20 coroutines, while USRBIO uses shared‑memory ring buffers for zero‑copy, asynchronous communication, similar to DPDK or io_uring designs.

USRBIO Implementation

Each USRBIO instance uses an Iov file for data buffers and an Ior file for I/O rings.

The shared-memory files live in /dev/shm and are exposed to applications through symlinks under the virtual directory /3fs-virt/iovs/.

Three submit semaphore files manage I/O priorities and notify the FUSE daemon.
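The submit ring itself follows the same single-producer/single-consumer pattern as io_uring: the client library advances a tail index, the FUSE daemon advances a head index, and both sides index a fixed array with a power-of-two mask. The sketch below models that pattern with a Python list; in 3FS the array lives in a shared-memory (Ior) file, and the field names and depth here are illustrative:

```python
class SubmitRing:
    """Minimal SPSC ring in the spirit of USRBIO's submit ring.
    In shared memory, the tail increment would be a release-store so
    the consumer never observes an entry before it is fully written."""
    def __init__(self, depth):
        assert depth & (depth - 1) == 0, "depth must be a power of two"
        self.entries = [None] * depth
        self.mask = depth - 1
        self.head = 0  # consumer position (FUSE daemon)
        self.tail = 0  # producer position (client library)

    def submit(self, sqe):
        if self.tail - self.head == len(self.entries):
            return False  # ring full: caller must wait or retry
        self.entries[self.tail & self.mask] = sqe
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring empty
        sqe = self.entries[self.head & self.mask]
        self.head += 1
        return sqe
```

The monotonically increasing head/tail counters (masked on access) are what make the scheme zero-copy friendly: no entry is ever moved, only indices advance.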

Symlink “Black Magic”

Non‑standard operations are implemented via symlink handling, enabling actions like recursive delete, Iov/Ior creation, and config setting without custom ioctl tools.
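The general trick is that a FUSE daemon sees the target string of every symlink() call, so special targets can be dispatched as control commands instead of creating a real link. The `cmd:` prefix and operation names below are invented for illustration; 3FS's actual symlink grammar differs:

```python
def handle_symlink(target, link_path):
    """Illustrative dispatcher for symlink-encoded control operations.
    Ordinary targets fall through to normal symlink creation; targets
    with a (made-up) 'cmd:' prefix trigger special actions instead."""
    if not target.startswith("cmd:"):
        return ("create_symlink", target, link_path)  # normal behavior
    op, _, arg = target[len("cmd:"):].partition("=")
    if op == "rmtree":
        return ("recursive_delete", link_path)
    if op == "iov-create":
        return ("create_iov", arg)  # e.g. buffer size passed as the argument
    raise ValueError(f"unknown control op: {op}")
```

The appeal is that any language that can call symlink() gets these operations for free, with no custom ioctl tool or out-of-band RPC channel.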

FFRecord File Format

To mitigate small‑file performance issues, 3FS defines the FFRecord format, which merges many small files, supports random batch reads, and includes CRC32 checksums for data integrity.
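The essential idea can be shown with a simplified pack/read pair: a header carries per-record offsets and CRC32s, so any record can be fetched and verified without scanning the file. This layout is a sketch of the concept and is NOT byte-compatible with the real FFRecord format:

```python
import struct
import zlib

def pack_records(records):
    """Pack small byte blobs into one file image, FFRecord-style:
    u64 record count, then (u32 crc, u64 offset) per record, then data."""
    n = len(records)
    header_size = 8 + n * 12
    blob, offsets, crcs = b"", [], []
    for rec in records:
        offsets.append(header_size + len(blob))
        crcs.append(zlib.crc32(rec))
        blob += rec
    header = struct.pack("<Q", n)
    for crc, off in zip(crcs, offsets):
        header += struct.pack("<IQ", crc, off)
    return header + blob

def read_record(buf, i):
    """Random access: slice out record i and verify its checksum."""
    (n,) = struct.unpack_from("<Q", buf, 0)
    crc, off = struct.unpack_from("<IQ", buf, 8 + i * 12)
    end = struct.unpack_from("<IQ", buf, 8 + (i + 1) * 12)[1] if i + 1 < n else len(buf)
    rec = buf[off:end]
    assert zlib.crc32(rec) == crc, "corrupted record"
    return rec
```

Because offsets are known up front, a data loader can issue one batched read for many non-contiguous samples, which is exactly the access pattern of shuffled training.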

Storage Service Architecture

Designed for high throughput, the system scales linearly with SSD and network bandwidth, using CRAQ chain replication for reliability and a dynamic stripe size mechanism to reduce unnecessary storage node communication for small files.
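A hedged sketch of dynamic stripe sizing: the stripe width grows with file size, so a small file contacts only one or a few storage targets while a large file fans out for bandwidth. The chunk size and stripe cap below are made-up defaults, not 3FS's actual parameters:

```python
def stripe_count(file_size, chunk_size=4 << 20, max_stripe=16):
    """Pick how many storage targets a file spans: one chunk's worth of
    data needs one target; bigger files spread out, capped at max_stripe."""
    chunks = max(1, -(-file_size // chunk_size))  # ceiling division
    return min(chunks, max_stripe)
```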

Write and Read Workflows

Writes propagate from the client through the chain head to tail, with acknowledgments flowing back; reads can be served by any node in the chain after version checks, improving read parallelism.
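The chain flow above can be sketched with toy replica nodes: a write marks data dirty, forwards to the successor, and commits as the acknowledgment returns from the tail; reads serve only committed data. Real CRAQ tracks per-chunk versions and resolves dirty reads by consulting the tail, which this single-key, single-threaded sketch elides:

```python
class ChainNode:
    """Toy CRAQ-style replica in a head -> ... -> tail chain."""
    def __init__(self):
        self.committed = {}
        self.dirty = {}
        self.next = None  # successor node, None at the tail

    def write(self, key, value):
        self.dirty[key] = value          # mark dirty on the way down
        if self.next is not None:
            self.next.write(key, value)  # propagate toward the tail
        self.committed[key] = self.dirty.pop(key)  # commit on the ack path

    def read(self, key):
        if key in self.dirty:
            return None  # real CRAQ would ask the tail for the clean version
        return self.committed.get(key)
```

Once a write commits, every node holds it, which is what lets reads land on any replica (write-all-read-any) and multiplies read bandwidth by the replication factor.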

Chunk Engine

Manages chunk files, allocation, and metadata (LevelDB/RocksDB), supporting copy‑on‑write and append‑only writes to avoid write amplification.
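Copy-on-write updates can be sketched as: write modified data to a freshly allocated physical location, then atomically swap the metadata pointer, leaving the old chunk intact for in-flight readers. The dict-backed metadata below stands in for the engine's LevelDB/RocksDB store:

```python
class ChunkEngine:
    """Sketch of copy-on-write chunk updates (layout is illustrative)."""
    def __init__(self):
        self.meta = {}     # chunk_id -> current physical location
        self.store = {}    # physical location -> chunk bytes
        self.next_loc = 0  # trivial allocator

    def write(self, chunk_id, offset, data):
        old = self.store.get(self.meta.get(chunk_id), b"")
        buf = bytearray(old.ljust(offset, b"\0"))  # pad if writing past EOF
        buf[offset:offset + len(data)] = data
        loc, self.next_loc = self.next_loc, self.next_loc + 1
        self.store[loc] = bytes(buf)  # new physical chunk, never in place
        self.meta[chunk_id] = loc     # atomic metadata pointer swap

    def read(self, chunk_id):
        return self.store[self.meta[chunk_id]]
```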

Data Recovery

When a storage node fails, it is marked offline; the recovery process fetches remote metadata, synchronizes missing chunks, and writes them to the recovered node using full‑chunk replacement, allowing concurrent writes and recovery.
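The planning step of that recovery can be sketched by diffing chunk versions between a healthy replica and the rejoining node; any chunk that is stale or missing gets a full-chunk transfer. Plain integer versions and dict-shaped metadata are simplifications of what 3FS actually exchanges:

```python
def plan_recovery(healthy_meta, recovered_meta):
    """Return the chunk IDs the rejoining node must fetch: every chunk
    whose version on a healthy replica is newer than (or absent from)
    the recovering node's local metadata."""
    to_sync = []
    for chunk_id, version in healthy_meta.items():
        if recovered_meta.get(chunk_id, -1) < version:
            to_sync.append(chunk_id)
    return sorted(to_sync)
```

Full-chunk replacement keeps this logic simple: there is no byte-range bookkeeping, so recovery can proceed while new writes continue to bump versions.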

Metadata Service

Built on FoundationDB, providing transactional KV storage with strong ACID guarantees; Meta Service translates POSIX operations into KV transactions, ensuring consistency and scalability.
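The translation of a POSIX operation into a KV transaction can be sketched with mkdir: one transaction inserts a directory entry keyed by (parent inode, name) and an inode record, failing atomically if the name already exists. The key layout and the dict standing in for a FoundationDB transaction are illustrative, not 3FS's actual schema:

```python
def mkdir_txn(kv, parent_ino, name, new_ino):
    """Sketch of POSIX mkdir as a single transaction: both the dentry
    and the inode record commit together, or not at all."""
    dent_key = ("dent", parent_ino, name)
    if dent_key in kv:
        raise FileExistsError(name)
    # In FoundationDB these two writes land atomically in one transaction.
    kv[dent_key] = new_ino
    kv[("inode", new_ino)] = {"type": "dir", "nlink": 2}

def lookup(kv, parent_ino, name):
    """Path resolution walks one (parent, name) dentry key per component."""
    return kv.get(("dent", parent_ino, name))
```

Because keys embed the parent inode rather than full paths, renames and lookups touch a bounded number of keys, and the stateless Meta Service instances can all issue such transactions concurrently.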

Tags: High Performance, distributed file system, RDMA, AI training, metadata service, parallel storage
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
