
Inside 3FS: How Distributed File Systems Hide Complexity and Scale

3FS is an open‑source distributed file system that presents many machines as a single namespace, offering massive scalability, fault tolerance, and high throughput. It is built from four node types (Meta, Mgmtd, Storage, and Client) and uses the CRAQ replication protocol for strongly consistent, efficient reads and writes.


What is a Distributed File System?

A distributed file system (DFS) tricks applications into thinking they are interacting with a regular local file system, even though the data is spread across many machines. For example, the path /3fs/stage/notes.txt appears as a single file.

Running the same mkdir and cat commands against a local file system and a DFS mount produces identical results, demonstrating the abstraction.
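This abstraction can be sketched with ordinary POSIX calls. The snippet below uses a temporary local directory as the root; on a real 3FS deployment the root would instead be the mounted namespace (e.g. /3fs), and none of the calls would change:

```python
import os
import tempfile

# The same POSIX calls work whether the path is local or a DFS mount.
# Here the root is a temporary local directory; on a 3FS mount the root
# would be the distributed namespace and the code below stays identical.
root = tempfile.mkdtemp()

os.makedirs(os.path.join(root, "stage"), exist_ok=True)   # mkdir -p stage
path = os.path.join(root, "stage", "notes.txt")

with open(path, "w") as f:                                # write notes.txt
    f.write("hello from a single namespace\n")

with open(path) as f:                                     # cat notes.txt
    print(f.read(), end="")
```

The application never learns (or cares) whether `root` is backed by one disk or by a cluster.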

Advantages of Distributed File Systems

Compared with local storage, DFS offers two major benefits: the ability to handle massive data volumes (up to petabytes) with high throughput, and built‑in fault‑tolerance and redundancy so the system continues operating even if machines or disks fail.

DFS is widely used in many real‑world scenarios, including:

Parallel processing frameworks (e.g., Spark jobs reading from HDFS)

Machine‑learning training pipelines with data loaders and checkpoints

Large internal code/data repositories such as Google’s Colossus

Travel‑industry applications

Photo‑storage services and similar businesses

Deep Dive into 3FS

DeepSeek’s open‑source 3FS implements a DFS with four main node types:

Meta – manages metadata such as file locations, attributes, and paths.

Mgmtd – controls cluster configuration, tracks active nodes, and maintains replication factors.

Storage – stores the actual file data on physical disks.

Client – communicates with all other nodes to read and modify the file system.
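The division of labor on a read can be sketched with toy stand-ins for these roles. This is illustrative only: the class and method names below are invented for the sketch, and in-memory dicts stand in for 3FS's RPC services and FoundationDB:

```python
# Toy versions of the 3FS roles, showing a read path: the client asks the
# meta node where a file's chunks live, then fetches each chunk from the
# storage node that holds it. (Mgmtd, which tracks node membership, is
# omitted here for brevity.)

class Meta:
    def __init__(self):
        self.files = {}            # path -> list of (storage_id, chunk_id)
    def lookup(self, path):
        return self.files[path]

class Storage:
    def __init__(self):
        self.chunks = {}           # chunk_id -> bytes
    def read(self, chunk_id):
        return self.chunks[chunk_id]

class Client:
    def __init__(self, meta, storages):
        self.meta, self.storages = meta, storages
    def read_file(self, path):
        # 1) metadata lookup, 2) chunk fetches, 3) reassembly in order
        data = b""
        for storage_id, chunk_id in self.meta.lookup(path):
            data += self.storages[storage_id].read(chunk_id)
        return data

meta = Meta()
s0, s1 = Storage(), Storage()
s0.chunks["c0"] = b"hello "
s1.chunks["c1"] = b"world"
meta.files["/3fs/stage/notes.txt"] = [(0, "c0"), (1, "c1")]
client = Client(meta, {0: s0, 1: s1})
print(client.read_file("/3fs/stage/notes.txt"))  # b'hello world'
```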

1. Mgmtd

Mgmtd tracks running nodes. Storage and meta nodes register at startup and send periodic heartbeats. It provides a centralized view of node health, allowing the system to detect failures quickly. Nodes discover each other by querying Mgmtd, which also stores the configuration of the replication chain (CRAQ).
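The heartbeat mechanism can be sketched as a timestamp table with a timeout. The names below are illustrative, not 3FS's actual API; the timeout value is arbitrary:

```python
import time

# Toy heartbeat tracker in the spirit of Mgmtd. Nodes register at startup,
# then refresh their timestamp periodically; a node whose last heartbeat is
# older than the timeout is considered failed.

class Mgmtd:
    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.last_seen = {}                    # node_id -> last heartbeat time

    def register(self, node_id, now=None):
        self.last_seen[node_id] = now if now is not None else time.time()

    heartbeat = register                       # a heartbeat just refreshes the timestamp

    def alive_nodes(self, now=None):
        now = now if now is not None else time.time()
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout_s}

m = Mgmtd(timeout_s=5.0)
m.register("storage-1", now=0.0)
m.register("meta-1", now=0.0)
m.heartbeat("storage-1", now=4.0)
print(m.alive_nodes(now=8.0))  # {'storage-1'}: meta-1 missed its heartbeats
```

Because every node reports to the same place, any client can get a consistent membership view with one query instead of probing the whole cluster.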

2. Meta

The meta service handles typical file‑system operations (open, create, stat, unlink) via RPC. Metadata lives in inodes stored in FoundationDB, with DirEntry objects mapping paths to inodes. Inodes contain size, permissions, owner, timestamps, etc. Sessions are tracked so that open files can be recovered if a client disconnects.
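The inode/DirEntry split can be sketched as two tables: one holding file attributes, one mapping (parent directory, name) pairs to inode ids. This is a simplified model under that assumption; in 3FS these records live in FoundationDB, for which plain dicts stand in here:

```python
from dataclasses import dataclass

# Toy metadata model: a DirEntry maps a name within a parent directory to an
# inode id, and the inode holds the file's attributes.

@dataclass
class Inode:
    inode_id: int
    size: int = 0
    mode: int = 0o644
    owner: str = "root"

dir_entries = {}                               # (parent_inode_id, name) -> inode_id
inodes = {0: Inode(inode_id=0, mode=0o755)}    # inode 0 is the root directory

def create(parent_id, name, inode):
    inodes[inode.inode_id] = inode
    dir_entries[(parent_id, name)] = inode.inode_id

def resolve(path):
    """Walk path components from the root: one DirEntry lookup per component."""
    node_id = 0
    for part in path.strip("/").split("/"):
        node_id = dir_entries[(node_id, part)]
    return inodes[node_id]

create(0, "stage", Inode(inode_id=1, mode=0o755))
create(1, "notes.txt", Inode(inode_id=2, size=42))
print(resolve("/stage/notes.txt").size)  # 42
```

Storing both tables in a transactional key-value store is what lets operations like rename update a DirEntry atomically without touching file data.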

3. Storage

Storage nodes break data into chunks. Each Chunk represents a physical disk block and tracks its ID, size, offset, checksum, and version. Workers such as AllocateWorker, PunchHoleWorker, and AioReadWorker manage allocation, reclamation, and reads (the latter using io_uring).
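A chunk record and its checksum guard can be sketched as follows. The field names mirror the description above but are illustrative; real 3FS chunks map to physical disk blocks and are served by the dedicated workers just mentioned:

```python
import zlib
from dataclasses import dataclass

# Toy Chunk record: the checksum is computed when the chunk is written
# ("sealed") and re-verified on read, so silent on-disk corruption is caught.

@dataclass
class Chunk:
    chunk_id: int
    offset: int
    version: int
    data: bytes
    checksum: int = 0

    def seal(self):
        self.checksum = zlib.crc32(self.data)

    def verify(self):
        return zlib.crc32(self.data) == self.checksum

c = Chunk(chunk_id=7, offset=0, version=1, data=b"block contents")
c.seal()
print(c.verify())          # True: data matches the stored checksum
c.data = b"bit-flipped!"   # simulate corruption on disk
print(c.verify())          # False: the mismatch is detected on read
```

The version field plays a similar role for replication: it lets a node tell whether its copy of a chunk is stale relative to the chain.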

4. CRAQ (Chain Replication with Apportioned Queries)

CRAQ provides strong consistency and linearizability. Writes travel from the head of the chain to the tail, with each node marking the new version dirty until the tail commits it and the acknowledgment propagates back, making it clean. Any node can serve a read of a clean object directly; if the object is dirty, the node asks the tail which version has been committed and serves that version.

Performance of CRAQ varies with workload. Write throughput and latency are limited by the slowest node in the chain, and read latency can increase under Zipfian workloads because many reads must reach the tail.
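The clean/dirty read rule can be sketched as follows. This is a minimal in-memory model of the protocol as described above, not 3FS's implementation; `write_dirty` simulates a write that is still in flight partway down the chain:

```python
# Toy CRAQ chain: writes flow head -> tail as dirty versions; the tail's
# acknowledgment flows back and marks them clean. Reads of a clean object are
# served locally; a dirty read asks the tail for the committed version number.

class Node:
    def __init__(self):
        self.versions = {}    # version number -> value
        self.clean = None     # highest committed (clean) version

class Chain:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def write(self, version, value):
        for node in self.nodes:               # propagate head -> tail (dirty)
            node.versions[version] = value
        for node in reversed(self.nodes):     # tail ack propagates back (clean)
            node.clean = version

    def write_dirty(self, version, value, upto):
        # Simulate an in-flight write that has reached nodes [0, upto)
        # but not yet the tail, so it is not committed anywhere.
        for node in self.nodes[:upto]:
            node.versions[version] = value

    def read(self, i):
        node = self.nodes[i]
        latest = max(node.versions)
        if latest == node.clean:              # clean: serve locally
            return node.versions[latest]
        committed = self.nodes[-1].clean      # dirty: ask the tail what is committed
        return node.versions[committed]

chain = Chain(3)
chain.write(1, "v1")
print(chain.read(0))                 # 'v1': clean, served by the head locally
chain.write_dirty(2, "v2", upto=2)   # new write has not reached the tail
print(chain.read(0))                 # still 'v1': the tail says version 1 is committed
```

This is exactly why hot, frequently written objects hurt read latency: their reads keep detouring through the tail.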

Using CRAQ in 3FS

In a typical 5‑node cluster with five SSDs per node, each chunk is replicated across three nodes, and the replication chains are spread so that replica sets do not pile up on the same few machines. With that placement, a single node failure degrades only the chains that include it, costing each affected chain one of its three replicas rather than taking out a larger fraction of the cluster. 3FS defaults to strongly consistent reads, with writes flowing head‑to‑tail and acknowledgments propagating back.
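The placement arithmetic is easy to check. As a simplified illustration (real 3FS placement also balances per-SSD load, which is ignored here), take every 3-node combination of a 5-node cluster as a replication chain:

```python
from itertools import combinations

# Balanced chain placement for 5 nodes with replication factor 3:
# using every 3-node combination spreads chains evenly, so each node
# appears in the same number of chains.

nodes = ["n1", "n2", "n3", "n4", "n5"]
chains = list(combinations(nodes, 3))          # C(5,3) = 10 chains

affected = [c for c in chains if "n3" in c]    # chains degraded if n3 fails
print(len(chains), len(affected))              # 10 6
```

Each node sits in 6 of the 10 chains, and a failed node leaves every affected chain with two of its three replicas still serving, which is the graceful degradation the paragraph above describes.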

Other Distributed File Systems

While many DFS share similar components (client, meta, storage, manager), they differ in workload suitability, tuning flexibility, deployment ease, throughput scaling, reliability, bottlenecks, fault‑tolerance algorithms, and hardware targeting.

The author plans to benchmark 3FS against single‑node systems and other DFS to evaluate performance, identify bottlenecks (CPU, memory, disk, network), and explore potential improvements.

Tags: fault tolerance, storage architecture, distributed file system, 3FS, CRAQ
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
