Inside 3FS: How Distributed File Systems Hide Complexity and Scale
3FS is an open‑source distributed file system that presents many machines as a single namespace. It offers massive scalability, fault tolerance, and high throughput through four components (Meta, Mgmtd, Storage, and Client), and it uses the CRAQ protocol for strong consistency and efficient reads and writes.
What is a Distributed File System?
A distributed file system (DFS) tricks applications into thinking they are interacting with a regular local file system, even though the data is spread across many machines. For example, the path /3fs/stage/notes.txt appears as a single file.
Running the same mkdir and cat commands against a local file system and a distributed one produces the same directories and file contents, which is exactly the abstraction a DFS aims for.
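A minimal sketch of that abstraction in Python: the function below only sees a path, so it behaves the same whether root is a local directory or a mounted DFS. The helper name write_and_read is hypothetical, and the demo runs against a temporary local directory; pointing it at a mount such as /3fs would assume the cluster is mounted there.

```python
import tempfile
from pathlib import Path

def write_and_read(root: Path) -> str:
    """Create a directory and a file under root, then read the file back."""
    stage = root / "stage"
    stage.mkdir(parents=True, exist_ok=True)   # what `mkdir` would do
    notes = stage / "notes.txt"
    notes.write_text("hello\n")
    return notes.read_text()                   # what `cat` would print

# Demo on a local temporary directory. The same call would work on a
# mounted DFS path (assumed mount point, not verified here).
with tempfile.TemporaryDirectory() as tmp:
    local_result = write_and_read(Path(tmp))
```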
Advantages of Distributed File Systems
Compared with local storage, DFS offers two major benefits: the ability to handle massive data volumes (up to petabytes) with high throughput, and built‑in fault‑tolerance and redundancy so the system continues operating even if machines or disks fail.
DFS is widely used in many real‑world scenarios, including:
Parallel processing frameworks (e.g., Spark running on HDFS)
Machine‑learning training pipelines with data loaders and checkpoints
Large internal code and data repositories, such as those Google stores on Colossus
Travel‑industry applications
Photo‑storage services and similar businesses
Deep Dive into 3FS
DeepSeek’s open‑source 3FS implements a DFS with four main node types:
Meta – manages metadata such as file locations, attributes, and paths.
Mgmtd – controls cluster configuration, tracks active nodes, and maintains replication factors.
Storage – stores the actual file data on physical disks.
Client – communicates with all other nodes to read and modify the file system.
1. Mgmtd
Mgmtd tracks running nodes. Storage and meta nodes register at startup and send periodic heartbeats. It provides a centralized view of node health, allowing the system to detect failures quickly. Nodes discover each other by querying Mgmtd, which also stores the configuration of the replication chain (CRAQ).
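The heartbeat bookkeeping described above can be sketched roughly as follows. This is an illustration only, not 3FS code: the NodeRegistry class, its method names, and the 3-second timeout are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class NodeRegistry:
    """Toy stand-in for a manager that tracks node liveness via heartbeats."""
    timeout_s: float = 3.0                          # assumed value, for illustration
    last_seen: dict = field(default_factory=dict)   # node_id -> last heartbeat time

    def heartbeat(self, node_id: str, now: float) -> None:
        # Registration and periodic heartbeats both just refresh the timestamp.
        self.last_seen[node_id] = now

    def alive(self, now: float) -> list:
        # A node counts as alive if its last heartbeat is recent enough.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_s)

registry = NodeRegistry()
registry.heartbeat("storage-1", now=0.0)
registry.heartbeat("meta-1", now=0.0)
registry.heartbeat("storage-1", now=2.0)   # only storage-1 keeps reporting
live_nodes = registry.alive(now=4.0)
```

Passing the clock in as a parameter keeps the sketch deterministic; a real service would read a monotonic clock instead.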
2. Meta
The meta service handles typical file‑system operations (open, create, stat, unlink) via RPC. Metadata lives in inodes stored in FoundationDB, with DirEntry objects mapping paths to inodes. Inodes contain size, permissions, owner, timestamps, etc. Sessions are tracked so that open files can be recovered if a client disconnects.
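To make the inode and DirEntry relationship concrete, here is a small sketch that uses plain dictionaries in place of FoundationDB. The key layout, class names, and fields are assumptions chosen for illustration and do not reflect the actual 3FS schema.

```python
from dataclasses import dataclass

@dataclass
class Inode:
    inode_id: int
    is_dir: bool
    size: int = 0
    mode: int = 0o644

class MetaStore:
    """Toy metadata store: dicts stand in for a transactional key-value store."""
    ROOT = 0

    def __init__(self):
        self.inodes = {self.ROOT: Inode(self.ROOT, is_dir=True, mode=0o755)}
        self.dir_entries = {}    # (parent_inode_id, name) -> inode_id
        self._next_id = 1

    def create(self, parent_id: int, name: str, is_dir: bool = False) -> Inode:
        if (parent_id, name) in self.dir_entries:
            raise FileExistsError(name)
        inode = Inode(self._next_id, is_dir=is_dir)
        self._next_id += 1
        self.inodes[inode.inode_id] = inode
        self.dir_entries[(parent_id, name)] = inode.inode_id
        return inode

    def stat(self, path: str) -> Inode:
        # Walk the path one component at a time through the directory entries.
        current = self.ROOT
        for part in path.strip("/").split("/"):
            if part:
                current = self.dir_entries[(current, part)]  # KeyError if missing
        return self.inodes[current]

store = MetaStore()
stage = store.create(MetaStore.ROOT, "stage", is_dir=True)
notes = store.create(stage.inode_id, "notes.txt")
```

Keying directory entries by (parent inode, name) means a lookup touches one small record per path component, which is one common way to lay out file-system metadata in a key-value store.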
3. Storage
Storage nodes break data into chunks. Each Chunk represents a physical disk block and tracks its ID, size, offset, checksum, and version. Workers such as AllocateWorker, PunchHoleWorker, and AioReadWorker manage allocation, reclamation, and reads (the latter using io_uring).
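A rough sketch of the chunk bookkeeping, under simplifying assumptions: the Chunk fields mirror the list above, but offset here is the position within the file's data rather than on disk, CRC32 is only a stand-in checksum, and split_into_chunks is a hypothetical helper rather than a 3FS function.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    offset: int      # position of this chunk in the file's data (simplification)
    size: int
    checksum: int    # CRC32 of the chunk bytes (stand-in checksum)
    version: int

def split_into_chunks(file_id: str, data: bytes, chunk_size: int) -> list:
    """Cut data into fixed-size pieces and record metadata for each piece."""
    chunks = []
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        piece = data[offset:offset + chunk_size]
        chunks.append(Chunk(f"{file_id}-{index}", offset, len(piece),
                            zlib.crc32(piece), version=1))
    return chunks

chunks = split_into_chunks("notes", b"abcdefghij", chunk_size=4)
```

Storing a checksum per chunk lets a reader verify each piece independently instead of re-hashing the whole file.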
4. CRAQ (Chain Replication with Apportioned Queries)
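CRAQ provides strong consistency (linearizability). Writes travel from the head of the chain to the tail, and each replica marks the new version dirty as the write passes through. Once the tail commits the write, an acknowledgment travels back up the chain and each replica marks that version clean. Any replica can serve a read: it returns immediately if the object is clean; otherwise it asks the tail for the latest committed version.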
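The write and read paths just described can be sketched as a single-process toy model. It is not the 3FS implementation: the Chain and Replica classes are hypothetical, writes are assumed not to overlap, and a dirty read simply returns the tail's committed value rather than exchanging version numbers.

```python
class Replica:
    def __init__(self):
        self.clean = {}   # key -> (version, value), committed
        self.dirty = {}   # key -> (version, value), not yet acknowledged

class Chain:
    """Toy CRAQ chain; replicas[0] is the head, replicas[-1] is the tail."""

    def __init__(self, length: int = 3):
        self.replicas = [Replica() for _ in range(length)]
        self.version = 0

    def propagate(self, key, value) -> int:
        # Head-to-tail pass: every replica before the tail marks the write dirty.
        self.version += 1
        for replica in self.replicas[:-1]:
            replica.dirty[key] = (self.version, value)
        return self.version

    def commit(self, key, value, version: int) -> None:
        # The tail commits, then the acknowledgment flows back toward the head.
        self.replicas[-1].clean[key] = (version, value)
        for replica in reversed(self.replicas[:-1]):
            replica.dirty.pop(key, None)
            replica.clean[key] = (version, value)

    def write(self, key, value) -> None:
        self.commit(key, value, self.propagate(key, value))

    def read(self, key, replica_index: int):
        replica = self.replicas[replica_index]
        if key not in replica.dirty:
            entry = replica.clean.get(key)          # clean: answer locally
        else:
            entry = self.replicas[-1].clean.get(key)  # dirty: defer to the tail
        return entry[1] if entry else None

chain = Chain(length=3)
chain.write("k", "v1")
pending = chain.propagate("k", "v2")          # write in flight, not yet committed
read_while_dirty = chain.read("k", replica_index=0)
chain.commit("k", "v2", pending)
read_after_commit = chain.read("k", replica_index=1)
```

Splitting write into propagate and commit makes the in-flight state observable, which is where the clean/dirty distinction matters.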
CRAQ's performance varies with the workload. Write throughput and latency are limited by the slowest node in the chain. Read latency can increase under skewed (Zipfian) workloads: a few hot objects are written often, so they are frequently dirty, and many reads must contact the tail.
Using CRAQ in 3FS
In a typical 5‑node cluster with 5 SSDs per node, data is replicated to three nodes, with replicas placed to avoid overlap. If one node fails, the system loses only one‑third of its total throughput rather than a larger fraction. 3FS defaults to strongly consistent reads, with writes flowing from head to tail and acknowledgments propagating back.
Other Distributed File Systems
While many DFS share similar components (client, meta, storage, manager), they differ in workload suitability, tuning flexibility, deployment ease, throughput scaling, reliability, bottlenecks, fault‑tolerance algorithms, and hardware targeting.
The author plans to benchmark 3FS against single‑node systems and other DFS to evaluate performance, identify bottlenecks (CPU, memory, disk, network), and explore potential improvements.
