Analysis of DeepSeek 3FS Storage Service Architecture and Design
This article provides an in‑depth technical analysis of DeepSeek's open‑source 3FS distributed file system, focusing on the StorageService architecture, space pooling, allocation mechanisms, reference counting, fragmentation handling, and the RDMA‑based read/write data path.
Architecture and Overall Position
The diagram highlights StorageService as the core component of the analysis. ChunkStorage offers three basic functions: single‑node space pooling, RDMA‑based communication links for data‑plane I/O, and support for chained replication to enhance fault tolerance and data consistency. This article concentrates on the first two aspects.
Space Pooling
One‑sentence summary: ChunkStorage pools a node's storage space and exposes tunable trade‑offs between space utilization and performance.
Pooling Concept
In 3FS, a chunk is the unit that links a file's logical block addresses (LBAs) to pooled physical storage. Each file's LBA range maps to a sequence of chunks, each identified by a monotonically increasing {seq} . Pooling flattens the underlying storage so that clients can split files into chunks of adjustable size, letting StorageService minimize fragmentation while maximizing performance.
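To make the mapping concrete, here is a minimal sketch of how a logical file offset resolves to a chunk. The fixed per‑file chunk size and the `file_id`/`index`/`seq` identifier layout are illustrative assumptions, not the actual 3FS on‑disk encoding:

```python
def locate_chunk(file_offset: int, chunk_size: int) -> tuple[int, int]:
    """Map a logical file offset to (chunk_index, offset_within_chunk)."""
    return file_offset // chunk_size, file_offset % chunk_size

def chunk_id(file_id: int, index: int, seq: int) -> str:
    """Hypothetical chunk identifier: file id + chunk index + a
    monotonically increasing seq, bumped whenever the chunk is rewritten.
    The field widths here are illustrative only."""
    return f"{file_id:08x}-{index:06x}-{seq:04x}"

# A read at byte 5 MiB + 123 of a file with 1 MiB chunks lands in
# chunk 5, at offset 123 within that chunk.
idx, off = locate_chunk(5 * 1024 * 1024 + 123, 1024 * 1024)
```

Because the `seq` is part of the identifier, rewriting a chunk produces a new identity while the (file, index) coordinates stay stable for the metadata layer.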
Space Management
A physical disk maps one‑to‑many to logical StorageTarget s, each with a dedicated ChunkEngine handling space and I/O management. Treating a disk and all its associated targets as an abstract DiskUnit yields the high‑level relationship shown in the diagram.
Drilling down to ChunkEngine , the core includes an Allocator module for chunk space and the data‑plane interface. The discussion below focuses on the allocator logic.
The allocator interacts with two data types: user Data stored in the file system and metadata MetaData kept in RocksDB.
Space Allocation
The following diagram shows the overall flow for allocating chunks within the ChunkEngine .
The Allocator comprises 11 instances, one per power‑of‑two chunk size from 64 KB to 64 MB.
Each allocator reserves 256 files; each file groups 256 chunks, tracked by an allocation bitmap.
Group allocation order (illustrated) balances file size and ensures locality; the engine calls Allocator::allocate , which may invoke ChunkAllocator and GroupAllocator as needed.
Chunk allocation proceeds in three steps: (a) select the appropriate allocator based on the requested size; (b) pick the allocated group with the least remaining free space (or allocate a new group) and select a free chunk within it; (c) update the chunk_id → chunk mapping in the engine and persist the allocation info.
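The steps above can be sketched as follows. The class shapes, the 256‑chunk group bitmap, and the greedy "fill the least‑free group first" policy follow the description in the text; everything else (names, error handling) is an illustrative assumption:

```python
SIZE_CLASSES = [64 * 1024 << i for i in range(11)]  # 64 KiB .. 64 MiB, powers of two

class Group:
    CHUNKS_PER_GROUP = 256
    def __init__(self, gid: int):
        self.gid = gid
        self.bitmap = [False] * self.CHUNKS_PER_GROUP  # True = slot in use
    @property
    def free(self) -> int:
        return self.bitmap.count(False)
    def allocate(self) -> int:
        i = self.bitmap.index(False)  # first free slot
        self.bitmap[i] = True
        return i

class ChunkAllocator:
    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        self.groups: list[Group] = []
    def allocate(self) -> tuple[int, int]:
        # Greedily fill the group with the *least* free space so holes
        # stay concentrated and near-empty groups are easy to recycle.
        candidates = [g for g in self.groups if g.free > 0]
        if not candidates:
            g = Group(len(self.groups))  # stand-in for GroupAllocator
            self.groups.append(g)
        else:
            g = min(candidates, key=lambda g: g.free)
        return g.gid, g.allocate()

def select_allocator(allocators: list[ChunkAllocator], size: int) -> ChunkAllocator:
    """Step (a): smallest size class that fits the request."""
    for a in allocators:  # assumed sorted by ascending chunk_size
        if a.chunk_size >= size:
            return a
    raise ValueError("request exceeds the largest chunk size")
```

A 100 KB request, for example, lands in the 128 KB class and fills group 0 slot by slot until a new group is needed.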
Understanding
1. Steps (a) and (b) aim to reduce internal fragmentation, similar to OS slab allocation; (b) greedily fills holes within a group.
2. The strategy works best when the upper‑layer chunk slicing pattern aligns with the allocator; otherwise, on‑demand allocation may be used, albeit with potential performance spikes.
3. Reference counting ensures that meta‑service indexes remain stable even when underlying chunk locations change.
4. Each Target is a self‑contained resource with its own RocksDB instance.
Reference Counting
When a chunk is allocated and accessed, the engine calls Allocator::reference , incrementing the position's reference count in ChunkAllocator . When the chunk is no longer needed, Allocator::dereference decrements the count, and the chunk is freed once the count reaches zero.
If a Group becomes empty, GroupAllocator recycles it.
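A minimal sketch of this lifecycle, assuming chunks are addressed as (group, slot) pairs; the bookkeeping shapes and method names mirror `reference`/`dereference` from the text but are otherwise illustrative:

```python
class RefCountedChunks:
    """Illustrative reference counting over (group, slot) positions."""
    def __init__(self):
        self.refs: dict[tuple[int, int], int] = {}  # position -> count
        self.live_per_group: dict[int, int] = {}    # gid -> referenced slots
        self.recycled_groups: list[int] = []        # returned to GroupAllocator

    def reference(self, pos: tuple[int, int]) -> None:
        gid, _ = pos
        if pos not in self.refs:  # first reference makes the slot live
            self.live_per_group[gid] = self.live_per_group.get(gid, 0) + 1
        self.refs[pos] = self.refs.get(pos, 0) + 1

    def dereference(self, pos: tuple[int, int]) -> None:
        self.refs[pos] -= 1
        if self.refs[pos] == 0:   # last reference gone: free the chunk
            del self.refs[pos]
            gid, _ = pos
            self.live_per_group[gid] -= 1
            if self.live_per_group[gid] == 0:  # empty group -> recycle it
                del self.live_per_group[gid]
                self.recycled_groups.append(gid)
```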
Fragmentation Reclamation
When many chunks become obsolete (e.g., after deletions or overwrites), groups with a high free ratio are identified, their live chunks are migrated to other groups, and the emptied groups transition from a frozen back to an active state. This background consolidation reclaims space but incurs extra traffic.
When a group's reference count drops to zero, the group can be reclaimed via a "punch‑hole" operation.
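A simplified planner for this background task might look like the following. The 75% free‑ratio threshold, the 256‑chunk capacity, and the return shapes are assumptions for illustration, not 3FS's actual policy:

```python
def plan_compaction(groups: dict[int, set[int]],
                    free_threshold: float = 0.75,
                    capacity: int = 256):
    """Pick groups whose free ratio exceeds the threshold and plan
    migrations of their remaining live chunks. `groups` maps
    gid -> set of live slot indices. Returns (victims, moves), where
    moves are (gid, slot) pairs whose data must be relocated before
    the group can be punch-holed and reclaimed."""
    victims, moves = [], []
    for gid, live in groups.items():
        if (capacity - len(live)) / capacity > free_threshold:
            victims.append(gid)
            moves.extend((gid, slot) for slot in sorted(live))
    return victims, moves
```

The extra traffic the text mentions is exactly the `moves` list: every surviving chunk in a victim group costs one read and one write elsewhere.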
Data Plane
The data‑plane is fully RDMA‑based; reads and writes are performed via a custom net component.
Interfaces
Three main entry points handle I/O requests, all of which ultimately target a chunk via the ChunkEngine bound one‑to‑one with a Target . The StorageOperator resolves req → Target using the request's vChainId , which maps to a specific target based on the chained‑replication group.
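The routing step can be sketched as a table lookup. The `chain_table` structure and the head‑serves‑writes convention are assumptions for illustration; the real lookup also involves chain versions managed by the cluster manager:

```python
def resolve_target(chain_table: dict[str, list[str]], vchain_id: str):
    """Map a request's vChainId to its replication chain.
    `chain_table` is a hypothetical routing table: vChainId -> target
    ids in replication order (head first). Writes enter at the head;
    under chained replication, committed data can be read from the chain."""
    targets = chain_table[vchain_id]
    return targets[0], targets
```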
Data Transfer
All data‑plane communication uses RDMA. The read flow consists of sending a read request from StorageOperator , reading the target chunk via AioReadWorker , placing data into a buffer batch, and transmitting it to the client with net::RDMATransmission .
Write Flow Overview
Writes involve chain replication: the client sends a write request to the chain head, which propagates through the chain to the tail. The process includes client‑to‑operator submission, RDMA read of remote buffers, DIO write via ChunkEngine , forwarding the buffer to the next peer, and repeating the steps for subsequent peers.
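The propagation order can be sketched as follows: each node applies its local write, forwards to its successor, and acknowledges only after everything downstream has succeeded, so acks unwind from the tail. The in‑memory `store` dict stands in for the DIO write via ChunkEngine , and the direct function call stands in for the RDMA transfer between peers:

```python
def chain_write(chain: list[str], chunk_id: str, data: bytes, store: dict):
    """Propagate a write down the replication chain. `store` maps
    target -> {chunk_id: data}. Returns the ack order (tail first)."""
    acks = []
    def write_at(i: int) -> None:
        node = chain[i]
        store.setdefault(node, {})[chunk_id] = data  # local DIO write stand-in
        if i + 1 < len(chain):
            write_at(i + 1)   # forward the buffer to the next peer
        acks.append(node)     # ack only after downstream peers succeed
    write_at(0)
    return acks
```

This ordering is what gives chained replication its consistency property: by the time the head acknowledges to the client, every replica in the chain holds the data.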
Design Points
Zero‑Copy
3FS shares I/O buffers with network buffers (RDMABuf) to achieve zero‑copy, though a single copy occurs when moving from user‑space RDMABuf to kernel DIO buffers.
Copy‑On‑Write
Random writes trigger copy‑on‑write: a new chunk is allocated, the old chunk is read‑modified‑written, and metadata is copied, which can cause write amplification under heavy random‑write workloads.
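The read‑modify‑write step can be sketched in a few lines; the zero‑padding and in‑memory buffer are simplifications, and in practice the result lands in a newly allocated chunk whose `seq` is bumped before the metadata pointer is swapped:

```python
def cow_overwrite(old_chunk: bytes, offset: int, new_data: bytes,
                  chunk_size: int) -> bytes:
    """Read-modify-write into a freshly allocated chunk: copy the old
    contents, splice in the overwritten range, and return the new
    chunk image. The caller swaps the chunk mapping afterwards."""
    assert offset + len(new_data) <= chunk_size
    buf = bytearray(old_chunk.ljust(chunk_size, b"\0"))  # copy old chunk
    buf[offset:offset + len(new_data)] = new_data        # apply the write
    return bytes(buf)
```

Note that the whole chunk is rewritten no matter how small `new_data` is; this is the source of the write amplification mentioned above.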
Flow Control
Read requests carry only offset and length metadata; actual data transfer occurs after RDMA flow control. Write requests similarly perform flow control before issuing RDMA reads. Both rely on a request‑to‑send control mechanism.
Write and Space Amplification
Configurable chunk sizes per directory allow matching I/O patterns, mitigating write and space amplification. However, fragmentation and SSD internal write amplification still affect overall efficiency.
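The trade‑off is easy to quantify under a simplified model: with copy‑on‑write, a random overwrite rewrites the entire chunk regardless of the I/O size (ignoring metadata and SSD‑internal effects):

```python
def cow_write_amplification(chunk_size: int, io_size: int) -> float:
    """Bytes physically written per logical byte for a random overwrite
    under copy-on-write (simplified model)."""
    return chunk_size / io_size

# A 4 KiB random write into a 1 MiB chunk rewrites 256x the data,
# while a 64 KiB chunk keeps it at 16x -- hence per-directory chunk
# size tuning to match the workload's I/O pattern.
```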