
DeepSeek 3FS Network Communication Module: Design, Implementation, and Impact on AI Infrastructure

This article provides an in‑depth analysis of DeepSeek's open‑source 3FS distributed storage system, focusing on its network communication module, RDMA‑based design, core classes such as IBSocket, Listener, and IOWorker, and how these innovations advance high‑performance AI infrastructure.

AntData

On February 28, 2025, DeepSeek open-sourced Fire-Flyer 3FS, a distributed file system built for AI training and inference workloads. Based on DeepSeek's technical report and source code, this article analyzes the network communication module and its significance for AI infrastructure.

In the benchmark DeepSeek published, a 3FS cluster of 180 storage nodes, each equipped with two 200 Gbps InfiniBand NICs and sixteen 14 TiB NVMe SSDs, reached an aggregate read throughput of about 6.6 TiB/s. The system comprises four core components (Cluster Manager, Metadata Service, Storage Service, and Client), all interconnected via RDMA.
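To put the 6.6 TiB/s figure in context, the arithmetic below compares it with the raw NIC line rate of the storage nodes. It is a back-of-the-envelope sketch that uses only the numbers quoted above; it ignores protocol overhead and the client side entirely, and the function names are illustrative.

```cpp
// Figures quoted in the text; everything else is derived from them.
constexpr int kStorageNodes = 180;
constexpr int kNicsPerNode = 2;
constexpr double kNicGbps = 200.0;       // gigabits per second, per NIC
constexpr double kMeasuredTiBps = 6.6;   // reported aggregate read throughput

constexpr double kBytesPerTiB = 1099511627776.0;  // 2^40

// Aggregate NIC line rate of all storage nodes, in TiB/s.
constexpr double aggregateLineRateTiBps() {
  double bitsPerSec = kStorageNodes * kNicsPerNode * kNicGbps * 1e9;
  return bitsPerSec / 8.0 / kBytesPerTiB;
}

// Fraction of the storage-side line rate used by the measured throughput.
constexpr double utilization() {
  return kMeasuredTiBps / aggregateLineRateTiBps();
}
```

With these inputs the storage-side line rate works out to roughly 8.19 TiB/s, so the reported 6.6 TiB/s is about 81% of it.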

Communication Core Class Analysis

The network communication code resides in src/common/net, with related components spread across other directories such as src/core/storage. 3FS supports both RDMA and TCP, but this analysis concentrates on the RDMA path, which uses InfiniBand and also supports RoCEv2.

IBSocket Class

IBSocket handles socket-level logic and holds several internal structures and RDMA-related members. Its key methods are rdmaRead and rdmaWrite, which wrap RDMA operations into a batch (rdmaBatch) and post them asynchronously.

RDMA Device Management

The IBDevice class manages RDMA devices, performing device queries, opening, Protection Domain allocation, and attribute retrieval.

RDMA Connection Establishment

Clients initiate connections via IBSocket::connect, which creates a queue pair (QP) through the verbs API and then invokes IBConnect<>::connect(). Servers accept connections with IBSocket::accept, creating and initializing the QP before marking the socket READY.

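An optimization sets attr.sq_sig_all = 0 in IBSocket::qpCreate. With this setting, a completion queue entry (CQE) is generated only for work requests (WRs) that carry the IBV_SEND_SIGNALED flag, which reduces the number of CQEs the socket has to process.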
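The effect of selective signaling can be illustrated without real hardware. The sketch below models a chain of WRs and marks only some of them as signaled. FakeSendWr is a hypothetical stand-in for ibv_send_wr, and the policy of signaling every N-th WR plus the last one is an assumption made for illustration, not necessarily the policy 3FS uses.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ibv_send_wr; only the relevant field is modeled.
struct FakeSendWr {
  bool signaled = false;  // models IBV_SEND_SIGNALED being set in send_flags
};

// Build a chain of `count` WRs. Every `signalEvery`-th WR is signaled, and the
// last WR is always signaled so the sender learns when the chain has finished.
std::vector<FakeSendWr> buildChain(std::size_t count, std::size_t signalEvery) {
  std::vector<FakeSendWr> wrs(count);
  for (std::size_t i = 0; i < count; ++i) {
    bool last = (i + 1 == count);
    bool periodic = signalEvery != 0 && (i + 1) % signalEvery == 0;
    wrs[i].signaled = last || periodic;
  }
  return wrs;
}

// With sq_sig_all = 0, only signaled WRs produce a CQE.
std::size_t expectedCqes(const std::vector<FakeSendWr>& wrs) {
  std::size_t n = 0;
  for (const auto& wr : wrs) {
    if (wr.signaled) ++n;
  }
  return n;
}
```

For a chain of 8 WRs with signalEvery = 4, this model yields 2 CQEs instead of the 8 that sq_sig_all = 1 would produce.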

Listener Class

The Listener's setup() loops over all NICs, adds them to addressList_, obtains an EventBase thread from the pool, creates a socket using folly's blockingWait, and stores it in serverSockets_. Each NIC thus has a dedicated listening socket, and accepted connections are added to the corresponding IOWorker.

IOWorker Class

IOWorker processes all I/O tasks. Each ServiceGroup contains a Listener and an IOWorker; when a Listener accepts an RDMA connection, the associated IBSocket is inserted into the IOWorker, which then creates a Transport and registers it in the TransportPool.
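The hand-off from Listener to IOWorker to TransportPool can be sketched as plain ownership transfer. All types below are simplified stand-ins that only borrow the names used in the text; FakeSocket replaces IBSocket, and keying the pool by peer address is an assumption for illustration.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct FakeSocket { std::string peer; };  // stand-in for an accepted IBSocket

// A transport owns the socket it was created for.
struct Transport {
  explicit Transport(std::unique_ptr<FakeSocket> s) : socket(std::move(s)) {}
  std::unique_ptr<FakeSocket> socket;
};

class TransportPool {
 public:
  void add(std::shared_ptr<Transport> t) {
    std::string key = t->socket->peer;
    byPeer_[key] = std::move(t);
  }
  std::shared_ptr<Transport> find(const std::string& peer) const {
    auto it = byPeer_.find(peer);
    return it == byPeer_.end() ? nullptr : it->second;
  }
  std::size_t size() const { return byPeer_.size(); }

 private:
  std::map<std::string, std::shared_ptr<Transport>> byPeer_;
};

class IOWorker {
 public:
  // Called for each accepted connection: wrap it in a Transport and register it.
  void addSocket(std::unique_ptr<FakeSocket> s) {
    pool_.add(std::make_shared<Transport>(std::move(s)));
  }
  const TransportPool& pool() const { return pool_; }

 private:
  TransportPool pool_;
};

class Listener {
 public:
  explicit Listener(IOWorker& worker) : worker_(worker) {}
  // Models the accept path: the new socket goes straight to the IOWorker.
  void onAccept(std::string peer) {
    worker_.addSocket(std::make_unique<FakeSocket>(FakeSocket{std::move(peer)}));
  }

 private:
  IOWorker& worker_;
};

// One listener and one worker per group, as described in the text.
struct ServiceGroup {
  IOWorker worker;
  Listener listener{worker};
};
```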

EventLoop Class

EventLoop provides the main loop, notifying EventHandler callbacks when file descriptors become ready. Handlers such as Transport, IBDevice, IBSocket, and IBSocketManager inherit from EventLoop::EventHandler. The loop continuously calls handle_events after epoll_wait.
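The dispatch pattern described here is the classic epoll reactor. The sketch below is a minimal Linux-only version of it, not the 3FS implementation: MiniEventLoop, CounterHandler, and the handleEvents method name are illustrative, while epoll_create1, epoll_ctl, epoll_wait, and eventfd are the standard Linux calls.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#include <cstdint>
#include <stdexcept>

// Illustrative handler interface, loosely modeled on the description above.
struct EventHandler {
  virtual ~EventHandler() = default;
  virtual int fd() const = 0;
  virtual void handleEvents(std::uint32_t events) = 0;
};

class MiniEventLoop {
 public:
  MiniEventLoop() : epfd_(epoll_create1(0)) {
    if (epfd_ < 0) throw std::runtime_error("epoll_create1 failed");
  }
  ~MiniEventLoop() { close(epfd_); }
  MiniEventLoop(const MiniEventLoop&) = delete;
  MiniEventLoop& operator=(const MiniEventLoop&) = delete;

  // Register a handler; the pointer must outlive its registration.
  void add(EventHandler* handler, std::uint32_t interest) {
    epoll_event ev{};
    ev.events = interest;
    ev.data.ptr = handler;
    if (epoll_ctl(epfd_, EPOLL_CTL_ADD, handler->fd(), &ev) != 0)
      throw std::runtime_error("epoll_ctl failed");
  }

  // One loop iteration: wait, dispatch to ready handlers, return their count.
  int runOnce(int timeoutMs) {
    epoll_event ready[16];
    int n = epoll_wait(epfd_, ready, 16, timeoutMs);
    if (n < 0) return 0;  // e.g. EINTR; a real loop would inspect errno
    for (int i = 0; i < n; ++i)
      static_cast<EventHandler*>(ready[i].data.ptr)->handleEvents(ready[i].events);
    return n;
  }

 private:
  int epfd_;
};

// Example handler: accumulates wakeups delivered through an eventfd.
class CounterHandler : public EventHandler {
 public:
  CounterHandler() : fd_(eventfd(0, EFD_NONBLOCK)) {
    if (fd_ < 0) throw std::runtime_error("eventfd failed");
  }
  ~CounterHandler() override { close(fd_); }
  int fd() const override { return fd_; }
  void notify() {
    std::uint64_t one = 1;
    ssize_t n = write(fd_, &one, sizeof one);
    (void)n;
  }
  void handleEvents(std::uint32_t) override {
    std::uint64_t value = 0;
    if (read(fd_, &value, sizeof value) == static_cast<ssize_t>(sizeof value))
      total += value;
  }
  std::uint64_t total = 0;

 private:
  int fd_;
};
```

A production loop would call runOnce in a while loop until asked to stop, and would also handle deregistration and error events.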

Network Resource Management Module

RDMA memory must be pre-allocated and registered. 3FS uses a decentralized RDMABufPool that negotiates buffers on the fly, avoiding a central metadata service. This differs from Mooncake's TransferEngine, which requires the upper layer to register source and target buffers in advance.

Both client and server allocate buffers via RDMABufPool, which returns an RDMABuf wrapped in a CoTask for asynchronous execution.
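The core idea of a pre-allocated pool is that the expensive work (allocation and NIC registration) happens once, and handing out a buffer is then just a free-list operation. The sketch below is a synchronous, single-threaded illustration of that idea; MiniBufPool is a hypothetical class, it performs no real memory registration, and it does not model the coroutine-based allocation the text mentions.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

class MiniBufPool {
 public:
  // Handle that returns its buffer to the pool when destroyed.
  using Handle = std::unique_ptr<std::byte, std::function<void(std::byte*)>>;

  MiniBufPool(std::size_t bufSize, std::size_t count)
      : bufSize_(bufSize), storage_(bufSize * count) {
    // A real pool would register storage_ with the NIC once at this point
    // (for example with ibv_reg_mr); this sketch skips registration.
    for (std::size_t i = 0; i < count; ++i)
      free_.push_back(storage_.data() + i * bufSize_);
  }

  // Returns an empty handle when the pool is exhausted.
  // The pool must outlive every handle it gives out.
  Handle allocate() {
    if (free_.empty()) return Handle(nullptr, [](std::byte*) {});
    std::byte* p = free_.back();
    free_.pop_back();
    return Handle(p, [this](std::byte* q) { free_.push_back(q); });
  }

  std::size_t available() const { return free_.size(); }
  std::size_t bufSize() const { return bufSize_; }

 private:
  std::size_t bufSize_;
  std::vector<std::byte> storage_;
  std::vector<std::byte*> free_;
};
```

An asynchronous pool would suspend the caller instead of returning an empty handle when no buffer is free.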

RDMA I/O Path Flow

The RDMA I/O path consists of data-send preparation, data-receive handling, and verbs encapsulation. IBSocket::rdmaBatch() creates RDMAPostCtx objects, batches requests based on wrPerPost, and uses C++20 std::span to avoid copying request data.

The send side uses one-sided RDMA_WRITE / RDMA_READ operations. Completions are polled via IBSocket::cqPoll(). On error, the code only sets the socket state to Error and does not retry, which suggests room to improve robustness.
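The fail-fast behavior can be modeled as a small state transition. In the sketch below, FakeWc is a hypothetical stand-in for ibv_wc, and stopping at the first failed completion is an assumption made for illustration; the source only states that the socket moves to the Error state and that no retry is attempted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SocketState { Ready, Error };

// Hypothetical stand-in for ibv_wc: a work request id and a success flag.
struct FakeWc {
  std::uint64_t wrId;
  bool success;
};

struct PollResult {
  std::size_t completed = 0;
  SocketState state = SocketState::Ready;
};

// Process one batch of polled completions. The first failure flips the state
// to Error and ends processing; nothing is retried.
PollResult drainCompletions(const std::vector<FakeWc>& wcs, SocketState state) {
  PollResult result;
  result.state = state;
  for (const auto& wc : wcs) {
    if (result.state == SocketState::Error) break;
    if (!wc.success) {
      result.state = SocketState::Error;
      break;
    }
    ++result.completed;
  }
  return result;
}
```

A retry policy would slot in at the failure branch, for example by re-posting the failed WR a bounded number of times before giving up.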

Write Data I/O Flow

Writes follow a CRAQ chain-replication model: the first hop (Client → ChunkServer) uses ReliableUpdate, and subsequent hops use ReliableForwarding. The process involves an in-place read-modify-write where a pending version coexists with the committed version until acknowledgment.
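The coexistence of pending and committed versions can be made concrete with a toy model of a replication chain. This sketch is a simplification under stated assumptions: updates replace the whole chunk payload, only one update is in flight per chunk, and ChunkReplica, forwardUpdate, and acknowledge are illustrative names rather than 3FS code.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct ChunkReplica {
  std::uint64_t committedVer = 0;
  std::string committedData;
  std::optional<std::uint64_t> pendingVer;  // set while an update is in flight
  std::string pendingData;
};

// Forward phase, head to tail: each replica stages the update as pending.
// Returns false if an earlier update has not been acknowledged yet.
bool forwardUpdate(std::vector<ChunkReplica>& chain, const std::string& data) {
  for (const auto& r : chain) {
    if (r.pendingVer) return false;
  }
  for (auto& r : chain) {
    r.pendingVer = r.committedVer + 1;
    r.pendingData = data;
  }
  return true;
}

// Acknowledgment phase, tail to head: pending becomes committed.
void acknowledge(std::vector<ChunkReplica>& chain) {
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
    if (!it->pendingVer) continue;
    it->committedVer = *it->pendingVer;
    it->committedData = std::move(it->pendingData);
    it->pendingVer.reset();
    it->pendingData.clear();
  }
}
```

Between the two phases every replica holds both versions, which is what lets readers keep seeing the committed data until the tail has acknowledged the write.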

Potential optimizations include replacing libaio with user‑space SPDK for NVMe and exploring NVMe‑over‑Fabric, though architectural constraints of chain‑replication may limit benefits.

Read Data I/O Flow

Reads are batched via StorageOperator::batchRead. The client sends a read request that describes its registered buffer, and the server delivers the data into that buffer with an RDMA write. On the server, the request is managed by a BatchReadJob and processed by AioReadWorker. The read buffer is an RDMA buffer, and the job encodes offset, length, and key using the Serde service.

Folly Coroutine Usage in 3FS

3FS extensively employs the folly coroutine library. For example, IBSocket::rdmaBatch splits requests into batches, each represented by an RDMAPostCtx containing a folly::coro::Baton for synchronization. After posting a WR, the coroutine suspends until the coroutine that processes the completion queue posts the baton.

This stack‑less coroutine approach demands rigorous engineering discipline, as any I/O‑bound operation must be coroutine‑compatible, raising the bar for debugging and production‑grade reliability.

Conclusion

The analysis shows that 3FS’s network communication module incorporates numerous performance‑critical optimizations, from RDMA‑centric design and multi‑NIC parallelism to coroutine‑driven asynchronous processing. Its open‑source release provides a high‑quality reference for building high‑performance storage systems that serve large‑model AI workloads.
