How WFS Ultra Achieved 200 Gbps TCP Throughput, Surpassing RDMA‑Based 3FS
The article details how the WFS Ultra project re‑engineered a traditional TCP‑based distributed file system with a Run‑To‑Completion thread model, asynchronous ultra‑core networking, full‑link zero‑copy, and load‑adaptive prefetch, reaching 200 Gbps of fio throughput and exceeding the RDMA‑accelerated 3FS benchmark.
Performance Overview
Using a single client node (T0‑CM6AX) and six identical server nodes on a 200 Gbps TCP network, the WFS Ultra implementation saturated the NIC bandwidth under fio. DirectIO and BufferIO tests across multiple block sizes showed consistent throughput advantages over the RDMA‑enabled 3FS implementation.
Key Optimizations
Run‑To‑Completion Thread Model
Traditional libfuse threads block on RPC calls, requiring a large thread pool and incurring high context‑switch overhead. WFS Ultra replaces this synchronous model with an asynchronous ultra‑core network component. Threads issue requests without waiting and receive completions via callbacks, allowing continuous task processing.
Ultra‑core Asynchronous Network Component
The libfuse read interface is converted to non‑blocking calls. An ultra‑core callback mechanism decouples request issuance from result handling, so a thread can immediately start a new request after submitting the previous one. This eliminates idle time and reduces CPU waste.
Consistent Core‑Binding Strategy
All worker threads are pinned to specific CPU cores. Fixed affinity avoids the cache misses caused by thread migration and keeps each stage of the processing pipeline on the same core, so the pipeline behaves like a single coherent thread of execution and overall efficiency improves.
Full‑Link Zero‑Copy Architecture
Pluggable Read/Write Framework
Multiple data‑transfer mechanisms were evaluated. The final design uses splice on the client side and sendfile on the server side, delivering the highest read throughput.
Client‑Side Zero‑Copy
The original read path copied data five times: kernel socket → userspace receive buffer → deserialization → client cache → FUSE buffer → /dev/fuse. By pairing splice with a custom protocol, the path shrinks to kernel socket → splice → /dev/fuse, eliminating every userspace copy.
Server‑Side Zero‑Copy
The server originally performed three copies: disk (kernel page cache) → userspace → serialization → kernel socket. Switching to sendfile moves data directly from disk to NIC via DMA, removing all CPU‑bound copies.
Overall, eight CPU copies are eliminated (five on the client, three on the server), allowing the server to saturate 200 Gbps with only three CPU cores. Because sendfile mode cannot compute CRC checks inline, periodic integrity inspections are used to mitigate data‑corruption risk.
Load‑Adaptive Prefetch
Prefetch Strategy
When sequential reads are detected, a ReadAhead operation preloads upcoming data into memory. A memory pool supplies fixed‑size buffers to avoid allocation jitter, and fine‑grained streaming prefetch maximizes throughput while keeping memory usage constant.
Load‑Adaptive Control
In high‑concurrency scenarios, prefetch can add extra CPU copies, especially for BufferIO where page‑cache copies stack with prefetch copies. The system assigns higher weight to BufferIO in bandwidth‑limiting counters and automatically disables prefetch under heavy load, preserving throughput.
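The weighting idea can be sketched in a few lines (the weights, budget, and helper name are invented for illustration; the real counters are presumably per-interval and reset on a timer):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical weights: BufferIO stacks page-cache copies on top of
 * prefetch copies, so it counts more heavily against the budget. */
#define W_DIRECT_IO     1
#define W_BUFFER_IO     4
#define PREFETCH_BUDGET 1000  /* weighted units per accounting tick (sketch) */

static uint64_t weighted_load;

/* Account one request's bandwidth; returns whether prefetch should stay
 * enabled. Crossing the budget disables prefetch under heavy load. */
static bool account_and_check(bool buffer_io, uint64_t units) {
    weighted_load += units * (buffer_io ? W_BUFFER_IO : W_DIRECT_IO);
    return weighted_load <= PREFETCH_BUDGET;
}
```

The effect matches the article's description: under light load prefetch runs freely, while a burst of BufferIO traffic crosses the weighted threshold early and prefetch backs off before its extra copies eat into throughput.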
Optimization Scenarios
Model‑Load Warm‑Up Overlap
Applying the same techniques to AI model loading, a warm‑up overlap strategy preloads model parameters while the framework initializes, dramatically reducing startup time for models of various sizes.
Asynchronous Random‑Read Acceleration
In high‑concurrency or high‑latency network environments, the asynchronous redesign yields more than a tenfold improvement in random‑read performance compared with the conventional implementation.
Conclusion
WFS Ultra demonstrates that a Run‑To‑Completion thread model, full‑link zero‑copy, and load‑adaptive prefetch enable a traditional TCP network to match or exceed the performance of the RDMA‑based 3FS, fully utilizing a 200 Gbps NIC without additional hardware. RDMA still offers advantages for latency‑sensitive KVCache workloads and ultra‑high‑throughput NICs, but the TCP‑centric approach provides broader applicability.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.