How WFS Ultra Achieved 200 Gbps TCP Throughput, Surpassing RDMA‑Based 3FS
The article details how the WFS Ultra project re‑engineered a traditional TCP‑based distributed file system with a Run‑To‑Completion thread model, asynchronous ultra‑core networking, full‑link zero‑copy, and load‑adaptive prefetch, reaching 200 Gbps of fio throughput and exceeding the RDMA‑accelerated 3FS benchmark.
Performance Overview
Using a single client node (T0‑CM6AX) and six identical server nodes on a 200 Gbps TCP network, the WFS Ultra implementation saturated the NIC bandwidth under fio. DirectIO and BufferIO tests across multiple block sizes showed consistent throughput advantages over the RDMA‑enabled 3FS implementation.
Key Optimizations
Run‑To‑Completion Thread Model
Traditional libfuse threads block on RPC calls, requiring a large thread pool and incurring high context‑switch overhead. WFS Ultra replaces this synchronous model with an asynchronous ultra‑core network component. Threads issue requests without waiting and receive completions via callbacks, allowing continuous task processing.
Ultra‑core Asynchronous Network Component
The libfuse read interface is converted to non‑blocking calls. An ultra‑core callback mechanism decouples request issuance from result handling, so a thread can immediately start a new request after submitting the previous one. This eliminates idle time and reduces CPU waste.
Consistent Core‑Binding Strategy
All worker threads are pinned to specific CPU cores. Fixed affinity avoids the cache misses caused by thread migration and keeps each stage of the processing pipeline on the same core, so the pipeline behaves like a single coherent thread of execution and overall efficiency improves.
Full‑Link Zero‑Copy Architecture
Pluggable Read/Write Framework
Multiple data‑transfer mechanisms were evaluated. The final design uses splice on the client side and sendfile on the server side, delivering the highest read throughput.
Client‑Side Zero‑Copy
The original read path copied data five times: kernel socket → userspace receive buffer → deserialization → client cache → FUSE buffer → /dev/fuse. By pairing splice with a custom protocol, the path shrinks to kernel socket → splice → /dev/fuse, eliminating every userspace copy.
Server‑Side Zero‑Copy
The server originally performed three copies: disk (kernel page cache) → userspace → serialization → kernel socket. Switching to sendfile moves data directly from disk to NIC via DMA, removing all CPU‑bound copies.
Overall, eight CPU copies are eliminated (five on the client, three on the server), allowing the server to saturate 200 Gbps with only three CPU cores. Because sendfile mode cannot compute CRC checks inline, periodic integrity inspections are used to mitigate data‑corruption risk.
Load‑Adaptive Prefetch
Prefetch Strategy
When sequential reads are detected, a ReadAhead operation preloads upcoming data into memory. A memory pool supplies fixed‑size buffers to avoid allocation jitter, and fine‑grained streaming prefetch maximizes throughput while keeping memory usage constant.
Load‑Adaptive Control
In high‑concurrency scenarios, prefetch can add extra CPU copies, especially for BufferIO where page‑cache copies stack with prefetch copies. The system assigns higher weight to BufferIO in bandwidth‑limiting counters and automatically disables prefetch under heavy load, preserving throughput.
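The weighting idea can be sketched in a few lines (the weights, budget, and helper name are invented for illustration; the real counters are presumably per-interval and reset on a timer):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical weights: BufferIO stacks page-cache copies on top of
 * prefetch copies, so it counts more heavily against the budget. */
#define W_DIRECT_IO     1
#define W_BUFFER_IO     4
#define PREFETCH_BUDGET 1000  /* weighted units per accounting tick (sketch) */

static uint64_t weighted_load;

/* Account one request's bandwidth; returns whether prefetch should stay
 * enabled. Crossing the budget disables prefetch under heavy load. */
static bool account_and_check(bool buffer_io, uint64_t units) {
    weighted_load += units * (buffer_io ? W_BUFFER_IO : W_DIRECT_IO);
    return weighted_load <= PREFETCH_BUDGET;
}
```

The effect matches the article's description: under light load prefetch runs freely, while a burst of BufferIO traffic crosses the weighted threshold early and prefetch backs off before its extra copies eat into throughput.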
Optimization Scenarios
Model‑Load Warm‑Up Overlap
Applying the same techniques to AI model loading, a warm‑up overlap strategy preloads model parameters while the framework initializes, dramatically reducing startup time for models of various sizes.
Asynchronous Random‑Read Acceleration
In high‑concurrency or high‑latency network environments, the asynchronous redesign yields more than a tenfold improvement in random‑read performance compared with the conventional implementation.
Conclusion
WFS Ultra demonstrates that a Run‑To‑Completion thread model, full‑link zero‑copy, and load‑adaptive prefetch enable a traditional TCP network to match or exceed the performance of the RDMA‑based 3FS, fully utilizing a 200 Gbps NIC without additional hardware. RDMA still offers advantages for latency‑sensitive KVCache workloads and ultra‑high‑throughput NICs, but the TCP‑centric approach provides broader applicability.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.