Why TCP Needs a Rethink: RDMA Insights and 800 Gbps Experiments
The talk examines the challenges of using the standard Linux TCP stack for high‑performance data‑center workloads, explores how RDMA provides zero‑copy, asynchronous kernel bypass, and presents experimental results from an FPGA‑based prototype that approaches 800 Gbps packet rates while exposing congestion‑control and CPU‑utilization trade‑offs.
The presenters, Shrijeet Mukherjee and David Ahern, begin by framing the discussion around the growing divergence between hardware and software in data‑center networking, noting that AI workloads such as ChatGPT and the accelerators that serve them are reshaping communication patterns and load requirements.
They question whether TCP has fundamental limitations that require a radical redesign, comparing it with RDMA‑based solutions such as RoCE and iWARP. While RoCE satisfies many high‑performance needs, the speakers identify open problems, such as the lack of out‑of‑order delivery and selective acknowledgments, that still hinder efficiency.
Moving from custom RDMA networks to more general‑purpose applications, they argue that TCP’s familiarity and debugging experience are valuable, yet its socket API and synchronous system‑call model impose significant overhead, especially for message‑oriented workloads.
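To make that cost concrete, here is a minimal sketch (not from the talk) of the synchronous pattern they critique: every message pays a pair of blocking system calls plus a kernel/user copy in each direction.

```c
/* Minimal sketch (not from the talk): the classic synchronous socket
 * pattern the speakers critique. Every 4 KB message pays two blocking
 * system calls plus a kernel<->user copy in each direction. */
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

enum { MSG_SZ = 4096 };

ssize_t exchange_one(int fd, const uint8_t *req, uint8_t *resp)
{
    /* Blocking send: one syscall + a copy into the kernel socket buffer. */
    if (send(fd, req, MSG_SZ, 0) != MSG_SZ)
        return -1;
    /* Blocking receive: another syscall + a copy back out. The calling
     * thread stalls here instead of posting more work. */
    return recv(fd, resp, MSG_SZ, MSG_WAITALL);
}
```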
The talk surveys prior attempts to improve TCP performance: iWARP (which relies on hardware TCP engines), user‑space stacks such as OpenOnload, user‑space TCP stacks built on DPDK, and newer kernel mechanisms such as io_uring and AF_XDP. Each approach either deviates from the standard Linux stack or lacks broad deployment.
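As a reference point for the io_uring mechanism mentioned above, a minimal sketch using liburing (assumed available) shows how a single submission syscall can cover a whole batch of sends:

```c
/* Sketch of the io_uring path, using liburing (assumed available).
 * One submission batches many sends, amortizing syscall cost. */
#include <liburing.h>
#include <sys/uio.h>

int send_batch(int fd, struct iovec *iov, unsigned n)
{
    struct io_uring ring;

    if (io_uring_queue_init(n, &ring, 0) < 0)
        return -1;

    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        /* Queue one send per message; nothing is issued yet. */
        io_uring_prep_send(sqe, fd, iov[i].iov_base, iov[i].iov_len, 0);
    }
    /* A single syscall submits the whole batch. */
    io_uring_submit(&ring);

    for (unsigned i = 0; i < n; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);  /* reap one completion */
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    return 0;
}
```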
They then describe their own FPGA‑based test platform: a standard Ryzen motherboard hosting an FPGA card alongside two GPUs, providing 100 Gb Ethernet over a Gen 3 ×16 PCIe link but limited to roughly 40 Gb/s on the GPU path. The FPGA implements hardware GRO/TSO and pairs with a modified driver, enabling experiments across a wide range of MTU sizes.
Performance measurements show that throughput is poor without GRO and improves only modestly with software GRO. Hardware GRO combined with the modified driver achieves near‑800 Gbps packet rates, with the largest gains at smaller MTU sizes. The results also show that the socket API adds considerable overhead, whereas native SACK/ACK processing remains efficient.
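For readers reproducing the software‑GRO comparison, stock Linux exposes the toggle via ethtool; the small sketch below uses the legacy ethtool ioctl (generic kernel plumbing, not the authors' driver) to query the current state:

```c
/* Query whether GRO is enabled on an interface via the legacy
 * ethtool ioctl -- generic Linux plumbing, not the talk's driver. */
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    struct ethtool_value ev = { .cmd = ETHTOOL_GGRO };
    struct ifreq ifr = { 0 };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (argc < 2 || fd < 0)
        return 1;
    strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&ev;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("%s: gro %s\n", ifr.ifr_name, ev.data ? "on" : "off");
    return 0;
}
```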
They note that the Linux kernel continues to evolve, with frequent performance‑focused commits and congestion‑control algorithms such as BBR improving fairness and buffer management. In machine‑learning and big‑data contexts, high link utilization and dynamic congestion control (for example, algorithms loaded via eBPF) become critical.
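Per‑socket selection of an algorithm such as BBR is already a standard‑stack knob; a minimal sketch, assuming the tcp_bbr module is loaded:

```c
/* Sketch: opting one socket into BBR congestion control, assuming
 * the tcp_bbr module is loaded on the system. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

int use_bbr(int fd)
{
    const char cc[] = "bbr";
    /* Fails with ENOENT if the algorithm is not available. */
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));
}
```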
The presenters discuss their hybrid design: they retain most existing Linux stack components, add a standard InfiniBand provider, and introduce a user‑space driver that submits scatter‑gather lists directly to the kernel’s TCP stack. On the receive side, packets are placed into pre‑registered buffers, enabling near‑zero‑copy transfers.
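The closest upstream analogue to that scatter‑gather submission path is sendmsg() with an iovec plus MSG_ZEROCOPY; the sketch below only illustrates the idea and is not the authors' actual interface:

```c
/* Rough upstream analogue of scatter-gather, copy-avoiding sends:
 * sendmsg() with an iovec and MSG_ZEROCOPY. The authors' user-space
 * driver uses its own interface; this only illustrates the idea. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60            /* value from asm-generic headers */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000    /* value from linux/socket.h */
#endif

ssize_t send_sgl(int fd, struct iovec *sgl, size_t n)
{
    int one = 1;
    struct msghdr msg = { .msg_iov = sgl, .msg_iovlen = n };

    /* Opt in once per socket; pages are then pinned, not copied. */
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

    /* Buffers must stay stable until the kernel signals completion
     * on the socket's error queue (MSG_ERRQUEUE, not reaped here). */
    return sendmsg(fd, &msg, MSG_ZEROCOPY);
}
```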
To address the inefficiency of placing messages on a TCP byte stream, they propose a small TCP option header resembling an InfiniBand BTH. This header carries operation‑stage and message‑identification fields, allowing hardware to map packets to queues without complex software parsing. While middleboxes may strip or mishandle an unknown option, it works for their controlled data‑center use case.
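The talk does not publish the exact wire format; a purely hypothetical encoding, following TCP's standard kind/length option framing and borrowing BTH‑style fields, might look like this:

```c
/* HYPOTHETICAL layout -- the talk does not publish the exact format.
 * A TCP option uses the usual kind/length framing; the payload here
 * borrows BTH-style fields so hardware can steer a packet to a queue
 * without parsing the byte stream. Fits easily in the 40-byte option
 * budget. */
#include <stdint.h>

#define TCPOPT_EXP 254            /* experimental option kind, RFC 6994 */

struct bth_like_tcp_opt {         /* name is ours, for illustration */
    uint8_t  kind;                /* TCPOPT_EXP */
    uint8_t  len;                 /* total option length in bytes */
    uint16_t exid;                /* RFC 6994 experiment identifier */
    uint8_t  op_stage;            /* operation stage (first/middle/last) */
    uint8_t  rsvd;
    uint16_t queue;               /* receive queue / flow hint */
    uint32_t msg_id;              /* message identification */
} __attribute__((packed));        /* 12 bytes on the wire */
```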
Experimental data from the prototype shows latencies of a few hundred microseconds under iPerf and the ability to send three million 4 KB packets per second, effectively saturating a 100 Gb link. Compared with a Mellanox ConnectX‑5 RoCE NIC, their solution lags in raw latency but matches or exceeds throughput for larger packet sizes (4 KB, 16 KB, 64 KB), especially when transferring data to GPU memory.
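As a sanity check on the saturation claim (taking 4 KB as 4096 bytes of payload and ignoring protocol overhead):

$$
3\times10^{6}\,\tfrac{\text{pkt}}{\text{s}} \times 4096\,\tfrac{\text{B}}{\text{pkt}} \times 8\,\tfrac{\text{bit}}{\text{B}} \approx 98.3\ \text{Gbps}
$$

which is effectively line rate on a 100 Gb link once headers are counted.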
Further tests using stress‑ng to induce DRAM contention demonstrate that CPU‑to‑CPU paths suffer performance drops while CPU‑to‑GPU paths remain stable, highlighting the benefit of isolating the TCP stack from competing memory accesses.
In conclusion, by layering a lightweight RDMA‑style interface atop the Linux TCP stack and isolating the stack from application memory, they achieve significant performance gains without abandoning the familiar TCP ecosystem.