
Why RDMA Is Replacing TCP/IP for AI and High‑Performance Storage

This article examines how AI workloads and high‑performance SSD storage demand microsecond‑scale network latency, how that demand exposes the context‑switch, memory‑copy, and CPU overhead inherent in TCP/IP, and why RDMA’s kernel‑bypass, zero‑copy design and roughly 1 µs latency make it the preferred transport for modern data‑center workloads, despite the challenges of deploying it over Ethernet.

IT Architects Alliance

Background

The rapid rise of AI, the deployment of deep‑learning server clusters, and high‑performance storage media such as SSDs all impose stricter latency requirements on data‑center networks. Traditional TCP/IP stacks struggle to meet these ultra‑low‑latency needs.

Limitations of TCP/IP

TCP/IP processing introduces tens of microseconds of fixed latency because each packet incurs multiple kernel context switches (≈5‑10 µs each) and at least three memory copies, plus CPU‑bound protocol encapsulation. This overhead becomes a dominant bottleneck in microsecond‑scale AI computation and distributed SSD storage.
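
As a rough illustration of where "tens of microseconds" comes from, the sketch below adds up the fixed software costs of one message. The 5 to 10 µs per context switch and the three copies are the figures quoted above; the number of context switches (4) and the per-copy cost (1 µs) are assumptions chosen only for illustration, not measurements.

```python
# Back-of-the-envelope model of the fixed latency of a kernel TCP/IP path.
# The 5-10 us context-switch cost and the 3 copies come from the article;
# CONTEXT_SWITCHES and COPY_COST_US are illustrative assumptions.

def fixed_latency_us(context_switches, switch_cost_us, copies, copy_cost_us):
    """Sum the fixed software costs of moving one message through the stack."""
    return context_switches * switch_cost_us + copies * copy_cost_us

CONTEXT_SWITCHES = 4   # assumption: "multiple" switches per send/receive pair
COPIES = 3             # article: at least three memory copies
COPY_COST_US = 1.0     # assumption: cost of copying one small buffer

low = fixed_latency_us(CONTEXT_SWITCHES, 5.0, COPIES, COPY_COST_US)
high = fixed_latency_us(CONTEXT_SWITCHES, 10.0, COPIES, COPY_COST_US)
print(f"fixed latency: {low:.0f}-{high:.0f} us")
```

Under these assumptions the context switches alone account for most of the budget, which is why bypassing the kernel matters more than shaving protocol processing.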

Beyond fixed latency, the TCP/IP stack forces the host CPU to repeatedly participate in memory copying. As network bandwidth scales (e.g., >25 Gbps), the CPU can spend more than half of its capacity merely moving data, leaving little for actual computation.
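
The copy burden can be estimated with simple arithmetic. The sketch below assumes that every byte arriving at line rate is copied three times (the count quoted earlier) and that each copy reads the data once and writes it once; real stacks vary, so treat the output as an order-of-magnitude estimate only.

```python
# Memory traffic caused by copying received data, as a function of link speed.
# Assumption: each copy costs one read plus one write of every payload byte.

def copy_traffic_gbytes_per_s(link_gbps, copies):
    """Return the memory traffic (GB/s) generated by `copies` copies at line rate."""
    payload_gbytes_per_s = link_gbps / 8      # bits per second to bytes per second
    return payload_gbytes_per_s * copies * 2  # one read and one write per copy

for link_gbps in (10, 25, 100):
    traffic = copy_traffic_gbytes_per_s(link_gbps, copies=3)
    print(f"{link_gbps:>3} Gbps link -> {traffic:.2f} GB/s of copy traffic")
```

The traffic grows linearly with link speed while per-core copy bandwidth does not, which is the scaling problem the paragraph above describes.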

Advantages of RDMA

RDMA bypasses the kernel, allowing applications to read/write directly to the NIC, reducing end‑to‑end transmission latency to around 1 µs. Its zero‑copy mechanism lets the receiver fetch data directly from the sender’s memory, dramatically lowering CPU load and improving overall efficiency.

Industry tests from a major internet provider reportedly show that RDMA can boost computational efficiency by 6‑8×, and the roughly 1 µs transmission latency lets distributed SSD storage drop from millisecond‑level to microsecond‑level delays. RDMA is also one of the transports defined by the NVMe over Fabrics specification.

Current RDMA Transport Options

Two main RDMA deployment models exist today:

InfiniBand: A dedicated fabric, specified by the InfiniBand Trade Association but in practice supplied by very few vendors. It does not interoperate natively with existing IP Ethernet networks and carries a risk of vendor lock‑in.

RDMA over Converged Ethernet (RoCE): Leverages standard IP Ethernet but lacks robust loss‑recovery mechanisms; even a 2% packet loss can collapse throughput to nearly zero, so loss rates must stay below 0.001%.
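
To see why a simple loss-recovery scheme is so sensitive to packet loss, the sketch below evaluates the textbook efficiency formula for go-back-N retransmission, in which one lost packet forces the whole in-flight window to be resent. Using go-back-N as the recovery model and a window of 1000 packets are both assumptions made for illustration; the model is idealized and is not meant to reproduce the exact 2% and 0.001% figures quoted above.

```python
# Idealized go-back-N model: efficiency = (1 - p) / (1 + (W - 1) * p),
# where p is the packet loss rate and W the number of packets resent per loss.
# Both the recovery scheme and W = 1000 are illustrative assumptions.

def go_back_n_efficiency(loss_rate, window_packets):
    """Fraction of link capacity that carries useful (non-retransmitted) data."""
    p = loss_rate
    return (1 - p) / (1 + (window_packets - 1) * p)

WINDOW_PACKETS = 1000  # assumption: packets in flight when a loss is detected

for loss_rate in (0.00001, 0.0001, 0.001, 0.02):
    efficiency = go_back_n_efficiency(loss_rate, WINDOW_PACKETS)
    print(f"loss {loss_rate:.3%} -> useful throughput {efficiency:.1%}")
```

Even in this simplified model, useful throughput falls off steeply once the loss rate multiplied by the window size approaches one, which is consistent with the very low loss targets operators set for RoCE.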

Challenges with RDMA over Ethernet

To avoid packet loss, many vendors enable PFC (Priority Flow Control) and ECN (Explicit Congestion Notification). However, PFC can cause queue buildup and network deadlocks, while ECN’s back‑pressure only reduces sending rate without improving throughput.
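
The rate-reduction behavior of ECN-based schemes can be sketched with DCQCN, a congestion-control algorithm commonly paired with RoCE. The code below is a heavily simplified sender-side model based on the published description of DCQCN: it covers only the multiplicative decrease on a congestion notification packet (CNP), the decay of the congestion estimate, and the first "fast recovery" step. Byte counters, additive and hyper increase, and real timer values are omitted, and the class and method names are hypothetical.

```python
# Simplified DCQCN-style sender. Not a NIC implementation: timers are replaced
# by explicit method calls, and only cut, decay and fast recovery are modeled.

class DcqcnSender:
    def __init__(self, line_rate_gbps, g=1 / 256):
        self.current_rate = line_rate_gbps  # rate the sender is using now
        self.target_rate = line_rate_gbps   # rate to recover toward
        self.alpha = 1.0                    # running estimate of congestion
        self.g = g                          # gain of the alpha update

    def on_cnp(self):
        """Congestion notification received: remember the rate, then cut it."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP seen during the last interval: let the estimate decay."""
        self.alpha = (1 - self.g) * self.alpha

    def on_rate_timer(self):
        """Fast recovery: move halfway back toward the remembered rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()         # congestion signaled: the rate is cut
sender.on_rate_timer()  # quiet period: the rate recovers part of the way
print(f"rate after one cut and one recovery step: {sender.current_rate} Gbps")
```

The sketch makes the limitation visible: the only action available to the sender is to slow down and then probe upward again, so the mechanism protects the fabric from loss without adding any capacity.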

Impact of Distributed Architectures

Distributed compute frameworks (for example, MapReduce) generate two problematic traffic patterns:

Incast: Many‑to‑one traffic during the Reduce phase creates sudden bursts that exceed the receiver’s capacity, causing congestion and packet loss.

Large‑packet exchanges: As model sizes grow (e.g., gigabyte‑scale tensors), inter‑node messages become very large, further aggravating congestion.

Both patterns increase network latency and amplify the need for a zero‑loss, low‑latency fabric.
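
The incast pattern can be quantified with a minimal buffer model. The sketch assumes that all senders start a burst at the same instant, that every link runs at the same speed, and that the switch has a fixed buffer dedicated to the receiver's port; the function names, burst size, and buffer size are illustrative assumptions.

```python
# Minimal incast model: N equal-speed senders burst into one egress port.
# While the bursts arrive, the port drains one burst's worth of data, so the
# queue peaks at (N - 1) bursts. Anything beyond the buffer is dropped.

def incast_peak_queue_bytes(senders, burst_bytes):
    """Peak queue depth at the receiver-facing port during a synchronized burst."""
    return (senders - 1) * burst_bytes

def incast_dropped_bytes(senders, burst_bytes, buffer_bytes):
    """Bytes that do not fit in the port buffer and are lost."""
    return max(0, incast_peak_queue_bytes(senders, burst_bytes) - buffer_bytes)

BURST_BYTES = 64 * 1024      # assumption: 64 KiB burst per sender
BUFFER_BYTES = 1024 * 1024   # assumption: 1 MiB of buffer for the port

for senders in (8, 16, 32):
    dropped = incast_dropped_bytes(senders, BURST_BYTES, BUFFER_BYTES)
    print(f"{senders:>2} senders -> {dropped} bytes dropped")
```

In this model loss appears abruptly once the number of synchronized senders crosses a threshold, which matches the sudden-burst behavior described for the Reduce phase.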

Latency Components

Network latency consists of static and dynamic parts. Static latency (serialization, device forwarding, optical conversion) is typically in the nanosecond‑to‑microsecond range and accounts for <1% of total delay. Dynamic latency (queueing and retransmission caused by congestion and loss) dominates (>99%) and can reach sub‑second levels under heavy load.
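
A short calculation shows how the two parts compare on a single hop. The sketch models only serialization delay for the static part and only queueing delay for the dynamic part; the packet size (1500 bytes), link speed (100 Gbps), and queue depth (1 MiB) are assumptions picked for illustration.

```python
# Static versus dynamic delay on one hop.
# 1 Gbps equals 1 bit per nanosecond, so bits / Gbps gives nanoseconds.

def serialization_delay_ns(packet_bytes, link_gbps):
    """Time to clock one packet onto the wire."""
    return packet_bytes * 8 / link_gbps

def queueing_delay_ns(queued_bytes, link_gbps):
    """Time a newly arrived packet waits behind data already in the queue."""
    return queued_bytes * 8 / link_gbps

static_ns = serialization_delay_ns(1500, 100)      # assumption: 1500 B at 100 Gbps
dynamic_ns = queueing_delay_ns(1024 * 1024, 100)   # assumption: 1 MiB already queued
static_share = static_ns / (static_ns + dynamic_ns)

print(f"static: {static_ns:.0f} ns, dynamic: {dynamic_ns / 1000:.1f} us")
print(f"static share of total delay: {static_share:.2%}")
```

With a congested queue the static part falls well below 1% of the total, in line with the split described above; retransmission after loss would widen the gap further.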

In distributed workloads, the overall job completion time is dictated by the slowest flow; any congested flow inflates the total latency.
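
The effect of a single slow flow is easy to demonstrate. In the sketch below the per-flow completion times are made-up values: four flows finish in about 12 µs and one is delayed by congestion.

```python
import statistics

# Hypothetical per-flow completion times in microseconds for one job step.
flow_times_us = [12.0, 11.5, 13.0, 12.2, 480.0]

# The step finishes only when its slowest flow finishes.
job_completion_us = max(flow_times_us)
typical_flow_us = statistics.median(flow_times_us)

print(f"typical flow: {typical_flow_us} us, job completion: {job_completion_us} us")
```

The median flow time says little about the job; the maximum is what the application observes, so tail latency is the figure a fabric has to control.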

Future Directions: AI Fabric

Huawei’s AI Fabric claims to achieve “zero packet loss, ultra‑low latency, and maximum throughput” simultaneously by employing a proprietary congestion‑control algorithm that avoids the complex parameter tuning required by standard DCQCN.

Achieving all three goals remains challenging because they are inter‑dependent; however, the industry trend is clear: next‑generation data‑center networks must prioritize zero loss, sub‑microsecond latency, and high throughput to meet AI‑driven workloads.
