Why RDMA Is Replacing TCP/IP for AI and High‑Performance Storage
The article analyzes how the AI boom and high‑performance SSD storage demand microsecond‑level network latency, exposing TCP/IP’s inherent context‑switch and CPU overhead, and explains why RDMA’s kernel‑bypass, zero‑copy design and roughly 1 µs latency make it the preferred network stack for modern data‑center workloads despite the challenges of deploying it over Ethernet.
Background
With the rapid rise of AI and the deployment of deep‑learning server clusters, high‑performance storage media such as SSDs impose stricter latency requirements on data‑center networks. Traditional TCP/IP stacks can no longer satisfy these ultra‑low‑latency needs.
Limitations of TCP/IP
TCP/IP processing introduces tens of microseconds of fixed latency because each packet incurs multiple kernel context switches (≈5‑10 µs each) and at least three memory copies, plus CPU‑bound protocol encapsulation. This overhead becomes a dominant bottleneck in microsecond‑scale AI computation and distributed SSD storage.
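As a rough back‑of‑the‑envelope check on the "tens of microseconds" figure, the sketch below adds up the per‑packet costs. Only the 5‑10 µs context‑switch cost and the three‑copy count come from the text; the number of switches per packet and the per‑copy cost are illustrative assumptions.

```python
# Rough per-packet fixed-latency budget for a kernel TCP/IP path.
# Only the 5-10 us context-switch cost and the "three copies" figure come
# from the text; the other numbers are illustrative assumptions.

CONTEXT_SWITCH_US = (5.0, 10.0)   # (low, high) cost per switch, from the text
SWITCHES_PER_PACKET = 2           # assumption: one on send, one on receive
COPIES_PER_PACKET = 3             # "at least three memory copies"
COPY_COST_US = 0.5                # assumption: cost of one small-buffer copy


def fixed_latency_us(switch_cost_us: float) -> float:
    """Sum context-switch and memory-copy cost for one packet."""
    return (SWITCHES_PER_PACKET * switch_cost_us
            + COPIES_PER_PACKET * COPY_COST_US)


low = fixed_latency_us(CONTEXT_SWITCH_US[0])
high = fixed_latency_us(CONTEXT_SWITCH_US[1])
print(f"estimated fixed latency: {low:.1f}-{high:.1f} us per packet")
```

Even with these conservative assumptions, the software path alone costs an order of magnitude more than the roughly 1 µs RDMA figure discussed later.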
Beyond fixed latency, the TCP/IP stack forces the host CPU to repeatedly participate in memory copying. As network bandwidth scales (e.g., >25 Gbps), the CPU can spend more than half of its capacity merely moving data, leaving little for actual computation.
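The copy overhead can be estimated with simple arithmetic. In the sketch below, the three copies per payload come from the text, while the sustained per‑core copy bandwidth of 10 GB/s is an assumed round number, not a measurement.

```python
# Estimate how many CPU cores are kept busy just copying received data.
# The copy count (3) is from the text; the per-core copy bandwidth is an
# assumed round number chosen for illustration.

def copy_load_cores(link_gbps: float,
                    copies: int = 3,
                    core_copy_gbytes_per_s: float = 10.0) -> float:
    """Return the number of cores' worth of copy bandwidth consumed."""
    payload_gbytes_per_s = link_gbps / 8.0        # bits -> bytes
    return payload_gbytes_per_s * copies / core_copy_gbytes_per_s


for gbps in (10, 25, 100):
    print(f"{gbps:>3} Gbps -> {copy_load_cores(gbps):.2f} cores spent copying")
```

Under these assumptions a 25 Gbps link already consumes most of one core, and the cost grows linearly with link speed.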
Advantages of RDMA
RDMA bypasses the kernel: applications post read and write requests directly to the NIC from user space, and the NIC moves data to and from registered application memory, reducing end‑to‑end transmission latency to around 1 µs. Its zero‑copy mechanism lets the receiver fetch data directly from the sender’s memory without intermediate buffer copies, dramatically lowering CPU load and improving overall efficiency.
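The zero‑copy idea can be illustrated by analogy inside a single process. The snippet below is not RDMA and uses no RDMA library; it only contrasts duplicating a buffer with handing out a view onto the same memory, using Python's built‑in `memoryview`.

```python
# Analogy only: copy vs. zero-copy access to the same buffer in one process.
payload = bytearray(b"x" * 1024)

copied = bytes(payload)        # allocates a new buffer and copies 1024 bytes
shared = memoryview(payload)   # no copy: a view onto the same memory

payload[0] = ord("y")          # mutate the original buffer

print(copied[0] == ord("x"))   # the copy did not see the change
print(shared[0] == ord("y"))   # the view did, because nothing was duplicated
```

With RDMA the same principle applies across machines: the NIC reads or writes the application's buffer in place instead of staging it through kernel buffers.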
Industry tests from a major internet provider reportedly show that RDMA can boost computational efficiency by 6‑8×. The roughly 1 µs transmission latency also lets distributed SSD storage move from millisecond‑level to microsecond‑level delays, which is why RDMA is one of the standard transports defined for NVMe over Fabrics.
Current RDMA Transport Options
Two main RDMA deployment models exist today:
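InfiniBand : A dedicated fabric with its own switches and adapters. Although the specification is published by an industry association, the ecosystem is dominated by very few suppliers, so it cannot interoperate directly with existing IP Ethernet networks and risks vendor lock‑in.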
RDMA over Converged Ethernet (RoCE) : Leverages standard IP Ethernet but lacks robust loss‑recovery mechanisms; even a 2% packet loss can collapse effective throughput to nearly zero, so loss rates must stay below 0.001%.
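Why a small loss rate is so damaging can be seen in a toy simulation. The sketch below assumes a simple go‑back‑N retransmission scheme, in which everything sent after a lost packet is discarded and resent; that scheme, the window size, and the packet count are assumptions for illustration and are not stated in the text. The model also ignores retransmission timeouts, so it understates the real penalty.

```python
import random


def go_back_n_efficiency(loss_rate, window=1024, packets=20000, seed=1):
    """Fraction of transmitted packets that the receiver actually accepts.

    Toy model: the sender pushes a full window; the receiver accepts packets
    in order until the first loss and discards the rest of that window.
    """
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    delivered = 0
    transmitted = 0
    while delivered < packets:
        transmitted += window      # sender pushes a full window
        for _ in range(window):
            if rng.random() < loss_rate:
                break              # loss: the rest of this window is wasted
            delivered += 1
    return delivered / transmitted


for loss in (0.0, 0.00001, 0.001, 0.02):
    print(f"loss {loss:.3%}: efficiency {go_back_n_efficiency(loss):.3f}")
```

In this model the efficiency stays close to 1 only when the loss rate is far below one packet per window, which is consistent with the very low loss target quoted above.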
Challenges with RDMA over Ethernet
To avoid packet loss, many vendors enable PFC (Priority Flow Control) and ECN (Explicit Congestion Notification). However, PFC can cause queue buildup and network deadlocks, while ECN’s back‑pressure only reduces sending rate without improving throughput.
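The effect of ECN on a sender can be sketched with a simplified rate limiter. The update rules below are loosely modeled on published descriptions of DCQCN; the class name, the gain `g`, and the starting values are illustrative assumptions, and real NIC implementations use timers and additional recovery stages omitted here.

```python
class EcnRateLimiter:
    """Simplified sender-side reaction to ECN congestion notifications."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rate = line_rate_gbps     # current sending rate
        self.target = line_rate_gbps   # rate to recover toward
        self.alpha = 1.0               # running estimate of congestion
        self.g = g                     # assumed smoothing gain

    def on_congestion_notification(self) -> None:
        """Cut the rate in proportion to the congestion estimate."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No notification seen: decay alpha and recover halfway to target."""
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.rate + self.target) / 2


limiter = EcnRateLimiter(line_rate_gbps=100.0)
limiter.on_congestion_notification()
print(f"after notification: {limiter.rate:.1f} Gbps")
limiter.on_quiet_period()
print(f"after quiet period: {limiter.rate:.1f} Gbps")
```

The limiter only ever slows the sender down or lets it climb back toward its previous rate, which is why ECN on its own protects the queue rather than raising throughput.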
Impact of Distributed Architectures
Distributed compute frameworks (for example, MapReduce‑style jobs) generate two problematic traffic patterns:
Incast : Many-to-one traffic during the Reduce phase creates sudden bursts that exceed receiver capacity, causing congestion and packet loss.
Large‑message exchanges : As model sizes grow (e.g., gigabyte‑scale tensors), inter‑node messages become very large, further aggravating congestion.
Both patterns increase network latency and amplify the need for a zero‑loss, low‑latency fabric.
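The incast pattern above can be quantified with a small calculation. The sketch assumes that all links run at the same speed, that the senders' bursts are perfectly synchronized, and that the receiver's switch port has a fixed buffer; the sender counts, burst size, and buffer size are made‑up example values.

```python
# Incast sketch: N senders burst toward one receiver port at the same time.
# While the bursts arrive, the port drains roughly one burst's worth of data,
# so about (N - 1) bursts must be buffered. Anything beyond the buffer is lost.

def incast_overflow_bytes(senders: int, burst_bytes: int,
                          buffer_bytes: int) -> int:
    """Return how many bytes exceed the port buffer (0 if it all fits)."""
    excess = (senders - 1) * burst_bytes
    return max(0, excess - buffer_bytes)


BURST = 256 * 1024            # assumed 256 KiB burst per sender
BUFFER = 12 * 1024 * 1024     # assumed 12 MiB of buffer for the port

for n in (16, 32, 64, 128):
    lost = incast_overflow_bytes(n, BURST, BUFFER)
    print(f"{n:>3} senders -> {lost / 1024 / 1024:.2f} MiB over the buffer")
```

Once the fan‑in crosses the point where the buffer is exhausted, every additional sender adds a full burst of overflow, which a lossless fabric has to absorb through flow control instead of drops.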
Latency Components
Network latency consists of static and dynamic parts. Static latency (serialization, device forwarding, optical conversion) is typically in the nanosecond‑to‑microsecond range and accounts for <1% of total delay. Dynamic latency, meaning queueing and retransmission caused by congestion and loss, dominates (>99%) and can reach sub‑second levels under heavy load.
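A quick calculation shows how lopsided the two components are. The frame size, link speed, and queue depth below are example values chosen for illustration, and forwarding and optical delays are left out for simplicity.

```python
# Compare static (serialization) delay with dynamic (queueing) delay.
# All input values are illustrative examples.

def transmit_time_us(num_bytes: int, link_gbps: float) -> float:
    """Time in microseconds to put num_bytes on a link of link_gbps."""
    return num_bytes * 8 / (link_gbps * 1e3)


LINK_GBPS = 25.0
static_us = transmit_time_us(1500, LINK_GBPS)        # one 1500-byte frame
queue_us = transmit_time_us(1_000_000, LINK_GBPS)    # 1 MB already queued ahead

share = static_us / (static_us + queue_us)
print(f"serialization: {static_us:.2f} us")
print(f"queueing:      {queue_us:.2f} us")
print(f"static share:  {share:.2%}")
```

With only 1 MB queued ahead of a frame, serialization is already well under 1% of the total, in line with the split described above.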
In distributed workloads, the overall job completion time is dictated by the slowest flow; any congested flow inflates the total latency.
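The slowest‑flow effect is easy to see numerically. The flow times below are invented example values: fifteen flows finish quickly and one is delayed by congestion.

```python
# Job completion time is set by the slowest flow, not the average flow.
# Flow times are invented example values in microseconds.

def job_completion_us(flow_times_us):
    """A synchronized step finishes only when its slowest flow finishes."""
    return max(flow_times_us)


flows = [12.0] * 15 + [950.0]     # one flow hit a congested queue
mean_us = sum(flows) / len(flows)

print(f"mean flow time:  {mean_us:.1f} us")
print(f"job completion:  {job_completion_us(flows):.1f} us")
```

The average looks healthy, yet the job still waits for the single congested flow, which is why tail latency matters more than mean latency in these workloads.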
Future Directions: AI Fabric
Huawei’s AI Fabric claims to achieve “zero packet loss, ultra‑low latency, and maximum throughput” simultaneously by employing a proprietary congestion‑control algorithm that avoids the complex parameter tuning required by standard DCQCN (Data Center Quantized Congestion Notification).
Achieving all three goals remains challenging because they are inter‑dependent; however, the industry trend is clear: next‑generation data‑center networks must prioritize zero loss, microsecond‑level latency, and high throughput to meet AI‑driven workloads.
IT Architects Alliance
Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
