
A Decade of RDMA: Lessons Learned from Protocol Evolution

The article reviews ten years of RDMA development, tracing its origins, the rise and pitfalls of RoCEv1/v2, alternative approaches like iWARP and Cisco usNIC, and recent modernizations such as AWS SRD, Google Falcon and UltraEthernet, highlighting why protocol design choices have repeatedly stalled industry progress.


My first exposure to RDMA came in 2014, while testing a low-latency trading network for the Zhengzhou Commodity Futures Exchange built on Cisco's Userspace NIC (usNIC); RoCEv2 was released that same year.

1. Early RDMA and InfiniBand

RDMA's roots go back to 1995, when the kernel-bypass problem was first tackled. Werner Vogels' thesis Scalable Cluster Technologies for Mission-Critical Enterprise Computing [3] and the Cornell slide deck High Performance Networking: U-Net and FaRM [4] document the early U-Net architecture, which exposed the overhead of in-kernel packet processing. In 1997 the Virtual Interface Architecture (VIA) combined U-Net with a Remote DMA service.

InfiniBand emerged in the early 2000s as a response to PCI-X bandwidth limits, born from the merger of Future I/O (Compaq, IBM, HP) and NGIO (Intel, Sun, Dell). The InfiniBand Trade Association (IBTA) aimed to replace host-internal I/O buses, Ethernet, Fibre Channel and other cluster interconnects.

2. RDMA over Ethernet

2.1 iWARP

Starting in 2002, the IETF defined iWARP (Internet Wide Area RDMA Protocol) to carry RDMA over the Internet. TCP was chosen as the transport to provide ordering, retransmission and congestion control. Direct Data Placement (DDP), which also opened the door to multipath support, appeared in 2002 and was later examined in a 2007 SC07 paper by Panda [12].

2.2 RoCEv1

In 2010 RoCEv1 combined InfiniBand transport semantics with raw Ethernet link-layer frames, omitting the IP header entirely, an echo of U-Net's thin-encapsulation approach that confined RoCEv1 traffic to a single Layer 2 domain.

2.3 RoCEv2

After four years of further effort, RoCEv2 added IP/UDP headers in 2014, making RDMA traffic routable, and adopted Priority Flow Control (PFC) to present a lossless Ethernet fabric underneath. The design kept Go-Back-N retransmission, a scheme rooted in lossless host-internal interconnects, a choice that later proved problematic for hyperscale deployments.

2.4 Cisco usNIC

Cisco released usNIC before RoCEv2 appeared. It ran over UDP with unreliable-datagram semantics and implemented its own sliding-window/ACK/retransmission scheme in userspace. Measured latency was 1.57 µs (≈2 µs end-to-end) when paired with a Cisco Nexus 3548 switch, which itself contributes 190 ns. Technical details are in the Cisco blog "HPC in L3" [7] and a Lawrence Berkeley National Laboratory talk [8].
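usNIC's exact implementation is not public, so the following is only a minimal Python sketch of the general idea, a sliding-window sender with cumulative ACKs built in userspace over an unreliable datagram socket; every name in it is invented for illustration:

class SlidingWindowSender:
    """Sketch: reliability implemented in userspace over unreliable
    datagrams, in the spirit of what usNIC did over UDP."""

    def __init__(self, window: int) -> None:
        self.window = window    # max packets allowed in flight
        self.base = 0           # oldest unacknowledged sequence number
        self.next_seq = 0       # next sequence number to send

    def can_send(self) -> bool:
        return self.next_seq < self.base + self.window

    def send(self) -> int:
        assert self.can_send()
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return seq              # hand packet `seq` to the UDP socket

    def on_cumulative_ack(self, acked_through: int) -> None:
        self.base = max(self.base, acked_through + 1)  # slide the window

    def on_timeout(self) -> None:
        self.next_seq = self.base  # resend everything still unacknowledged

tx = SlidingWindowSender(window=4)
while tx.can_send():
    tx.send()                # fills the window with packets 0-3
tx.on_cumulative_ack(1)      # ACK through 1 opens two new slots
print(tx.can_send())         # -> True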

3. Ten Years of RoCEv2 Issues

Authors from Microsoft, Google and the Cray (HPE) Slingshot team catalogued RoCEv2's problems in Datacenter Ethernet and RDMA: Issues at Hyperscale [9]. The main issues are:

Go-Back-N retransmission on Ethernet, which is acutely sensitive to packet loss because a single drop forces the entire in-flight window to be resent (see the sketch after this list).

Reliance on PFC, whose hop-by-hop backpressure spreads congestion across the fabric (head-of-line blocking) and reduces workload efficiency.

Inability to support multipath load balancing, because strict in-order delivery requirements pin each flow to a single path.
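To make the first point concrete, here is a tiny illustrative Python sketch (not RoCEv2 code; the function is invented) of the cost Go-Back-N pays for a single drop:

def go_back_n_resend_count(window: int, lost_offset: int) -> int:
    """Packets resent when the packet at `lost_offset` (0 = oldest in
    flight) is dropped: the lost packet plus every later one, even
    those that arrived intact, because the receiver keeps no gap state."""
    assert 0 <= lost_offset < window
    return window - lost_offset

for window in (8, 64, 256):
    print(window, "in flight ->", go_back_n_resend_count(window, 0), "resent")

With 64 packets in flight, one early drop resends all 64, so even sub-percent loss rates destroy goodput; this is exactly why RoCEv2 leans on PFC to make loss rare, which in turn causes the second problem above.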

3.1 IRN (2018)

The paper Revisiting Network Support for RDMA [10] introduced selective retransmission in place of Go-Back-N and dropped the hard requirement for PFC, but it did not adopt Cisco's sliding-window congestion control, leaving new rate-based congestion-control challenges open.
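By contrast with Go-Back-N, a selective-retransmission receiver keeps per-packet state so that only actual gaps are resent. A minimal sketch in the spirit of IRN (illustrative only, not the paper's implementation):

class SelectiveReceiver:
    """Receiver-side bookkeeping for selective retransmission."""

    def __init__(self) -> None:
        self.received: set[int] = set()  # out-of-order packets on hand
        self.next_expected = 0           # cumulative-ACK point

    def on_packet(self, seq: int) -> None:
        self.received.add(seq)                       # accept out of order
        while self.next_expected in self.received:   # advance the ACK point
            self.received.remove(self.next_expected)
            self.next_expected += 1

    def missing(self, highest_sent: int) -> list[int]:
        """Sequences the sender should retransmit: the gaps only."""
        return [s for s in range(self.next_expected, highest_sent + 1)
                if s not in self.received]

rx = SelectiveReceiver()
for seq in [0, 1, 3, 4, 5]:    # packet 2 was dropped in the network
    rx.on_packet(seq)
print(rx.missing(5))            # -> [2]: one resend instead of four

The price is per-packet tracking state and out-of-order placement at the receiver, which is what frees the design from both Go-Back-N and mandatory PFC.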

3.2 Modernization Efforts

AWS SRD – a cloud-optimized transport that applies a dynamic per-connection rate limit together with a cap on bytes in flight, similar in spirit to BBR but with datacenter-aware multipath handling. Congestion is inferred from RTT spikes observed across most paths; a spike on an individual path instead triggers independent rerouting of that path. Details are in A Cloud-Optimized Transport Protocol for Elastic and Scalable HPC [13].
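The SRD paper describes the mechanism rather than the code, so the sketch below is only a loose Python illustration under stated assumptions (the class name, the bits-to-bytes token bucket and the 0.5/1.01 adjustment factors are all invented here): a dynamic per-connection rate limit combined with an in-flight byte cap, backing off on a connection-wide RTT spike.

import time

class SrdLikeSender:
    """Sketch: rate-limited sender with an in-flight byte cap."""

    def __init__(self, rate_bps: float, inflight_cap: int) -> None:
        self.rate_bps = rate_bps          # dynamic per-connection rate
        self.inflight_cap = inflight_cap  # max bytes sent but unacked
        self.inflight = 0
        self.credit = 0.0                 # token-bucket credit, in bytes
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        earned = (now - self.last_refill) * self.rate_bps / 8
        self.credit = min(self.inflight_cap, self.credit + earned)
        self.last_refill = now

    def can_send(self, nbytes: int) -> bool:
        self._refill()
        return (self.inflight + nbytes <= self.inflight_cap
                and self.credit >= nbytes)

    def on_send(self, nbytes: int) -> None:
        self.inflight += nbytes
        self.credit -= nbytes

    def on_ack(self, nbytes: int, rtt_spike_on_most_paths: bool) -> None:
        self.inflight -= nbytes
        if rtt_spike_on_most_paths:
            self.rate_bps *= 0.5   # spikes on most paths = real congestion
        else:
            self.rate_bps *= 1.01  # otherwise probe gently upward

A spike on a single path would instead trigger rerouting of that path alone, which is the multipath half of the design and is omitted here.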

Google Falcon – decouples transport, encryption and RDMA semantics, exposing separate PUSH/PULL primitives and using a rate-update engine for congestion control. Its multipath support (the PLB algorithm) remains limited for large-model training workloads.
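Falcon's internals are only partially public, so the fragment below merely illustrates the decoupling idea described above, mapping RDMA verbs onto generic PUSH/PULL transactions; every name in it is invented for illustration:

from dataclasses import dataclass

@dataclass
class Transaction:
    kind: str    # "PUSH": initiator supplies data; "PULL": target supplies it
    target: str
    nbytes: int

def rdma_write(target: str, nbytes: int) -> Transaction:
    # An RDMA WRITE maps to a PUSH: payload flows initiator -> target.
    return Transaction("PUSH", target, nbytes)

def rdma_read(target: str, nbytes: int) -> Transaction:
    # An RDMA READ maps to a PULL: payload flows target -> initiator.
    return Transaction("PULL", target, nbytes)

# The transport beneath sees only PUSH/PULL transactions plus pacing
# decisions from a rate-update engine; RDMA ordering rules stay in the
# upper layer, so transport and semantics can evolve independently.
print(rdma_write("hostB", 4096), rdma_read("hostB", 4096))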

UltraEthernet – follows the "Smart Edge, Dumb Core" principle and focuses on switch-level random packet spray, but without a clear story for feeding congestion signals into the spray decision.
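For intuition, here is per-packet spray next to classic per-flow ECMP in a toy Python sketch (path names are placeholders); the open question the text points at is how congestion signals should bias the random choice:

import random

PATHS = ["spine0", "spine1", "spine2", "spine3"]

def pick_path_per_flow(flow_id: int) -> str:
    # Per-flow ECMP: a whole flow hashes onto one path, so two large
    # flows can collide on a link while other links sit idle.
    return PATHS[hash(flow_id) % len(PATHS)]

def pick_path_spray() -> str:
    # Random packet spray: near-uniform link utilization, but packets
    # of one flow arrive out of order and the endpoints must cope.
    return random.choice(PATHS)

print([pick_path_per_flow(7) for _ in range(4)])  # same path four times
print([pick_path_spray() for _ in range(4)])      # fans out across spines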

Alibaba Solar RDMA – a compute‑to‑storage protocol for storage‑centric QP demands, described in From Luna to Solar [14].

4. Core Technical Takeaways

Historical analysis shows a progression from RoCEv1's non-routable, Ethernet-only design, through RoCEv2's PFC-driven lossless approach, to loss-tolerant designs such as usNIC and SRD. Modern protocols avoid the legacy mismatch of forcing intra-host semantics (memory operations, low latency, strict ordering) onto inter-host communication, which is message-oriented, higher-latency and naturally unordered.

Rigid, lossless designs (PFC) create scalability and latency problems; flexible, loss‑aware congestion control is preferable.

Removing unnecessary complexity (e.g., Go‑Back‑N, mandatory PFC) yields more robust protocols.

Edge intelligence (e.g., sliding‑window CC in usNIC, per‑connection rate limiting in SRD) combined with a simple core improves performance at hyperscale.

Understanding these technical lessons helps avoid repeating a decade of design missteps in future AI‑cluster networking.


Tags: First Principles, RDMA, Protocol Design, RoCE, Data Center Networking, iWARP, AI Accelerators
Written by Linux Code Review Hub

A professional Linux technology community and learning platform covering the kernel, memory management, process management, file systems and I/O, performance tuning, device drivers, virtualization, and cloud computing.
