Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Performance Evaluation
This article examines the RoCE protocol, its evolution and variants, lossless networking and congestion‑control mechanisms, practical performance measurements on HPC clusters, and the advantages and limitations of deploying RoCE in high‑performance computing environments.
Abstract
RoCE (RDMA over Converged Ethernet) enables remote memory access over Ethernet, dramatically reducing latency and improving bandwidth utilization compared with traditional TCP/IP. This article discusses the author's perspective on applying RoCE to HPC.
HPC Network Evolution and the Birth of RoCE
Early HPC systems used custom networks such as Myrinet, Quadrics, and InfiniBand to achieve higher bandwidth, lower latency, and better congestion control than Ethernet. In 2010 the IBTA released RoCE, followed by RoCEv2 in 2014, allowing Ethernet‑based high‑performance solutions to re‑enter the TOP500 rankings.
RoCE Protocol Overview
RoCE offloads packet processing to the NIC, avoiding kernel‑mode transitions and reducing copy overhead, which lowers latency and CPU usage. Two versions exist: RoCE v1 (link‑layer, confined to a single L2 domain) and RoCE v2 (network‑layer, routable across L3).
RoCE v1
Retains IB interfaces while replacing the link and physical layers with Ethernet. RoCE v1 packets use Ethertype 0x8915 and lack an IP header, preventing routing beyond the L2 domain.
RoCE v2
Replaces the IB network layer with IP and UDP headers carried over Ethernet, using the DSCP and ECN fields of the IP header for congestion signaling, which makes the packets routable across L3. In practice, references to “RoCE” usually mean RoCE v2.
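The encapsulation can be made concrete with a small sketch. The layering below follows the RoCE v2 design described above: the InfiniBand transport (Base Transport Header plus payload) rides inside UDP/IP over ordinary Ethernet, and the IANA-assigned UDP destination port 4791 is how receivers recognize RoCE v2 traffic.

```python
# Sketch of RoCE v2 on-the-wire layering, outermost first. The BTH field
# summary is indicative, not an exhaustive header description.
ROCEV2_LAYERS = [
    ("Ethernet MAC header", "L2 framing"),
    ("IP header", "DSCP and ECN bits used for congestion signaling"),
    ("UDP header", "destination port 4791 identifies RoCEv2"),
    ("IB BTH", "Base Transport Header: opcode, destination QP, PSN"),
    ("RDMA payload", "application data"),
]

for layer, role in ROCEV2_LAYERS:
    print(f"{layer}: {role}")
```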
Lossless Network and RoCE Congestion Control
RoCE requires a lossless network: packets must arrive in order and without loss, otherwise the transport falls back to go‑back‑N retransmission. Congestion control operates in two stages: DCQCN (Data Center Quantized Congestion Notification) for end‑to‑end rate reduction, and PFC (Priority Flow Control) for hop‑by‑hop, pause‑frame‑based flow control.
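A minimal sketch (not the NIC implementation) shows why loss is so costly under go‑back‑N: one dropped packet forces the sender to rewind and resend every packet from the loss onward, because the receiver discards all out‑of‑order arrivals.

```python
def gbn_transmissions(num_packets, drop_index):
    """Total packets put on the wire for one message when the packet at
    `drop_index` is lost once, assuming the whole message was already in
    flight before the loss is detected (a simplifying assumption)."""
    first_pass = num_packets            # everything sent once
    resend = num_packets - drop_index   # rewind to the lost packet
    return first_pass + resend

# Losing the first packet of a 1000-packet message doubles the wire
# traffic; losing the last costs only one extra transmission.
print(gbn_transmissions(1000, 0), gbn_transmissions(1000, 999))  # prints "2000 1001"
```

This is exactly why RoCE leans on ECN and PFC to avoid drops in the first place rather than recovering from them.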
When a switch’s output buffer exceeds a threshold, ECN bits are set; the receiver returns a CNP (Congestion Notification Packet) prompting the sender to slow down. ECN marking follows a probabilistic scheme between Kmin and Kmax. If congestion worsens, the switch sends PFC pause frames upstream, halting traffic on the affected priority until congestion eases.
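The probabilistic marking between Kmin and Kmax follows a RED‑style linear ramp: nothing is marked below Kmin, the marking probability rises linearly to Pmax at Kmax, and every packet is marked above Kmax. The threshold values below are illustrative, not vendor defaults.

```python
def ecn_mark_probability(queue_kb, kmin=100, kmax=400, pmax=0.2):
    """Probability that a departing packet gets its ECN bits set,
    as a function of the switch output-queue depth (in KB)."""
    if queue_kb <= kmin:
        return 0.0          # no congestion signal yet
    if queue_kb >= kmax:
        return 1.0          # saturated: mark every packet
    # linear ramp from 0 at kmin up to pmax at kmax
    return pmax * (queue_kb - kmin) / (kmax - kmin)

for q in (50, 250, 500):
    print(q, round(ecn_mark_probability(q), 3))
```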
RoCE and Soft‑RoCE
When NICs lack hardware RoCE support, the open‑source Soft‑RoCE project (initiated by IBM and Mellanox, and merged into the mainline Linux kernel as the rdma_rxe driver) provides a software implementation, allowing nodes without RoCE‑capable NICs to interoperate with RoCE‑enabled nodes, albeit without the performance benefits of hardware offload.
Applying RoCE to HPC: Issues
Core HPC Network Requirements
Two essential requirements are (1) ultra‑low latency and (2) sustained low latency under rapidly changing traffic patterns. RoCE addresses the first by offloading work to the NIC, but its congestion‑control mechanisms struggle with the second.
Latency Measurements
Using OSU Micro‑Benchmarks, the latency of 25 Gb Ethernet RoCE (Mellanox ConnectX‑4 Lx) was compared with 100 Gb InfiniBand EDR (Mellanox ConnectX‑4). RoCE achieved roughly 5× lower latency than TCP, but its latency remained 47–63% higher than InfiniBand's.
Official switch latency data show Ethernet switches (e.g., SN2410) around 300 ns, while IB switches (e.g., SB7800) are about 90 ns, indicating a persistent gap.
Packet Overhead
Encapsulating 1 byte of payload over RoCE incurs 58 bytes of overhead (Ethernet MAC + CRC, IP, UDP, IB BTH). By contrast, native InfiniBand adds only 26 bytes, highlighting the extra burden of Ethernet headers.
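The two overhead figures can be checked by summing the individual header sizes. The RoCE v2 breakdown follows the header stack named above; the native InfiniBand breakdown (LRH, BTH, ICRC, VCRC) is the standard local‑route header set consistent with the 26‑byte figure.

```python
# Per-packet header overhead, in bytes, for a minimal RDMA payload.
ROCEV2 = {"Eth MAC": 14, "IPv4": 20, "UDP": 8, "IB BTH": 12, "Eth CRC": 4}
NATIVE_IB = {"LRH": 8, "BTH": 12, "ICRC": 4, "VCRC": 2}

print(sum(ROCEV2.values()))     # 58 bytes of overhead per RoCEv2 packet
print(sum(NATIVE_IB.values()))  # 26 bytes for native InfiniBand
```

For small messages (common in HPC collectives), this more than doubles the per‑packet header tax relative to native IB.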
Congestion‑Control Limitations
PFC’s pause‑frame approach can under‑utilize buffers, especially on low‑latency switches, while DCQCN’s fixed reduction/increase formulas lack the flexibility of IB’s customizable algorithms.
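DCQCN's "fixed formulas" can be made concrete with a sketch of the sender‑side updates, following the rate‑decrease and fast‑recovery rules described in the DCQCN paper listed in the references ("Congestion Control for Large‑Scale RDMA Deployments"). The gain g and the starting values here are illustrative, and real NICs add further recovery stages this sketch omits.

```python
G = 1.0 / 16  # gain for the alpha EWMA (illustrative value)

def on_cnp(rate_current, rate_target, alpha):
    """CNP received: remember the current rate, then cut multiplicatively."""
    rate_target = rate_current
    rate_current = rate_current * (1 - alpha / 2)
    alpha = (1 - G) * alpha + G            # congestion seen: raise alpha
    return rate_current, rate_target, alpha

def on_quiet_period(rate_current, rate_target, alpha):
    """No CNP for a timer period: fast recovery toward the target rate."""
    alpha = (1 - G) * alpha                # congestion absent: decay alpha
    rate_current = (rate_current + rate_target) / 2
    return rate_current, rate_target, alpha

rc, rt, a = 100.0, 100.0, 1.0              # Gb/s; alpha starts saturated
rc, rt, a = on_cnp(rc, rt, a)
print(rc, rt)                              # prints "50.0 100.0"
```

The shape of these updates is baked into the NIC, which is the inflexibility the paragraph above contrasts with IB's tunable congestion‑control algorithms.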
HPC Use Cases
Slingshot
New exascale systems plan to use the Slingshot network, an enhanced Ethernet with Rosetta switches that support smaller minimum frames (32 bytes) and credit‑propagation‑based congestion control, achieving ~350 ns switch latency—still higher than IB or custom interconnects.
CESM and GROMACS Benchmarks
Latency‑optimized 25 Gb Ethernet was used to run climate (CESM) and molecular dynamics (GROMACS) workloads, showing a 4× bandwidth advantage over TCP, though absolute performance remains below IB.
Conclusion
Ethernet switches exhibit higher latency than IB or custom HPC interconnects.
RoCE’s flow‑control and congestion‑control mechanisms still have room for improvement.
Ethernet switch costs remain higher than those of comparable IB solutions.
In small‑scale clusters, RoCE delivers acceptable performance, but its behavior at exascale remains untested. Emerging solutions like Slingshot modify Ethernet to mitigate RoCE’s shortcomings, but they are not pure Ethernet implementations.
References
https://en.wikipedia.org/wiki/Myrinet
https://en.wikipedia.org/wiki/Quadrics_(company)
https://www.nextplatform.com/2021/07/07/the-eternal-battle-between-infiniband-and-ethernet-in-hpc/
On the Use of Commodity Ethernet Technology in Exascale HPC Systems
https://network.nvidia.com/pdf/prod_eth_switches/PB_SN2410.pdf
InfiniBand Architecture Specification 1.2.1
Tianhe-1A Interconnect and Message-Passing Services
https://fasionchan.com/network/ethernet/
Congestion Control for Large-Scale RDMA Deployments
An In-Depth Analysis of the Slingshot Interconnect
Architects' Tech Alliance