Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Case Studies
This article examines the RoCE protocol—an RDMA‑enabled Ethernet technology—its evolution, technical details, congestion‑control mechanisms, performance comparisons with InfiniBand, practical deployment issues in HPC clusters, and real‑world case studies such as Slingshot and application benchmarks.
Abstract
RoCE (RDMA over Converged Ethernet) is a cluster network communication protocol that enables RDMA on Ethernet, dramatically reducing latency and improving bandwidth utilization compared with traditional TCP/IP.
This article discusses the author’s perspective on applying RoCE to HPC.
Development of HPC Networks and the Birth of RoCE
Early HPC systems often used custom networks such as Myrinet, Quadrics, and InfiniBand instead of Ethernet because they offered higher bandwidth, lower latency, better congestion control, and specialized features.
IBTA released the RoCE protocol standard in 2010 and RoCEv2 in 2014, providing a high‑performance Ethernet‑compatible solution that has kept Ethernet a significant presence in the TOP500 HPC list.
Although Myrinet and Quadrics have disappeared, InfiniBand, Cray’s proprietary networks, Tianhe’s networks, and Tofu D series still play important roles.
RoCE Protocol Overview
RoCE enables RDMA over Ethernet by offloading packet processing to the NIC, eliminating kernel‑mode transitions, reducing copy overhead, and lowering CPU usage, which in turn reduces latency and improves bandwidth efficiency.
RoCE has two versions: RoCE v1 (link‑layer, confined to a single L2 domain) and RoCE v2 (network‑layer, routable across L3).
RoCE v1
RoCE v1 retains the InfiniBand transport and network interfaces while replacing the link and physical layers with Ethernet. The Ethernet Ethertype for RoCE is 0x8915. Because RoCE v1 lacks an IP header, its packets cannot be routed beyond a single L2 segment.
RoCEv2
RoCEv2 replaces the InfiniBand network layer with IP and carries the InfiniBand transport over UDP (destination port 4791). It leverages the DSCP and ECN fields in the IP header for congestion control, making RoCEv2 packets routable and more scalable. In practice, “RoCE” usually refers to RoCEv2 unless explicitly stated otherwise.
Lossless Network and RoCE Congestion‑Control Mechanisms
RoCE requires a lossless fabric; any packet loss or reordering forces a go‑back‑N retransmission, which is undesirable for high‑performance workloads.
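To see why go-back-N makes even a single drop expensive, the toy model below (illustrative Python, not NIC firmware; the function name and parameters are invented for this sketch) counts total transmissions when the receiver discards everything after the first gap and the sender resends from the lost sequence number.

```python
def go_back_n_transmissions(num_packets, window, lost):
    """Count total packet transmissions under go-back-N when each
    sequence number in `lost` is dropped exactly once. The receiver
    discards everything after the first gap in a window, so the
    sender must resend from the lost packet onward."""
    lost = set(lost)          # copy so the caller's set is untouched
    sent = 0
    base = 0
    while base < num_packets:
        end = min(base + window, num_packets)
        first_gap = None
        for seq in range(base, end):
            sent += 1
            if seq in lost and first_gap is None:
                lost.discard(seq)     # the drop happens only once
                first_gap = seq       # everything after this is wasted
        base = first_gap if first_gap is not None else end
    return sent

# One drop in a 100-packet flow with an 8-packet window wastes the
# rest of that window: 106 transmissions instead of 100.
print(go_back_n_transmissions(100, 8, {50}))
```

The wasted tail grows with the window (bandwidth-delay product), which is why RoCE deployments work hard to avoid loss in the first place rather than tolerate it.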
RoCE congestion control operates in two stages: DCQCN (Data Center Quantized Congestion Notification) provides ECN-based, end-to-end rate reduction, and PFC (Priority Flow Control) provides hop-by-hop, pause-based flow control when rate reduction alone cannot keep buffers from filling.
When a switch’s output buffer exceeds a threshold, it marks the ECN field in the RoCE packet. The receiver sends a Congestion Notification Packet (CNP) back to the sender, prompting it to reduce its sending rate. ECN marking follows a probabilistic scheme based on two thresholds Kmin and Kmax.
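The probabilistic scheme is RED-style: no marking below Kmin, certain marking above Kmax, and a linear ramp up to a configurable Pmax in between. A minimal sketch (function and parameter names are my own; threshold units follow whatever the switch uses, KB here):

```python
def ecn_mark_probability(queue_kb, kmin_kb, kmax_kb, pmax):
    """RED-style ECN marking probability used by DCQCN-capable
    switches: 0 below Kmin, 1 above Kmax, and a linear ramp from
    0 to Pmax between the two thresholds."""
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)
```

Tuning Kmin, Kmax, and Pmax trades early (gentle) congestion signaling against buffer utilization, which is part of why RoCE fabrics need careful per-deployment tuning.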
If buffer occupancy grows further, the switch issues a PFC pause frame to upstream devices, halting transmission until congestion eases, after which a resume frame is sent. PFC operates per traffic class, allowing selective pausing.
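The per-class pause/resume behavior can be sketched as a pair of XOFF/XON watermarks per priority. This is a toy model with invented class names and illustrative thresholds, not switch firmware; it only shows that pausing one priority leaves the other seven flowing.

```python
class PfcIngressQueue:
    """Toy model of per-priority PFC: when a priority's buffer
    occupancy crosses XOFF, a pause frame is sent upstream for that
    priority only; when it drains below XON, a resume follows.
    Thresholds (in KB) are illustrative, not vendor defaults."""

    def __init__(self, xoff_kb=512, xon_kb=256):
        self.xoff, self.xon = xoff_kb, xon_kb
        self.occupancy = {p: 0 for p in range(8)}  # 8 Ethernet priorities
        self.paused = set()

    def enqueue(self, priority, kb):
        self.occupancy[priority] += kb
        if self.occupancy[priority] >= self.xoff:
            self.paused.add(priority)              # pause this class only

    def dequeue(self, priority, kb):
        self.occupancy[priority] = max(0, self.occupancy[priority] - kb)
        if priority in self.paused and self.occupancy[priority] <= self.xon:
            self.paused.discard(priority)          # send resume frame
```

The gap between XOFF and XON (plus headroom for in-flight data) must be reserved per priority, which is one reason PFC can leave switch buffers underused.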
RoCE and Soft‑RoCE
While most modern high‑performance Ethernet NICs support RoCE, some do not. Soft‑RoCE, a software implementation of the RoCE stack (with contributions from IBM, Mellanox, and others), enables RoCE communication on NICs without hardware offload, allowing mixed‑hardware clusters to interoperate.
Problems When Applying RoCE to HPC
Core Requirements of HPC Networks
HPC networks need (1) ultra‑low latency and (2) the ability to maintain low latency under rapidly changing traffic patterns.
RoCE addresses the first requirement by offloading work to the NIC, but its congestion‑control mechanisms struggle with the second, especially in highly dynamic traffic.
RoCE Low‑Latency Measurements
Latency tests comparing 25 Gb Ethernet (Mellanox ConnectX‑4 Lx) with 100 Gb InfiniBand (Mellanox ConnectX‑4) using the OSU Micro‑Benchmarks show that RoCE cuts latency roughly five‑fold relative to TCP, yet its latency remains 47–63% higher than InfiniBand's.
RoCE Packet Overhead
Sending 1 byte over RoCE incurs 58 bytes of overhead (Ethernet MAC + CRC, IP, UDP, IB BTH). By contrast, native InfiniBand requires only 26 bytes, illustrating the extra burden Ethernet places on RoCE in HPC.
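The arithmetic behind those totals is simple to check. The breakdown below follows the article's accounting for RoCEv2 (Ethernet MAC + FCS, IPv4, UDP, IB BTH) and a common accounting for a native InfiniBand local packet (LRH, BTH, ICRC, VCRC); all sizes are in bytes.

```python
# Per-packet header overhead, in bytes, following the accounting
# used in the text above.
ROCE_V2 = {
    "Ethernet MAC": 14,
    "Ethernet FCS/CRC": 4,
    "IPv4": 20,
    "UDP": 8,
    "IB BTH": 12,
}
NATIVE_IB = {"LRH": 8, "BTH": 12, "ICRC": 4, "VCRC": 2}

print(sum(ROCE_V2.values()))    # 58 bytes of overhead per RoCEv2 packet
print(sum(NATIVE_IB.values()))  # 26 bytes per native InfiniBand packet
```

For large MTU-sized messages the extra 32 bytes is noise, but for the small messages common in HPC collectives it is a meaningful fraction of every packet on the wire.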
Congestion‑Control Issues
Both DCQCN and PFC have limitations: PFC’s pause‑based approach can underutilize buffers, especially on low‑latency switches, while DCQCN’s fixed rate‑adjustment formulas lack the configurability of InfiniBand’s congestion control.
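The "fixed formulas" criticism refers to the DCQCN sender-side update rules (after Zhu et al., "Congestion Control for Large-Scale RDMA Deployments"). A sketch of the two core updates, with an illustrative EWMA gain and invented class/method names; real NICs expose a handful of constants like g but not the formulas themselves:

```python
G = 1 / 256  # EWMA gain for the congestion estimate; illustrative value

class DcqcnSender:
    """Sketch of the DCQCN sender-side rate machinery: a current
    rate Rc, a target rate Rt, and a congestion-severity estimate
    alpha in [0, 1]."""

    def __init__(self, line_rate_gbps):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate for recovery
        self.alpha = 1.0           # congestion severity estimate

    def on_cnp(self):
        """A CNP arrived (receiver saw ECN marks): cut the rate
        multiplicatively and raise the congestion estimate."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2
        self.alpha = (1 - G) * self.alpha + G

    def on_quiet_period(self):
        """No CNP for a full update period: decay the congestion
        estimate and take a fast-recovery step toward the target."""
        self.alpha = (1 - G) * self.alpha
        self.rc = (self.rt + self.rc) / 2
```

Because the multiplicative-decrease and recovery shapes are baked in, operators can tune constants like g or the CNP interval but cannot reshape the response curve the way InfiniBand's table-driven congestion control allows.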
RoCE Use Cases in HPC
Slingshot
US‑based next‑generation supercomputers plan to use the Slingshot interconnect, an enhanced Ethernet that improves RoCE’s shortcomings (e.g., reduced minimum frame size, credit‑based queue propagation, advanced congestion control). It achieves average switch latency of ~350 ns, comparable to high‑end Ethernet but still higher than InfiniBand or custom HPC fabrics.
CESM and GROMACS Benchmarks
Using the same 25 Gb Ethernet and 100 Gb InfiniBand, the author measured application‑level performance for CESM and GROMACS. Although bandwidth differs by a factor of four, the results provide some insight into real‑world impact.
Summary and Conclusions
Ethernet switches exhibit higher latency than InfiniBand or custom HPC networks.
RoCE’s flow‑control and congestion‑control mechanisms still have room for improvement.
Ethernet switch costs remain higher.
In small‑scale deployments, RoCE performance is acceptable, but large‑scale behavior remains untested. Emerging solutions like Slingshot suggest that pure RoCE may need further enhancements to meet exascale demands.
References
Myrinet, Wikipedia: https://en.wikipedia.org/wiki/Myrinet
Quadrics, Wikipedia: https://en.wikipedia.org/wiki/Quadrics_(company)
The Eternal Battle Between InfiniBand and Ethernet in HPC: https://www.nextplatform.com/2021/07/07/the-eternal-battle-between-infiniband-and-ethernet-in-hpc/
On the Use of Commodity Ethernet Technology in Exascale HPC Systems
NVIDIA SN2410 switch product brief: https://network.nvidia.com/pdf/prod_eth_switches/PB_SN2410.pdf
InfiniBand Architecture Specification 1.2.1
Tianhe‑1A Interconnect and Message‑Passing Services
Ethernet notes: https://fasionchan.com/network/ethernet/
Congestion Control for Large‑Scale RDMA Deployments
An In‑Depth Analysis of the Slingshot Interconnect
Architects' Tech Alliance