Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Case Studies
This article examines the RoCE protocol and its use in high‑performance computing, describing its low‑latency advantages, congestion‑control mechanisms, performance comparisons with InfiniBand, practical deployment issues, and real‑world case studies such as Slingshot and CESM/GROMACS benchmarks.
HPC Network Development and the Birth of RoCE
Early HPC systems relied on custom networks like Myrinet, Quadrics, and InfiniBand to achieve high bandwidth and low latency, while commodity Ethernet lagged behind on both. The IBTA released the RoCE (2010) and RoCEv2 (2014) standards, enabling Ethernet to compete with traditional HPC fabrics and maintain a significant share of the TOP500 clusters.
RoCE Protocol Overview
RoCE offloads RDMA operations to the NIC, eliminating kernel‑mode processing, reducing copy overhead, and lowering latency and CPU usage. Two versions exist: RoCE v1 (link‑layer, confined to a single L2 domain) and RoCE v2 (network‑layer using UDP, routable across L3).
RoCE v1
Uses Ethernet link‑layer with Ethertype 0x8915, no IP header, thus cannot be routed beyond a single L2 segment.
RoCE v2
Replaces the IB network layer with Ethernet IP and UDP, using DSCP/ECN for congestion control, making packets routable and improving scalability.
Lossless Network and RoCE Congestion‑Control Mechanisms
RoCE assumes a lossless fabric: packets must arrive in order and without loss. Congestion control therefore operates in two stages: DCQCN (Data Center Quantized Congestion Notification) throttles sender rates in response to ECN marks, and PFC (Priority Flow Control) pause frames act as a backstop when queues still overflow.
When a switch queue exceeds the Kmin threshold, arriving packets are ECN‑marked with a probability that ramps linearly until the queue reaches Kmax, beyond which every packet is marked. The receiver echoes each mark back to the sender as a CNP (Congestion Notification Packet), prompting the sender to reduce its rate. If congestion worsens anyway, PFC pause frames halt upstream traffic until buffers drain.
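The ECN marking ramp described above can be sketched in a few lines. The threshold and probability values below (Kmin, Kmax, Pmax) are illustrative defaults, not parameters from any particular switch:

```python
# Sketch of DCQCN-style ECN marking at a switch queue.
# Kmin/Kmax/Pmax values are hypothetical; real switches set these
# per priority queue in hardware.
def ecn_mark_probability(queue_kb: float, kmin_kb: float = 100.0,
                         kmax_kb: float = 400.0, pmax: float = 0.2) -> float:
    """Probability that an arriving packet gets ECN-marked.

    Below Kmin: never mark. Above Kmax: always mark.
    In between, the probability ramps linearly up to Pmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)
```

Each ECN-marked packet that reaches the receiver triggers a CNP back to the sender, closing the rate-control loop.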
RoCE and Soft‑RoCE
For NICs lacking native RoCE support, the open‑source Soft‑RoCE project enables RDMA over Ethernet by implementing the protocol in software, allowing mixed‑hardware clusters to communicate via RoCE.
Issues When Applying RoCE to HPC
Core HPC Network Requirements
HPC networks need (1) ultra‑low latency and (2) the ability to maintain low latency under rapidly changing traffic patterns. RoCE addresses the first but its congestion‑control mechanisms struggle with the second.
Latency Measurements
Using the OSU Micro‑Benchmarks, a 25 Gbps Ethernet link (Mellanox ConnectX‑4 Lx) running RoCE was compared with 100 Gbps InfiniBand (ConnectX‑4). RoCE cut latency roughly 5× relative to TCP on the same link, but remained 47–63% slower than InfiniBand.
Switch latency specifications show Ethernet switches (e.g., SN2410, SN3000) have higher base latency (300‑425 ns) compared to IB switches (90‑130 ns), contributing to the performance gap.
Packet Overhead
Encapsulating 1 byte of payload over RoCE incurs 58 bytes of overhead (Ethernet MAC 14 + CRC 4 + IP 20 + UDP 8 + IB BTH 12). By contrast, native IB requires only 26 bytes, highlighting Ethernet’s structural penalty.
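Using the byte counts above, the per-packet wire efficiency of the two encapsulations can be compared directly. This is a small sketch based only on the header sizes cited in this article; it ignores preamble, inter-frame gaps, and IB CRC fields:

```python
# Header overhead per packet, per the figures cited above.
ROCE_V2_OVERHEAD = 14 + 4 + 20 + 8 + 12  # Eth MAC + FCS + IPv4 + UDP + IB BTH = 58
IB_OVERHEAD = 26                          # native IB link/transport headers

def wire_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of bytes on the wire that carry application payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)

for payload in (1, 64, 1024, 4096):
    print(f"{payload:5d} B payload: RoCEv2 "
          f"{wire_efficiency(payload, ROCE_V2_OVERHEAD):.3f}, "
          f"IB {wire_efficiency(payload, IB_OVERHEAD):.3f}")
```

The gap is dramatic for the tiny messages common in HPC collectives (a 1-byte payload is under 2% of a RoCEv2 packet) but shrinks to a few percent at 4 KB payloads.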
Congestion‑Control Limitations
PFC’s pause‑frame approach forces conservative thresholds that can leave buffers underutilized, especially on shallow‑buffered low‑latency switches, while DCQCN’s fixed rate‑reduction and rate‑recovery formulas lack the configurability of InfiniBand’s congestion‑control schemes.
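PFC’s XOFF/XON behavior can be illustrated with a toy queue model. The thresholds and rates below are purely illustrative, and the model collapses all the real-world complexity (per-priority queues, headroom for in-flight packets) into a single counter:

```python
# Toy model of PFC pause/resume on one priority queue.
# XOFF/XON thresholds are hypothetical, not from any real switch.
XOFF_KB = 400.0   # buffer depth that triggers a pause frame upstream
XON_KB = 200.0    # depth at which the upstream sender is resumed

def pfc_step(buffer_kb: float, paused: bool,
             arrival_kb: float, drain_kb: float) -> tuple[float, bool]:
    """Advance the queue by one interval; return (new depth, paused?)."""
    inflow = 0.0 if paused else arrival_kb   # pause stops upstream traffic
    buffer_kb = max(0.0, buffer_kb + inflow - drain_kb)
    if not paused and buffer_kb >= XOFF_KB:
        paused = True     # send PFC pause (XOFF)
    elif paused and buffer_kb <= XON_KB:
        paused = False    # send PFC resume (XON)
    return buffer_kb, paused
```

Because XOFF must fire early enough to absorb packets already in flight, the usable buffer is smaller than the physical buffer, which is the underutilization problem noted above.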
RoCE in HPC: Application Cases
Slingshot
US supercomputers are adopting the Slingshot interconnect, an enhanced Ethernet that improves RoCE by reducing minimum frame size, propagating credit information, and offering better congestion control, achieving ~350 ns average switch latency.
CESM and GROMACS Benchmarks
Tests running CESM and GROMACS over 25 Gbps Ethernet with RoCE versus 100 Gbps InfiniBand showed measurable performance differences between the two fabrics despite the 4× gap in raw bandwidth, providing a useful point of comparison for real applications.
Summary and Conclusions
Ethernet switches exhibit higher latency than IB and custom HPC fabrics.
RoCE’s flow‑control and congestion‑control mechanisms still have room for improvement.
Ethernet switch costs remain higher than IB alternatives.
In small‑scale clusters, RoCE delivers acceptable performance, but its behavior at large scale remains untested. Emerging solutions like Slingshot suggest that pure Ethernet‑based RoCE may need further enhancements to match IB’s performance.
References
https://en.wikipedia.org/wiki/Myrinet
https://en.wikipedia.org/wiki/Quadrics_(company)
https://www.nextplatform.com/2021/07/07/the-eternal-battle-between-infiniband-and-ethernet-in-hpc/
On the Use of Commodity Ethernet Technology in Exascale HPC Systems
https://network.nvidia.com/pdf/prod_eth_switches/PB_SN2410.pdf
InfiniBand Architecture Specification 1.2.1
Tianhe‑1A Interconnect and Message‑Passing Services
https://fasionchan.com/network/ethernet/
Congestion Control for Large‑Scale RDMA Deployments
An In‑Depth Analysis of the Slingshot Interconnect