Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Case Studies
This article examines the RoCE protocol and its use in high‑performance computing, describing its low‑latency advantages, congestion‑control mechanisms, performance comparisons with InfiniBand, practical deployment issues, and real‑world case studies such as Slingshot and CESM/GROMACS benchmarks.
HPC Network Development and the Birth of RoCE
Early HPC systems relied on custom networks like Myrinet, Quadrics, and InfiniBand to achieve high bandwidth and low latency, while commodity Ethernet lagged behind on both. The IBTA released the RoCE (2010) and RoCEv2 (2014) standards, enabling Ethernet to compete with traditional HPC fabrics and maintain a significant share of the TOP500 clusters.
RoCE Protocol Overview
RoCE offloads RDMA operations to the NIC, eliminating kernel‑mode processing, reducing copy overhead, and lowering latency and CPU usage. Two versions exist: RoCE v1 (link‑layer, confined to a single L2 domain) and RoCE v2 (network‑layer using UDP, routable across L3).
RoCE v1
Uses Ethernet link‑layer with Ethertype 0x8915, no IP header, thus cannot be routed beyond a single L2 segment.
RoCE v2
Replaces the IB network layer with Ethernet IP and UDP, using DSCP/ECN for congestion control, making packets routable and improving scalability.
Lossless Network and RoCE Congestion‑Control Mechanisms
RoCE assumes a lossless fabric: packets must arrive in order and without loss. Congestion control therefore operates in two stages: DCQCN (Data Center Quantized Congestion Notification) throttles sender rates in response to ECN marks, and PFC (Priority Flow Control) pause frames act as a backstop when queues still overflow.
When a switch queue exceeds the Kmin threshold, arriving packets are ECN‑marked with a probability that ramps linearly until the queue reaches Kmax, beyond which every packet is marked. The receiver echoes each mark back to the sender as a CNP (Congestion Notification Packet), prompting the sender to reduce its rate. If congestion worsens anyway, PFC pause frames halt upstream traffic until buffers drain.
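The ECN marking ramp described above can be sketched in a few lines. The threshold and probability values below (Kmin, Kmax, Pmax) are illustrative defaults, not parameters from any particular switch:

```python
# Sketch of DCQCN-style ECN marking at a switch queue.
# Kmin/Kmax/Pmax values are hypothetical; real switches set these
# per priority queue in hardware.
def ecn_mark_probability(queue_kb: float, kmin_kb: float = 100.0,
                         kmax_kb: float = 400.0, pmax: float = 0.2) -> float:
    """Probability that an arriving packet gets ECN-marked.

    Below Kmin: never mark. Above Kmax: always mark.
    In between, the probability ramps linearly up to Pmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)
```

Each ECN-marked packet that reaches the receiver triggers a CNP back to the sender, closing the rate-control loop.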
RoCE and Soft‑RoCE
For NICs lacking native RoCE support, the open‑source Soft‑RoCE project enables RDMA over Ethernet by implementing the protocol in software, allowing mixed‑hardware clusters to communicate via RoCE.
Issues When Applying RoCE to HPC
Core HPC Network Requirements
HPC networks need (1) ultra‑low latency and (2) the ability to maintain low latency under rapidly changing traffic patterns. RoCE addresses the first but its congestion‑control mechanisms struggle with the second.
Latency Measurements
Using the OSU Micro‑Benchmarks, a 25 Gbps Ethernet link (Mellanox ConnectX‑4 Lx) running RoCE was compared with 100 Gbps InfiniBand (ConnectX‑4). RoCE cut latency roughly 5× relative to TCP on the same link, but remained 47–63% slower than InfiniBand.
Switch latency specifications show Ethernet switches (e.g., SN2410, SN3000) have higher base latency (300‑425 ns) compared to IB switches (90‑130 ns), contributing to the performance gap.
Packet Overhead
Encapsulating 1 byte of payload over RoCE incurs 58 bytes of overhead (Ethernet MAC 14 + CRC 4 + IP 20 + UDP 8 + IB BTH 12). By contrast, native IB requires only 26 bytes, highlighting Ethernet’s structural penalty.
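Using the byte counts above, the per-packet wire efficiency of the two encapsulations can be compared directly. This is a small sketch based only on the header sizes cited in this article; it ignores preamble, inter-frame gaps, and IB CRC fields:

```python
# Header overhead per packet, per the figures cited above.
ROCE_V2_OVERHEAD = 14 + 4 + 20 + 8 + 12  # Eth MAC + FCS + IPv4 + UDP + IB BTH = 58
IB_OVERHEAD = 26                          # native IB link/transport headers

def wire_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of bytes on the wire that carry application payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)

for payload in (1, 64, 1024, 4096):
    print(f"{payload:5d} B payload: RoCEv2 "
          f"{wire_efficiency(payload, ROCE_V2_OVERHEAD):.3f}, "
          f"IB {wire_efficiency(payload, IB_OVERHEAD):.3f}")
```

The gap is dramatic for the tiny messages common in HPC collectives (a 1-byte payload is under 2% of a RoCEv2 packet) but shrinks to a few percent at 4 KB payloads.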
Congestion‑Control Limitations
PFC’s pause‑frame approach forces conservative thresholds that can leave buffers underutilized, especially on shallow‑buffered low‑latency switches, while DCQCN’s fixed rate‑reduction and rate‑recovery formulas lack the configurability of InfiniBand’s congestion‑control schemes.
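PFC’s XOFF/XON behavior can be illustrated with a toy queue model. The thresholds and rates below are purely illustrative, and the model collapses all the real-world complexity (per-priority queues, headroom for in-flight packets) into a single counter:

```python
# Toy model of PFC pause/resume on one priority queue.
# XOFF/XON thresholds are hypothetical, not from any real switch.
XOFF_KB = 400.0   # buffer depth that triggers a pause frame upstream
XON_KB = 200.0    # depth at which the upstream sender is resumed

def pfc_step(buffer_kb: float, paused: bool,
             arrival_kb: float, drain_kb: float) -> tuple[float, bool]:
    """Advance the queue by one interval; return (new depth, paused?)."""
    inflow = 0.0 if paused else arrival_kb   # pause stops upstream traffic
    buffer_kb = max(0.0, buffer_kb + inflow - drain_kb)
    if not paused and buffer_kb >= XOFF_KB:
        paused = True     # send PFC pause (XOFF)
    elif paused and buffer_kb <= XON_KB:
        paused = False    # send PFC resume (XON)
    return buffer_kb, paused
```

Because XOFF must fire early enough to absorb packets already in flight, the usable buffer is smaller than the physical buffer, which is the underutilization problem noted above.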
RoCE in HPC: Application Cases
Slingshot
US supercomputers are adopting the Slingshot interconnect, an enhanced Ethernet that improves RoCE by reducing minimum frame size, propagating credit information, and offering better congestion control, achieving ~350 ns average switch latency.
CESM and GROMACS Benchmarks
Tests running CESM and GROMACS over 25 Gbps Ethernet with RoCE versus 100 Gbps InfiniBand showed measurable performance differences between the two fabrics despite the 4× gap in raw bandwidth, providing a useful point of comparison for real applications.
Summary and Conclusions
Ethernet switches exhibit higher latency than IB and custom HPC fabrics.
RoCE’s flow‑control and congestion‑control mechanisms still have room for improvement.
Ethernet switch costs remain higher than IB alternatives.
In small‑scale clusters, RoCE delivers acceptable performance, but its behavior at large scale remains untested. Emerging solutions like Slingshot suggest that pure Ethernet‑based RoCE may need further enhancements to match IB’s performance.
References
https://en.wikipedia.org/wiki/Myrinet
https://en.wikipedia.org/wiki/Quadrics_(company)
https://www.nextplatform.com/2021/07/07/the-eternal-battle-between-infiniband-and-ethernet-in-hpc/
On the Use of Commodity Ethernet Technology in Exascale HPC Systems
https://network.nvidia.com/pdf/prod_eth_switches/PB_SN2410.pdf
InfiniBand Architecture Specification 1.2.1
Tianhe‑1A Interconnect and Message‑Passing Services
https://fasionchan.com/network/ethernet/
Congestion Control for Large‑Scale RDMA Deployments
An In‑Depth Analysis of the Slingshot Interconnect