
Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Performance Evaluation

This article examines the RoCE protocol, its evolution and variants, lossless networking and congestion‑control mechanisms, practical performance measurements on HPC clusters, and the advantages and limitations of deploying RoCE in high‑performance computing environments.

Architects' Tech Alliance

Abstract

RoCE (RDMA over Converged Ethernet) enables remote memory access over Ethernet, dramatically reducing latency and improving bandwidth utilization compared with traditional TCP/IP. This article discusses the author's perspective on applying RoCE to HPC.

HPC Network Evolution and the Birth of RoCE

Early HPC systems used custom networks such as Myrinet, Quadrics, and InfiniBand to achieve higher bandwidth, lower latency, and better congestion control than Ethernet. In 2010 the IBTA released RoCE, followed by RoCEv2 in 2014, allowing Ethernet‑based high‑performance solutions to re‑enter the TOP500 rankings.

RoCE Protocol Overview

RoCE offloads packet processing to the NIC, avoiding kernel‑mode transitions and reducing copy overhead, which lowers latency and CPU usage. Two versions exist: RoCE v1 (link‑layer, confined to a single L2 domain) and RoCE v2 (network‑layer, routable across L3).

RoCE v1

Retains IB interfaces while replacing the link and physical layers with Ethernet. RoCE v1 packets use Ethertype 0x8915 and lack an IP header, preventing routing beyond the L2 domain.

RoCEv2

Replaces the IB network layer with Ethernet IP and UDP, using DSCP/ECN for congestion control, making the packets routable. In practice, references to “RoCE” usually mean RoCEv2.
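On the wire, the two versions are easy to tell apart: a RoCE v1 frame carries Ethertype 0x8915 with the IB payload directly after the MAC header, while RoCEv2 rides in UDP (IANA destination port 4791) over IP. A minimal Python sketch of this classification, assuming untagged IPv4 frames (VLAN tags and IPv6 are ignored for brevity):

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915   # RoCE v1: IB payload follows the MAC header directly
ETHERTYPE_IPV4    = 0x0800
ROCE_V2_UDP_PORT  = 4791     # IANA-assigned UDP destination port for RoCEv2

def classify_frame(frame: bytes) -> str:
    """Classify a raw, untagged Ethernet frame as RoCE v1, RoCE v2, or other."""
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCE v1"                  # no IP header -> not routable beyond L2
    if ethertype == ETHERTYPE_IPV4:
        ihl = (frame[14] & 0x0F) * 4      # IPv4 header length in bytes
        proto = frame[14 + 9]             # protocol field of the IPv4 header
        if proto == 17:                   # UDP
            dport = struct.unpack_from("!H", frame, 14 + ihl + 2)[0]
            if dport == ROCE_V2_UDP_PORT:
                return "RoCE v2"          # UDP/IP encapsulation -> L3 routable
    return "other"
```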

Lossless Network and RoCE Congestion Control

RoCE requires a lossless network: packets must arrive in order and without loss, since any drop triggers go‑back‑N retransmission. Congestion control operates in two stages: DCQCN (Datacenter Quantized Congestion Notification) for sender rate reduction, and PFC (Priority Flow Control) for pause‑frame‑based flow control as a backstop.
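The cost of a single drop under go‑back‑N can be illustrated with a toy model. This is a hypothetical simulation, not the actual NIC logic: it assumes the sender's window spans all packets and that a NAK arrives only once the window is exhausted.

```python
def go_back_n_transmissions(n_packets: int, lost_psn: int) -> int:
    """Count wire transmissions when the packet with sequence number
    `lost_psn` is lost once and the sender must go back and resend from it."""
    expected = 0          # next PSN the receiver will accept
    sends = 0
    lost_handled = False
    psn = 0
    while expected < n_packets:
        sends += 1
        if psn == lost_psn and not lost_handled:
            lost_handled = True          # this packet is lost in transit
        elif psn == expected:
            expected += 1                # in-order packet: receiver accepts it
        # out-of-order packets (psn > expected) are silently discarded
        psn += 1
        if psn == n_packets:             # window exhausted: NAK received,
            psn = expected               # go back to the first missing PSN
    return sends
```

Losing the first of 8 packets costs 16 transmissions instead of 8, because every packet after the loss is discarded and resent, which is why RoCE leans so heavily on the fabric being lossless in the first place.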

When a switch’s output buffer exceeds a threshold, ECN bits are set; the receiver returns a CNP (Congestion Notification Packet) prompting the sender to slow down. ECN marking follows a probabilistic scheme between Kmin and Kmax. If congestion worsens, the switch sends PFC pause frames upstream, halting traffic on the affected priority until congestion eases.
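The marking stage can be sketched as a RED‑style probability curve: no marking below Kmin, a linear ramp up to Pmax between Kmin and Kmax, and certain marking above Kmax. `kmin`, `kmax`, and `pmax` are switch‑configured thresholds; vendors may differ in the exact details.

```python
def ecn_mark_probability(queue_len: int, kmin: int, kmax: int, pmax: float) -> float:
    """RED-style ECN marking probability as a function of queue depth:
    0 below kmin, linear ramp to pmax between kmin and kmax, 1 above kmax."""
    if queue_len <= kmin:
        return 0.0          # no congestion signal yet
    if queue_len >= kmax:
        return 1.0          # severe congestion: mark every packet
    return pmax * (queue_len - kmin) / (kmax - kmin)
```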

RoCE and Soft‑RoCE

When hardware NICs lack RoCE support, the open‑source Soft‑RoCE project (IBM, Mellanox, etc.) provides a software implementation, allowing nodes with non‑RoCE NICs to communicate with RoCE‑enabled nodes, albeit without performance gains.

Applying RoCE to HPC: Issues

Core HPC Network Requirements

Two essential requirements are (1) ultra‑low latency and (2) sustained low latency under rapidly changing traffic patterns. RoCE addresses the first by offloading work to the NIC, but its congestion‑control mechanisms struggle with the second.

Latency Measurements

Using OSU Micro‑Benchmarks, the latency of 25 Gb Ethernet RoCE (Mellanox ConnectX‑4 Lx) was compared with 100 Gb InfiniBand EDR (Mellanox ConnectX‑4). RoCE delivered roughly one‑fifth the latency of TCP but remained 47–63 % slower than IB.

Official switch latency data show Ethernet switches (e.g., SN2410) around 300 ns, while IB switches (e.g., SB7800) are about 90 ns, indicating a persistent gap.

Packet Overhead

Encapsulating 1 byte of payload over RoCE incurs 58 bytes of overhead (Ethernet MAC + CRC, IP, UDP, IB BTH). By contrast, native InfiniBand adds only 26 bytes, highlighting the extra burden of Ethernet headers.
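The header arithmetic behind those figures can be checked directly. The RoCEv2 split below follows the text; the native‑IB breakdown (LRH, BTH, ICRC, VCRC) is our reading of the InfiniBand header stack, and exact accounting conventions (e.g. whether preamble and inter‑frame gap are counted) vary.

```python
# Per-packet header overhead, in bytes.
ROCE_V2 = {
    "Ethernet MAC": 14, "Ethernet CRC": 4,   # L2 framing
    "IPv4": 20, "UDP": 8,                    # L3/L4 encapsulation (RoCEv2 only)
    "IB BTH": 12,                            # InfiniBand Base Transport Header
}
NATIVE_IB = {
    "LRH": 8,                                # IB Local Route Header
    "IB BTH": 12,
    "ICRC": 4, "VCRC": 2,                    # invariant + variant CRCs
}

assert sum(ROCE_V2.values()) == 58           # overhead per RoCEv2 packet
assert sum(NATIVE_IB.values()) == 26         # overhead per native IB packet
```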

Congestion‑Control Limitations

PFC’s pause‑frame approach can under‑utilize buffers, especially on low‑latency switches, while DCQCN’s fixed reduction/increase formulas lack the flexibility of IB’s customizable algorithms.
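DCQCN's fixed formulas are easy to state. Per the DCQCN paper in the references ("Congestion Control for Large‑Scale RDMA Deployments"), each CNP updates a running congestion estimate alpha and cuts the sending rate by a factor of alpha/2; `G` is a small gain constant (1/256 is a typical value). This sketch omits timer and ordering details, but the point stands: the reduction rule itself is fixed, unlike IB's customizable algorithms.

```python
G = 1.0 / 256.0   # alpha update gain; 1/256 is a typical constant

def on_cnp(rate: float, alpha: float) -> tuple[float, float]:
    """Apply DCQCN's multiplicative decrease when a CNP arrives."""
    alpha = (1 - G) * alpha + G          # congestion estimate rises toward 1
    rate = rate * (1 - alpha / 2)        # fixed rate-cut formula
    return rate, alpha
```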

HPC Use Cases

Slingshot

New exascale systems plan to use the Slingshot network, an enhanced Ethernet with Rosetta switches that support smaller minimum frames (32 bytes) and credit‑propagation‑based congestion control, achieving ~350 ns switch latency—still higher than IB or custom interconnects.

CESM and GROMACS Benchmarks

Latency‑optimized 25 Gb Ethernet was used to run climate (CESM) and molecular dynamics (GROMACS) workloads, showing a 4× bandwidth advantage over TCP, though absolute performance remains below IB.

Conclusion

Ethernet switches exhibit higher latency than IB or custom HPC interconnects.

RoCE’s flow‑control and congestion‑control mechanisms still have room for improvement.

Ethernet switch costs remain higher than those of IB solutions.

In small‑scale clusters, RoCE delivers acceptable performance, but its behavior at exascale remains untested. Emerging solutions like Slingshot modify Ethernet to mitigate RoCE’s shortcomings, but they are not pure Ethernet implementations.

References

https://en.wikipedia.org/wiki/Myrinet
https://en.wikipedia.org/wiki/Quadrics_(company)
https://www.nextplatform.com/2021/07/07/the-eternal-battle-between-infiniband-and-ethernet-in-hpc/
On the Use of Commodity Ethernet Technology in Exascale HPC Systems
https://network.nvidia.com/pdf/prod_eth_switches/PB_SN2410.pdf
InfiniBand Architecture Specification 1.2.1
Tianhe-1A Interconnect and Message‑Passing Services
https://fasionchan.com/network/ethernet/
Congestion Control for Large‑Scale RDMA Deployments
An In‑Depth Analysis of the Slingshot Interconnect

Tags: performance, network, RDMA, HPC, Ethernet, RoCE
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
