Why RoCE Is Reshaping High‑Performance Computing Networks

The article provides a detailed technical analysis of RoCE (RDMA over Converged Ethernet), its two protocol versions, packet overhead, congestion‑control mechanisms, Soft‑RoCE implementation, and the challenges and performance implications of deploying RoCE in modern HPC environments compared to InfiniBand and traditional Ethernet solutions.

Architects' Tech Alliance

May 9, 2024

Early HPC Network Landscape

In the early days of high-performance computing (HPC), specialized networks such as Myrinet, Quadrics, and InfiniBand were preferred over Ethernet because they offered higher bandwidth, lower latency, and better congestion control. The IBTA introduced the RoCE protocol in 2010 (RoCE v1) and updated it in 2014 (RoCE v2). By bringing RDMA to Ethernet, RoCE substantially improved Ethernet performance and revived interest in Ethernet-based HPC solutions.

RoCE Protocol Overview

RoCE enables Remote Direct Memory Access (RDMA) over Ethernet. Data-transfer tasks are offloaded to the network adapter, which reduces kernel involvement, memory copies, and latency. Freeing the CPU from per-packet work also improves bandwidth utilization.

Two versions exist:

RoCE v1 operates at the link layer (Layer 2) and requires both endpoints to be on the same Ethernet segment. EtherType 0x8915 identifies RoCE frames, but the lack of an IP header prevents routing beyond Layer 2.

RoCE v2 runs at the network layer (Layer 3) by encapsulating RoCE packets in UDP over IP (UDP destination port 4791). It uses the IP DSCP and ECN fields for congestion control, and its packets are routable across IP networks. In most discussions, “RoCE” refers to RoCE v2 unless otherwise specified.

RoCE v1 Details

RoCE v1 retains InfiniBand’s transport and network layers while replacing the link and physical layers with Ethernet. Because its frames carry no IP header, they cannot be routed, which confines RoCE v1 to a single Layer 2 domain.

RoCE v2 Enhancements

RoCE v2 adds IP and UDP encapsulation, so packets can be forwarded by ordinary IP routers. It also adopts ECN-based congestion control, enabling more scalable deployments.
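The difference between the two versions shows up directly in the packet headers. The following is a minimal sketch of how a parser might tell them apart, assuming untagged Ethernet II frames and IPv4 only. The EtherType 0x8915 comes from the text; 4791 is the IANA-assigned UDP destination port for RoCE v2; the function name classify_roce is hypothetical.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915   # RoCE v1 EtherType (from the text)
IPV4_ETHERTYPE = 0x0800
UDP_PROTOCOL = 17
ROCE_V2_UDP_PORT = 4791      # IANA-assigned UDP destination port for RoCE v2


def classify_roce(frame: bytes) -> str:
    """Return 'rocev1', 'rocev2', or 'other' for an untagged Ethernet II frame.

    Assumption: no VLAN tag and IPv4 only; a real parser would handle both.
    """
    if len(frame) < 14:
        return "other"
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype == ROCE_V1_ETHERTYPE:
        return "rocev1"                      # Layer 2 only, no IP header follows
    if ethertype != IPV4_ETHERTYPE or len(frame) < 14 + 20:
        return "other"
    ihl = (frame[14] & 0x0F) * 4             # IPv4 header length in bytes
    if ihl < 20:
        return "other"
    protocol = frame[14 + 9]                 # IPv4 protocol field
    udp_start = 14 + ihl
    if protocol != UDP_PROTOCOL or len(frame) < udp_start + 8:
        return "other"
    (dst_port,) = struct.unpack("!H", frame[udp_start + 2:udp_start + 4])
    return "rocev2" if dst_port == ROCE_V2_UDP_PORT else "other"
```

Because RoCE v2 is identified by an ordinary IP/UDP tuple, any IP router can forward it, whereas a RoCE v1 frame is only meaningful to devices on the same Layer 2 segment.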

Lossless Transmission and Congestion Control

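RoCE traffic must be delivered without loss and in order. If a packet is lost or arrives out of order, the sender must perform a go-back-N retransmission: the receiver does not buffer the packets that follow the gap, so the sender resends everything from the missing packet onward.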
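The cost of a single loss under go-back-N can be seen in a toy simulation. This is a simplified sketch, not a model of any real NIC: it assumes the sender transmits the whole remaining window before it learns of the loss, and the function name and parameters are hypothetical.

```python
def go_back_n_transfer(num_packets: int, drop_once: set[int]) -> list[int]:
    """Return every sequence number put on the wire, including retransmissions.

    drop_once: sequence numbers the network loses the first time they are sent.
    """
    pending_drops = set(drop_once)
    wire_log: list[int] = []
    expected = 0  # next sequence number the receiver will accept
    while expected < num_packets:
        # The sender (re)starts from the first packet not yet delivered.
        for psn in range(expected, num_packets):
            wire_log.append(psn)
            if psn in pending_drops:
                pending_drops.discard(psn)   # lost in the network this time
                continue
            if psn == expected:
                expected += 1                # in order: delivered
            # Otherwise the packet is out of order and the receiver
            # discards it instead of buffering it.
    return wire_log
```

Losing one packet out of five, for example with go_back_n_transfer(5, {2}), puts eight packets on the wire: packets 3 and 4 are sent twice even though they were never lost. This is why RoCE depends so heavily on a lossless fabric.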

RoCE implements a two‑stage congestion‑control scheme:

Initial slowdown using DCQCN (Data Center Quantized Congestion Notification).

Transmission pause using PFC (Priority Flow Control).

When a switch detects that the total pending buffer size on a port exceeds a threshold, it marks the ECN field in the RoCE packet’s IP header. The receiver sends a Congestion Notification Packet (CNP) back to the sender, prompting a rate reduction.

The marking probability depends on two parameters, Kmin and Kmax:

If queue length < Kmin, no packets are marked.

If Kmin ≤ queue length ≤ Kmax, packets are marked with increasing probability.

If queue length > Kmax, all packets are marked.
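The three cases above can be written as a single function. This is a minimal sketch that assumes the probability rises linearly between Kmin and Kmax up to a ceiling p_max; the linear ramp and the p_max parameter are assumptions in the style of RED-like marking, not details given in the text.

```python
def ecn_mark_probability(queue_len: float, k_min: float, k_max: float,
                         p_max: float = 1.0) -> float:
    """Probability that a switch ECN-marks a packet at the given queue length.

    Assumption: linear ramp from 0 at k_min to p_max at k_max.
    """
    if k_min > k_max:
        raise ValueError("k_min must not exceed k_max")
    if queue_len < k_min:
        return 0.0                 # below Kmin: never mark
    if queue_len > k_max:
        return 1.0                 # above Kmax: mark every packet
    if k_max == k_min:
        return p_max
    return p_max * (queue_len - k_min) / (k_max - k_min)
```

A low Kmin makes senders back off early and keeps queues short at some cost in throughput, while a high Kmin tolerates deeper queues and therefore more queuing delay.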

The receiver does not send a CNP for every marked packet. Instead, it sends at most one CNP per time interval in which marked packets arrived, and the sender adjusts its rate according to the CNPs it receives.
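The receiver-side aggregation might be sketched as follows for a single flow. The class and method names are hypothetical, and the 50 µs default interval is taken from the figure quoted later in this article; real NICs keep this state per flow.

```python
from typing import Optional


class CnpGenerator:
    """Emit at most one CNP per interval, however many marked packets arrive."""

    def __init__(self, interval_us: float = 50.0):
        self.interval_us = interval_us
        self._last_cnp_us: Optional[float] = None

    def on_packet(self, now_us: float, ecn_marked: bool) -> bool:
        """Return True if a CNP should be sent in response to this packet."""
        if not ecn_marked:
            return False
        if (self._last_cnp_us is not None
                and now_us - self._last_cnp_us < self.interval_us):
            return False           # a CNP was already sent in this interval
        self._last_cnp_us = now_us
        return True
```

Rate-limiting CNPs keeps the reverse path lightly loaded, but it also means the sender's view of congestion is refreshed no more often than once per interval.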

If the buffer occupancy reaches a higher threshold, the switch issues a PFC pause frame, halting transmission on the affected priority until congestion subsides.
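The pause stage can be pictured as a pair of thresholds with hysteresis. In this sketch the separate resume threshold (xon_bytes) is an assumption used to model "until congestion subsides"; the names and the returned strings are hypothetical and do not correspond to any switch API.

```python
from typing import Optional


class PfcPriorityQueue:
    """Track the pause state of one priority from its buffer occupancy."""

    def __init__(self, xoff_bytes: int, xon_bytes: int):
        if xon_bytes >= xoff_bytes:
            raise ValueError("xon_bytes must be below xoff_bytes")
        self.xoff_bytes = xoff_bytes   # occupancy at which a pause is sent
        self.xon_bytes = xon_bytes     # occupancy at which sending may resume
        self.paused = False

    def update(self, occupancy_bytes: int) -> Optional[str]:
        """Return 'PAUSE' or 'RESUME' when the state changes, else None."""
        if not self.paused and occupancy_bytes >= self.xoff_bytes:
            self.paused = True
            return "PAUSE"
        if self.paused and occupancy_bytes <= self.xon_bytes:
            self.paused = False
            return "RESUME"
        return None
```

In a well-tuned deployment the ECN marking thresholds sit below the pause threshold, so that DCQCN slows senders down first and PFC acts only as a last resort.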

Soft‑RoCE

For NICs that lack native RoCE support, the open-source Soft-RoCE project (originating from a collaboration between IBM and Mellanox) provides a software implementation that enables RoCE communication on those devices. Soft-RoCE may not improve performance on such NICs, but it allows mixed-environment clusters to interoperate and facilitates incremental upgrades.

Challenges Deploying RoCE in HPC

Fundamental HPC Network Requirements

HPC networks must deliver ultra‑low latency and maintain that latency under dynamic traffic patterns. RoCE addresses low latency by offloading network operations to the NIC, reducing CPU utilization.

However, maintaining low latency under fluctuating traffic places heavy demands on congestion‑control mechanisms, which can become performance bottlenecks.

Latency Comparison

Both InfiniBand and RoCE v2 bypass the kernel protocol stack, achieving significantly lower end‑to‑end latency than TCP/IP. Empirical tests show latency reductions from ~50 µs (TCP/IP) to ~5 µs (RoCE) and even ~2 µs (InfiniBand) within the same cluster.

RoCE Packet Overhead

For a 1‑byte payload, the additional overhead is:

RoCE over Ethernet: 14 B MAC header + 4 B CRC + 20 B IP header + 8 B UDP header + 12 B BTH = 58 bytes.

InfiniBand: 8 B LRH + 6 B CRC + 12 B BTH = 26 bytes.

Custom networks can further reduce overhead (e.g., Tianhe‑1A mini‑packet header of 8 bytes). The complexity of Ethernet’s lower layers is a key obstacle for RoCE adoption in HPC.
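The per-packet figures above translate into wire efficiency as follows. This sketch uses only the header sizes listed in this section; the dictionary and function names are hypothetical, and fields the article does not count are left out.

```python
# Header sizes in bytes, as listed in this section.
ROCE_V2_OVERHEAD = {
    "Ethernet MAC header": 14,
    "Ethernet CRC": 4,
    "IP header": 20,
    "UDP header": 8,
    "BTH": 12,
}
INFINIBAND_OVERHEAD = {
    "LRH (Local Route Header)": 8,
    "CRC": 6,
    "BTH": 12,
}


def wire_efficiency(payload_bytes: int, overhead: dict) -> float:
    """Fraction of transmitted bytes that carry payload."""
    total_overhead = sum(overhead.values())
    return payload_bytes / (payload_bytes + total_overhead)


if __name__ == "__main__":
    for size in (1, 64, 4096):
        print(size,
              round(wire_efficiency(size, ROCE_V2_OVERHEAD), 3),
              round(wire_efficiency(size, INFINIBAND_OVERHEAD), 3))
```

The gap matters most for the small messages that are common in HPC applications; for payloads of several kilobytes, both protocols spend the vast majority of their bytes on payload.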

Congestion‑Control Challenges

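PFC is a coarse, threshold-based mechanism that pauses an entire priority once buffer occupancy crosses a threshold. On switches with limited buffers, a threshold set too high risks packet loss, while one set too low leaves buffer space unused. Credit-based flow control, as used in InfiniBand, offers finer-grained buffer management.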

DCQCN, similar to InfiniBand’s congestion control, uses reverse‑path notifications (CNP) to inform the sender of congestion. RoCE follows a fixed set of slowdown/acceleration formulas, whereas InfiniBand allows custom strategies. The default CNP generation interval is up to 50 µs; InfiniBand can configure timers as low as 1.024 µs, though such settings are rarely realized in practice.
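The sender-side formulas can be sketched as follows, based on the update rules described in the original DCQCN proposal: a multiplicative rate cut on each CNP, a decay of the congestion estimate alpha when no CNP arrives, and a fast-recovery step that moves halfway back toward the target rate. The class and method names are hypothetical, the initial values are illustrative assumptions, and the later additive and hyper-increase phases are omitted.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    current_rate: float        # R_C, for example in Gbit/s
    target_rate: float         # R_T, the rate to recover toward
    alpha: float = 1.0         # congestion estimate, between 0 and 1
    g: float = 1 / 256         # gain for the alpha update (assumed value)

    def on_cnp(self) -> None:
        """Rate cut applied when a CNP arrives."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """Decay alpha when no CNP arrived during the last timer period."""
        self.alpha *= 1 - self.g

    def on_fast_recovery_step(self) -> None:
        """Move the current rate halfway back toward the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2
```

With alpha at its initial value of 1, the first CNP halves the sending rate, and each fast-recovery step then closes half of the remaining gap to the previous rate.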

RoCE in Real‑World HPC Deployments

Modern U.S. supercomputers employ the Slingshot network, an enhanced Ethernet that integrates RoCE-compatible Rosetta switches. Features include a 32-byte minimum IP frame size, shared queue occupancy information, and improved congestion control. Average switch latency is ~350 ns, comparable to high-performance Ethernet switches but still higher than InfiniBand and some custom HPC switches.

Benchmarks with CESM and GROMACS on 25 GbE and 100 GbE links show comparable performance despite the four-fold bandwidth difference, which suggests that these workloads are limited by latency rather than bandwidth. This highlights the importance of low-latency Ethernet for AI-focused data centers.
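A back-of-the-envelope model illustrates why a four-fold bandwidth gap can matter little for small messages. It assumes transfer time is a fixed latency plus serialization time; the 5 µs latency reuses the RoCE figure quoted earlier, and the message sizes are illustrative rather than taken from the CESM or GROMACS runs.

```python
def transfer_time_us(message_bytes: int, latency_us: float,
                     bandwidth_gbps: float) -> float:
    """Estimated one-way transfer time: fixed latency plus serialization."""
    bits_per_us = bandwidth_gbps * 1e3      # 1 Gbit/s equals 1000 bits per µs
    return latency_us + message_bytes * 8 / bits_per_us


if __name__ == "__main__":
    for size in (1024, 1024 * 1024):
        slow = transfer_time_us(size, 5.0, 25)
        fast = transfer_time_us(size, 5.0, 100)
        print(f"{size} B: 25 GbE {slow:.2f} us, 100 GbE {fast:.2f} us, "
              f"ratio {slow / fast:.2f}")
```

Under these assumptions a 1 KiB message takes almost the same time on both links because latency dominates, while a 1 MiB message is nearly four times slower on 25 GbE. Applications that exchange many small messages therefore gain more from lower latency than from additional bandwidth.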

Conclusion

RoCE still faces three obstacles in HPC:

Ethernet switches still exhibit higher latency than InfiniBand or custom HPC switches.

RoCE’s flow‑control and congestion‑control mechanisms require further optimization.

Ethernet switch costs remain relatively high.

As AI workloads demand ever-faster data-center networks, traditional TCP/IP is no longer sufficient. RDMA technologies, particularly InfiniBand and RoCE, are becoming the preferred solutions for high-performance, low-latency networking. RoCE offers greater deployment flexibility but still faces challenges in latency, congestion management, and cost.
