Why RoCEv2 Needs a Lossless Network and How to Achieve It
RDMA was originally an InfiniBand technology; RoCE brings it to Ethernet, and RoCEv2 adds IP/UDP headers so traffic can be routed at L3. Because RDMA is highly sensitive to packet loss, RoCEv2 requires a lossless network and relies on technologies such as PFC, ECN, DCQCN, and multi-path transmission to maintain high performance.
RoCE
The RDMA protocol stack was first implemented on InfiniBand hardware. Because InfiniBand dominated that market and was expensive, the RDMA Consortium was formed to port the RDMA stack onto Ethernet and break the monopoly.
RoCEv2 Protocol Stack
RoCE evolved through two versions:
RoCEv1 : L2 uses the IEEE 802.3 Ethernet header, while L3 still uses the InfiniBand GRH header, so packets cannot be routed and communication is confined to a single L2 broadcast domain (one VLAN).
RoCEv2 : L3 is encapsulated with IP/UDP headers, removing RoCEv1's limitation and enabling cross-subnet communication; varying the UDP source port also gives ECMP hashing the entropy it needs for load balancing (see the encapsulation sketch below).
Because of its cost‑performance advantage, RoCEv2 has become the mainstream RDMA implementation, supported by RNIC vendors such as NVIDIA/Mellanox and Intel.
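For a concrete picture, the sketch below lays out the RoCEv2 encapsulation: Ethernet / IPv4 / UDP with destination port 4791 / InfiniBand BTH / payload / ICRC. It is a minimal illustration assuming Scapy is installed; the BTH and payload are shown as opaque bytes (Scapy has no standard RoCE dissector), and all addresses, the DSCP value, and the source port are placeholders.

```python
# A minimal sketch of the RoCEv2 on-wire layering (illustrative values only);
# real RoCEv2 frames are generated by the RNIC, not by user code like this.
from scapy.all import Ether, IP, UDP, Raw

bth_placeholder = bytes(12)   # 12-byte InfiniBand Base Transport Header, opaque here

pkt = (
    Ether(src="aa:bb:cc:00:00:01", dst="aa:bb:cc:00:00:02")  # L2 Ethernet
    / IP(src="10.0.0.1", dst="10.0.1.1", tos=26 << 2)        # L3; DSCP 26 is a common RoCE class
    / UDP(sport=49152, dport=4791)                           # dport 4791 is reserved for RoCEv2;
                                                             #   sport varies per QP, giving ECMP entropy
    / Raw(load=bth_placeholder + b"payload")                 # BTH + RDMA payload (ICRC omitted)
)
pkt.show()
```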
RoCEv2 Requires a Lossless Network
RoCEv2 over RDMA is extremely sensitive to packet loss for two main reasons:
UDP is connectionless and provides no reliability of its own; loss detection and retransmission fall to the RoCE transport on the RNIC (or to the application), which is far less efficient than simply never dropping packets.
There is no selective-retransmission sliding window: when one packet is lost, the transport performs a go-back-N retransmission of every subsequent packet in flight, so even rare losses waste a large amount of bandwidth.
Even a loss rate above 0.001% (1e-5) can cause throughput to collapse; to run RDMA at full speed, loss must stay below 1e-5, ideally at zero.
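A back-of-the-envelope model (an illustrative simplification, not a measurement) shows why such tiny loss rates matter under go-back-N: each lost packet forces retransmission of roughly everything still in flight, so efficiency drops as the loss rate times the in-flight window grows.

```python
# Rough go-back-N efficiency model: each loss retransmits ~W in-flight packets,
# so expected transmissions per delivered packet are about 1 + p * W.
def goodput_fraction(loss_rate: float, packets_in_flight: int) -> float:
    """Approximate fraction of link capacity carrying useful data."""
    return 1.0 / (1.0 + loss_rate * packets_in_flight)

# Assumed example: ~100 Gbps link, 1 KB packets, a few hundred microseconds of
# data in flight -> on the order of several thousand packets outstanding.
window = 6000
for p in (1e-6, 1e-5, 1e-4, 1e-3):
    print(f"loss={p:.0e}  goodput ~ {goodput_fraction(p, window):.1%}")
# loss=1e-06  goodput ~ 99.4%
# loss=1e-05  goodput ~ 94.3%
# loss=1e-04  goodput ~ 62.5%
# loss=1e-03  goodput ~ 14.3%
```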
Thus RoCEv2 depends on a lossless network, and loss sources fall into two categories: network congestion and physical link failures.
Network Congestion
When traffic exceeds the processing and buffering capacity of routers, switches, or RNICs, packets are dropped.
Physical Link Failures
Physical disconnections cause transmission interruptions.
Lossless Network Congestion‑Control Techniques
Causes of Network Congestion
When the offered load approaches a device's forwarding and buffering capacity, queues build up, latency rises sharply, and once buffers overflow packets are dropped.
In practice, buffer memory is expensive and traffic cannot be precisely predicted, leading to congestion scenarios such as oversubscribed uplink/downlink (convergence) ratios, ECMP elephant-flow collisions, and TCP-incast traffic spikes.
Flow Control vs. Congestion Control
Flow Control : Local view; it concerns only the two endpoints of a link, keeping a fast sender from overwhelming a slow receiver.
Congestion Control : Global view, ensures overall network quality; includes flow control but also addresses broader concerns such as zero loss, low latency, and high throughput.
PFC (Priority‑Based Flow Control)
IEEE 802.1Qbb PFC extends IEEE 802.3x flow control with per-priority granularity: instead of pausing an entire link, a congested port pauses only the affected priority, so congestion in low-value flows does not stall high-value flows.
Each link carries eight virtual lanes (VL0-7), one per priority; RNICs map traffic into eight Tx queues (TC0-7). The priority is carried in the 3-bit PCP field of the VLAN tag, or derived from the IP DSCP field in routed (VLAN-less) deployments.
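In routed fabrics the priority usually comes from DSCP. The sketch below marks a plain UDP socket with a DSCP value so that switches and NICs configured with a matching mapping place the traffic in a lossless priority queue; the DSCP 26 to priority 3 mapping is a common RoCE convention rather than a standard, and real RoCE traffic is normally marked through RNIC/driver settings rather than per-socket options.

```python
# Minimal sketch: mark outgoing packets with a DSCP so the fabric can map them
# to a lossless PFC priority (DSCP 26 -> priority 3 is a common, not mandatory,
# RoCE convention; the actual mapping is switch/NIC configuration).
import socket

DSCP_ROCE = 26                # assumed class for RDMA data traffic
TOS_VALUE = DSCP_ROCE << 2    # DSCP occupies the upper 6 bits of the ToS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_VALUE)
sock.sendto(b"hello", ("10.0.1.1", 4791))   # address and port are illustrative
```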
PFC Deadlock
When PAUSE back-pressure propagates in a cycle across several congested switches, each waiting for the next to free buffer space, traffic in the cycle can be blocked permanently.
Deadlock-detection (watchdog) mechanisms monitor queues that remain paused and, after a timeout, stop honoring PAUSE frames and release or drop the buffered packets.
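The watchdog idea can be sketched as follows; the queue abstraction, timeout, and drop policy are illustrative, since real watchdogs run inside switch ASICs and firmware.

```python
# Illustrative PFC deadlock watchdog: if a queue stays paused past a timeout,
# stop honoring PAUSE and release/drop its packets to break the cycle.
import time

PAUSE_TIMEOUT_S = 0.2   # assumed threshold for "continuously paused"

class EgressQueue:
    def __init__(self, name: str):
        self.name = name
        self.paused_since = None   # monotonic time when PAUSE began, or None
        self.packets = []

    def on_pause_frame(self):
        if self.paused_since is None:
            self.paused_since = time.monotonic()

    def on_resume_frame(self):
        self.paused_since = None

    def watchdog_tick(self):
        """Called periodically; breaks a suspected deadlock after the timeout."""
        if self.paused_since is None:
            return
        if time.monotonic() - self.paused_since > PAUSE_TIMEOUT_S:
            print(f"{self.name}: watchdog fired, dropping {len(self.packets)} packets")
            self.packets.clear()       # drop (or drain) buffered packets
            self.paused_since = None   # stop honoring the stale PAUSE state
```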
PFC Storm
In large fabrics, PAUSE frames propagating hop by hop can escalate into a network-wide pause storm, which limits how far PFC scales and complicates cloud-disk deployments.
ECN (Explicit Congestion Notification)
ECN works at the IP layer: a congested switch sets the CE codepoint in the ECN bits of passing packets; the receiver then sends a Congestion Notification Packet (CNP) back to the sender, which reduces its sending rate.
The feedback loop is end to end between the communicating endpoints, whereas PFC back-pressures the network hop by hop and requires every switch along the path to participate.
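Concretely, the ECN field is the two low-order bits of the IPv4 ToS / IPv6 Traffic Class byte (RFC 3168): 00 means not ECN-capable, 10 and 01 mean ECN-Capable Transport (ECT), and 11 means Congestion Experienced (CE). A congested switch flips ECT to CE instead of dropping, and a RoCEv2 receiver that sees CE emits a CNP. The helper below only decodes those bits; it is a reading aid, not part of any RoCE stack.

```python
# Decode/mark the ECN codepoint in a ToS / Traffic Class byte (RFC 3168).
ECN_NAMES = {0b00: "Not-ECT", 0b10: "ECT(0)", 0b01: "ECT(1)", 0b11: "CE"}

def decode_ecn(tos_byte: int) -> str:
    return ECN_NAMES[tos_byte & 0b11]

def mark_ce(tos_byte: int) -> int:
    """What a congested switch does to an ECT packet: set both ECN bits."""
    return tos_byte | 0b11

tos = (26 << 2) | 0b10           # DSCP 26 plus ECT(0), as a sender might mark it
print(decode_ecn(tos))           # ECT(0)
print(decode_ecn(mark_ce(tos)))  # CE
```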
ECN Congestion‑Control Lag
Because ECN markings and CNP messages must traverse multiple hops, there is at least one RTT of delay before the sender slows down, and CNP loss can further increase latency.
DCQCN (Data Center Quantized Congestion Notification)
Developed by Microsoft and Mellanox, DCQCN combines ideas from QCN and DCTCP: switches perform ECN marking, while sender RNICs run a rate-based reduction and recovery algorithm driven by CNPs. It exposes many tunable parameters but offers fairness and high bandwidth utilization.
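The sender-side reaction can be sketched as follows: each CNP raises an EWMA congestion estimate alpha and cuts the rate, while CNP-free update periods decay alpha and recover the rate. This is a simplified rendition of the algorithm in the DCQCN paper ("Congestion Control for Large-Scale RDMA Deployments", SIGCOMM 2015); the constants are illustrative and the real algorithm adds byte counters, timers, and a hyper-increase stage.

```python
# Simplified DCQCN sender-side rate control (illustrative constants; the full
# algorithm also has fast-recovery rounds, additive and hyper increase).
class DcqcnSender:
    def __init__(self, line_rate_gbps: float, g: float = 1.0 / 256):
        self.rc = line_rate_gbps   # current sending rate
        self.rt = line_rate_gbps   # target rate remembered before a cut
        self.alpha = 1.0           # congestion estimate (EWMA of CNP arrivals)
        self.g = g                 # EWMA gain

    def on_cnp(self):
        """CNP received: push alpha toward 1 and cut the current rate."""
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)

    def on_quiet_period(self):
        """No CNP in an update period: decay alpha and recover toward rt."""
        self.alpha = (1 - self.g) * self.alpha
        self.rc = (self.rc + self.rt) / 2

sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp();          print(f"after CNP:   {sender.rc:.1f} Gbps")   # 50.0
sender.on_quiet_period(); print(f"after quiet: {sender.rc:.1f} Gbps")   # 75.0
```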
Lossless Network Multi‑Path Transmission
Modern data centers use CLOS architectures with ECMP for load balancing, but ECMP is stateless and can cause “elephant flow” collisions, especially under failures.
Multi-path RDMA splits traffic at the packet level across several paths; it still depends on a lossless fabric and must handle the packet reordering that multi-path forwarding introduces. Research such as Microsoft's “Multi-Path Transport for RDMA in Datacenters” (MP-RDMA, NSDI 2018) proposes congestion-aware path selection combined with out-of-order-aware handling on the RNIC.
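The difference between flow-level ECMP and packet-level splitting can be illustrated numerically: ECMP hashes each flow's 5-tuple once, so two elephant flows can hash onto the same uplink while others sit idle, whereas spraying packets round-robin over the same uplinks balances load perfectly but delivers packets out of order. The hash function and four-uplink topology below are made up for the example and are not any switch's real algorithm.

```python
# Illustrative contrast: flow-level ECMP hashing vs. packet-level spraying.
import zlib

UPLINKS = 4

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto=17) -> int:
    """Flow-level ECMP: every packet of a flow hashes to the same uplink."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % UPLINKS

# Two elephant flows may land on the same uplink and collide there.
print(ecmp_uplink("10.0.0.1", "10.0.1.1", 49152, 4791))
print(ecmp_uplink("10.0.0.2", "10.0.1.2", 49153, 4791))

def spray_uplink(packet_seq: int) -> int:
    """Packet-level spraying: perfect balance, but packets can arrive reordered."""
    return packet_seq % UPLINKS
```

Varying the UDP source port per queue pair, as RoCEv2 allows, gives ECMP more flows to spread, but only a multi-path transport such as MP-RDMA can actively steer traffic away from congested paths while coping with the resulting reordering.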