How veRoCE Extends RoCEv2 to Tolerate Loss and Reordering without Lossless Networks
veRoCE is a ByteDance‑engineered RDMA protocol that retains RoCEv2 compatibility while adding multi‑path transmission, native out‑of‑order delivery, efficient loss detection and selective retransmission, allowing high‑performance clusters to operate over lossy Ethernet without PFC.
#01 Design Overview
Mainstream RoCEv2 deployments have two critical limitations: they depend on PFC‑based lossless Ethernet, which is hard to keep stable at large scale, and they lack multipath support, so ECMP hash collisions leave bandwidth unused.
ByteDance announced the veRoCE transport protocol at the Volcano Engine Force conference. veRoCE keeps RoCEv2‑compatible semantics and user interfaces while introducing a series of extensions that let RDMA tolerate packet loss and reordering, thus eliminating the dependence on lossless networks.
Core Features
Multi‑Path Transmission
Native Out‑of‑Order Delivery
Efficient Packet‑Loss Detection
Hardware‑Friendly Selective Retransmission
Independent PSN Space
Flexible Congestion Control (FCC) Framework
#02 Transport Header Extensions
veRoCE adds several extension headers to the RoCEv2 packet format and refines the checksum mechanism to support the new functions.
Independent UDP Port : Uses an IANA‑assigned UDP port (expected 4794) distinct from RoCEv2’s 4791.
Enhanced CRC : A 32‑bit CRC covers the IP, UDP and transport headers as well as the payload; fields that may change in transit are set to a fixed baseline value before the CRC is computed, so integrity can still be verified end to end.
Extension Header Set
Basic Transport Header (BTH) : Retains the original IB format and adds new opcodes: SACK, ACK_Rsp/SACK_Rsp, RTT Request/Response, and a slow‑path detection opcode.
MSN Extension Header (MSNETH) : Introduces a 24‑bit Message Sequence Number (MSN) to associate packets with complete messages; senders allocate MSN for Send/Write/Read requests, receivers allocate independent MSN for responses.
Packet Offset Extension Header (POETH) : Contains a 24‑bit Packet Offset (PO) that indicates the payload’s location in target memory, enabling Direct Data Placement (DDP) even with out‑of‑order packets; ACK packets carry PO = aPSN+1 to locate missing fragments quickly.
RQ Extension Header (RQETH) : Holds a 24‑bit RQMSN linking Send/Write‑with‑Immediate packets to the receiver’s RQE, so that packets arriving out of order are still matched to the correct RQE.
ACK Extension Header (AETH) : Carries a 24‑bit aMSN (largest received message sequence number) for precise request‑response correlation and an 8‑bit Syndrome field that redefines RoCEv2’s “PSN sequence error” as a “Packet Drop NAK”.
SACK Extension Header (SACKETH) : Includes a 24‑bit bitmap start PSN, an 8‑bit valid length, and a 128‑bit PSN bitmap to precisely mark received packet ranges; supports delayed SACK to reduce bandwidth when reordering is low.
RTT Probe Extension Header : Carries four 32‑bit timestamps (Tx1, Rx1, Tx2, Rx2) to compute network RTT as (Rx2‑Tx1)‑(Tx2‑Rx1).
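As an illustration of how a packet offset enables Direct Data Placement, the sketch below writes each payload straight to its final position, whatever order the packets arrive in. It is a minimal sketch under stated assumptions: the source does not say which unit PO is counted in, so the code assumes one PO step equals one MTU-sized payload, and `MTU`, `place_payload` and the sample buffer are hypothetical names.

```python
MTU = 4096  # assumed payload size per packet, in bytes (not specified by the source)

def place_payload(buffer: bytearray, base: int, packet_offset: int, payload: bytes) -> None:
    """Write one packet's payload at the position implied by its packet offset."""
    start = base + packet_offset * MTU
    buffer[start:start + len(payload)] = payload

# Three packets of one message arrive out of order; no reassembly queue is needed.
message = bytearray(3 * MTU)
arrivals = [(2, b"C" * MTU), (0, b"A" * MTU), (1, b"B" * MTU)]
for po, data in arrivals:
    place_payload(message, base=0, packet_offset=po, payload=data)
```

Because every packet carries its own placement information, the receiver does not have to buffer early arrivals until the gap before them is filled.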
#03 Reliable Connection Service
veRoCE’s RC service guarantees “at‑most‑once, loss‑free delivery” even on lossy, multipath networks. It revolves around three pillars: sequence‑number management, acknowledgment protocol, and loss‑recovery, while staying compatible with existing RDMA semantics and the RoCEv2 ecosystem.
Dual Sequence‑Number Mechanism
Separates packet‑level PSN from message‑level MSN, allocating independent spaces to track packets and messages precisely.
Three Acknowledgment Types
ACK: Cumulative acknowledgment of in‑order packets, supporting ACK merging.
SACK: Selective acknowledgment of out‑of‑order packets using a 128‑bit bitmap.
NAK: Explicit notification of packet loss, redefining RoCEv2’s “PSN sequence error” as “Packet Drop NAK”.
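To make the SACK fields concrete, the following sketch builds the start PSN, valid length and 128-bit bitmap described for SACKETH, and reads the gaps back out of it. The bit ordering (bit i stands for start PSN + i) and the meaning of the valid length (number of leading bits that carry information) are assumptions made for illustration; `build_sack` and `missing_psns` are hypothetical helpers, not names from the protocol.

```python
PSN_MASK = (1 << 24) - 1   # PSNs are 24-bit and wrap around
BITMAP_BITS = 128

def build_sack(start_psn: int, received: set[int]) -> tuple[int, int, int]:
    """Return (start_psn, valid_len, bitmap) covering PSNs start_psn .. start_psn+127."""
    bitmap = 0
    valid_len = 0
    for i in range(BITMAP_BITS):
        psn = (start_psn + i) & PSN_MASK
        if psn in received:
            bitmap |= 1 << i
            valid_len = i + 1   # bits past the last received PSN carry no information
    return start_psn, valid_len, bitmap

def missing_psns(start_psn: int, valid_len: int, bitmap: int) -> list[int]:
    """List the PSNs inside the valid range whose bit is clear, i.e. the gaps."""
    return [(start_psn + i) & PSN_MASK
            for i in range(valid_len)
            if not (bitmap >> i) & 1]

sack = build_sack(100, {101, 103, 104})
gaps = missing_psns(*sack)
```

With PSNs 101, 103 and 104 received, the gaps reported relative to start PSN 100 are 100 and 102.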
Loss Detection and Fast Retransmission
Detection: Loss is inferred directly from gaps in the SACK bitmap; expiry of the retransmission timeout (RTO) also marks packets as lost.
Fast Selective Retransmission: Maintains RxtPSN (last retransmitted PSN) and only retransmits packets with PSN > RxtPSN; if RxtPSN stalls, it resets to aPSN to allow a second retransmission pass.
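The RxtPSN rule can be sketched as a small piece of sender state. This is a simplified model, not the actual implementation: it ignores 24-bit PSN wraparound for readability, and how a stall is detected (here an explicit `on_stall` call, for example from a timer) is an assumption. `Retransmitter` and its methods are hypothetical names.

```python
class Retransmitter:
    def __init__(self, a_psn: int):
        self.a_psn = a_psn      # aPSN: highest cumulatively acknowledged PSN
        self.rxt_psn = a_psn    # RxtPSN: last PSN that was retransmitted

    def on_sack(self, lost: list[int]) -> list[int]:
        """Return the PSNs to retransmit now, skipping those already resent."""
        to_send = sorted(p for p in lost if p > self.rxt_psn)
        if to_send:
            self.rxt_psn = to_send[-1]
        return to_send

    def on_stall(self) -> None:
        """RxtPSN made no progress: reset it so a second pass can resend again."""
        self.rxt_psn = self.a_psn

r = Retransmitter(a_psn=99)
first = r.on_sack([100, 102])          # both are new losses
second = r.on_sack([100, 102, 105])    # only 105 lies beyond RxtPSN
r.on_stall()
third = r.on_sack([100, 105])          # second pass covers everything still missing
```

Keeping a single high-water mark instead of per-packet retransmission state is what makes the scheme cheap to hold in hardware; the reset to aPSN covers the case where a retransmitted packet is itself lost.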
#04 Congestion Control and Path Selection
Flexible Congestion Control (FCC) Framework
veRoCE introduces FCC, which decouples congestion‑signal generation from transmission and supports multiple modes. Core ideas are signal‑independent transmission, multi‑mode selection, and precise delay probing.
Dual‑Mode Signal Delivery: Supports in‑band and out‑of‑band signalling.
Signal Generation Rules: CNP generation rate is configurable, allowing ACK‑like merging strategies.
Precise RTT Probing
RTTReqETH/RTTRspETH carry four 32‑bit timestamps to measure network RTT accurately.
Three RTT derivation methods feed congestion‑control algorithms.
The UDP source port of an RTT request binds the probe to a specific path; the response reuses the same port, so the measured delay belongs to that path.
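The RTT formula (Rx2‑Tx1)‑(Tx2‑Rx1) can be worked through in a few lines. Each difference is taken between two timestamps from the same host, so the two clocks do not need to be synchronised. The sketch assumes the 32-bit timestamps are free-running counters that wrap around; the tick unit is not given in the source, and `network_rtt` is a hypothetical helper.

```python
TS_MASK = (1 << 32) - 1  # timestamps are 32-bit and may wrap

def network_rtt(tx1: int, rx1: int, tx2: int, rx2: int) -> int:
    """RTT in timestamp ticks: total elapsed time minus responder hold time."""
    total = (rx2 - tx1) & TS_MASK            # measured on the requester's clock
    responder_hold = (tx2 - rx1) & TS_MASK   # measured on the responder's clock
    return (total - responder_hold) & TS_MASK

# Request sent at 1000, response received at 2500 (requester clock);
# the responder held the probe for 300 ticks on its own, unrelated clock.
rtt = network_rtt(tx1=1000, rx1=500_000, tx2=500_300, rx2=2500)
```

Here the requester sees 1500 ticks elapse and the responder accounts for 300 of them, leaving 1200 ticks spent in the network.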
Two Congestion‑Control Modes
Path‑Level Mode (Sender‑Spreading) : Each path gets an independent UDP source port; ECMP at the switch distributes traffic. Each path maintains its own congestion‑control context (CCC) or shares one via weighted round‑robin. Receiver distinguishes paths by CNP source port.
Connection‑Level Mode (Switch Adaptive Routing) : Sender and receiver cannot distinguish paths; they aggregate ECN‑marked packets across all paths, maintain a single global rate, and distribute it evenly. Per‑path RTT still identifies slow paths for optimisation.
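Path-level mode can be pictured as one congestion-control context per UDP source port. The sketch below is illustrative only: it assumes round-robin spreading, an equal initial share of the line rate per path, that a CNP can be mapped back to a path through its port, and a halving of the rate on each CNP. None of these specifics come from the source, and `PathContext` and `PathLevelSender` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class PathContext:
    src_port: int
    rate_gbps: float

class PathLevelSender:
    def __init__(self, src_ports: list[int], line_rate_gbps: float):
        share = line_rate_gbps / len(src_ports)
        # One congestion-control context (CCC) per path, keyed by source port.
        self.paths = {p: PathContext(p, share) for p in src_ports}
        self._next = cycle(src_ports)

    def pick_port(self) -> int:
        """Spread packets over paths; ECMP hashes each source port to a route."""
        return next(self._next)

    def on_cnp(self, port: int) -> None:
        """Congestion reported for one path slows only that path."""
        self.paths[port].rate_gbps /= 2   # assumed multiplicative decrease

s = PathLevelSender([49152, 49153], line_rate_gbps=100.0)
ports = [s.pick_port() for _ in range(3)]
s.on_cnp(49153)
```

In connection-level mode the dictionary would collapse to a single context, since the endpoints cannot tell the paths apart.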
Path Selection Module
The module detects degraded paths and shifts traffic to higher‑quality routes, working together with FCC to improve overall network utilisation.
Slow‑Path Detection: When the PSN of a received data packet lags the highest PSN received so far (hPSN) by more than a threshold, the receiver marks the packet as “slow” and sends a Slow‑Packet Signal to the sender.
Path Marking: If a sender receives multiple slow‑packet signals from the same path within a time window, it marks the path as “slow” and migrates traffic to better paths.
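The two steps above can be sketched as a receiver-side detector and a sender-side marker. The threshold, window length and signal count are placeholder parameters, since the source gives no values, and the class names are hypothetical. PSN comparisons use 24-bit modular arithmetic so the sketch keeps working across wraparound.

```python
from collections import defaultdict, deque

PSN_MASK = (1 << 24) - 1

class SlowPacketDetector:
    """Receiver side: flag packets that lag hPSN by more than a threshold."""
    def __init__(self, threshold: int):
        self.threshold = threshold
        self.h_psn = None

    def on_packet(self, psn: int) -> bool:
        if self.h_psn is None:
            self.h_psn = psn
            return False
        ahead = (psn - self.h_psn) & PSN_MASK
        if 0 < ahead < (1 << 23):        # psn is newer than hPSN
            self.h_psn = psn
            return False
        lag = (self.h_psn - psn) & PSN_MASK
        return lag > self.threshold      # True means: send a Slow-Packet Signal

class PathMarker:
    """Sender side: mark a path slow after enough signals inside a time window."""
    def __init__(self, window: float, min_signals: int):
        self.window = window
        self.min_signals = min_signals
        self.signals = defaultdict(deque)

    def on_slow_signal(self, path: int, now: float) -> bool:
        q = self.signals[path]
        q.append(now)
        while q and now - q[0] > self.window:
            q.popleft()                  # forget signals older than the window
        return len(q) >= self.min_signals

d = SlowPacketDetector(threshold=8)
flags = [d.on_packet(p) for p in (100, 120, 115, 105)]
m = PathMarker(window=1.0, min_signals=3)
marks = [m.on_slow_signal(49152, t) for t in (0.0, 0.4, 0.8)]
```

Requiring several signals within a window keeps a single late packet from triggering a migration.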
#05 Summary
veRoCE removes RDMA’s traditional reliance on lossless Ethernet through lightweight extensions: new transport headers, dual sequence numbers, and selective acknowledgments. It preserves RDMA’s core advantages, remains compatible with the existing RoCEv2 ecosystem, and directly addresses the communication bottlenecks of massive AI clusters.
Network Intelligence Research Center (NIRC)
NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
