Understanding RoCE: RoCEv1, RoCEv2, and Soft‑RoCE in Data Center Networks
RoCE (RDMA over Converged Ethernet) enables lossless, high‑performance data transfer in data‑center networks by extending InfiniBand over Ethernet, with RoCEv1 operating at Layer 2, RoCEv2 adding UDP/IPv4/IPv6 routing, and Soft‑RoCE providing a software‑only solution for environments lacking RDMA‑capable hardware.
Ethernet remains dominant in the global Internet, but in high‑bandwidth, low‑latency private networks its limitations have led to the development of Data Center Bridging (DCB) and lossless links based on RDMA/InfiniBand, culminating in the RoCE (RDMA over Converged Ethernet) standard.
RoCEv1
Released by the IBTA in April 2010 as an addendum to the InfiniBand Architecture Specification (and also known as IBoE), RoCEv1 replaces the TCP/IP stack with the InfiniBand network layer. It operates at the Ethernet link layer (Ethertype 0x8915) and therefore does not support IP routing: the InfiniBand link-layer header is removed, and the GUID is mapped to a MAC address. RoCE depends on lossless Ethernet, which requires L2 QoS mechanisms such as Priority Flow Control (PFC); every endpoint, switch, and router on the path must support PFC for the link to function correctly.
RoCEv1 frame structure diagram
For the full protocol specification, see InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE.
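As a rough illustration of the framing described above, the following sketch (plain Python, standard library only; the MAC addresses and the zeroed InfiniBand headers are placeholders invented for the example) packs an Ethernet II header carrying the RoCEv1 Ethertype, followed by space for the InfiniBand GRH and BTH:

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # IBTA-assigned Ethertype for RoCEv1

def eth_header(dst_mac: bytes, src_mac: bytes, ethertype: int) -> bytes:
    """Build a 14-byte Ethernet II header: dst MAC, src MAC, Ethertype."""
    return dst_mac + src_mac + struct.pack("!H", ethertype)

# Hypothetical MAC addresses, purely for illustration.
dst = bytes.fromhex("0242ac110002")
src = bytes.fromhex("0242ac110003")

# RoCEv1 payload begins with the InfiniBand GRH (40 bytes) and BTH
# (12 bytes); zeroed placeholders stand in for the real headers here.
grh = bytes(40)
bth = bytes(12)
frame = eth_header(dst, src, ROCE_V1_ETHERTYPE) + grh + bth

assert frame[12:14] == b"\x89\x15"  # Ethertype marks the frame as RoCE
print(len(frame))  # 66 header bytes before payload, ICRC, and FCS
```

Because the only addressing available is the MAC header, a frame like this can never cross an L3 boundary, which is exactly the limitation RoCEv2 removes.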
RoCEv2
Because RoCEv1 frames lack an IP header and can only communicate within a single L2 subnet, the IBTA introduced RoCEv2 in 2014. RoCEv2 replaces the InfiniBand Global Routing Header (GRH) with an IP header plus a UDP header, enabling routing across L3 networks. The frame structure is shown below.
RoCEv2 frame structure
RoCEv2 packet example
RoCEv1 operates at Layer 2 with Ethertype 0x8915; the standard MTU is 1500 bytes, and jumbo frames allow up to 9000 bytes.
RoCEv2 operates at Layer 3 over UDP/IPv4 or UDP/IPv6, using UDP port 4791; because it is routable, it is sometimes called “Routable RoCE” or “RRoCE”.
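To make the contrast concrete, here is a minimal sketch (standard-library Python; the IP addresses and the source port are invented, and the 12-byte BTH is a zeroed placeholder) of the RoCEv2 encapsulation, in which an IPv4 header and a UDP header addressed to port 4791 take the place of the InfiniBand GRH:

```python
import struct

ROCE_V2_UDP_PORT = 4791  # IANA-assigned destination port for RoCEv2

def ipv4_header(src: bytes, dst: bytes, payload_len: int) -> bytes:
    """Minimal 20-byte IPv4 header carrying UDP (protocol 17); checksum omitted."""
    ver_ihl = (4 << 4) | 5  # version 4, 5 x 32-bit header words
    return struct.pack("!BBHHHBBH4s4s",
                       ver_ihl, 0, 20 + payload_len,
                       0, 0, 64, 17, 0, src, dst)

def udp_header(src_port: int, dst_port: int, payload_len: int) -> bytes:
    """8-byte UDP header; checksum left zero (optional over IPv4)."""
    return struct.pack("!HHHH", src_port, dst_port, 8 + payload_len, 0)

bth = bytes(12)  # placeholder for the InfiniBand Base Transport Header
udp = udp_header(49152, ROCE_V2_UDP_PORT, len(bth))
ip = ipv4_header(bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
                 len(udp) + len(bth))
packet = ip + udp + bth

assert packet[9] == 17               # IP protocol field: UDP
assert packet[22:24] == b"\x12\xb7"  # UDP destination port 4791
```

Because standard IP and UDP headers now wrap the InfiniBand transport payload, ordinary L3 routers can forward the packet, which is what earns RoCEv2 the name "Routable RoCE".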
Soft‑RoCE
Since Linux kernel 4.9, a software implementation of RoCEv2 called Soft-RoCE has been available. Unlike hardware RoCE, Soft-RoCE runs in any Ethernet environment, with no need for RDMA-capable NICs, switches, or L2 QoS support. It consists of a user-space library, librxe, that plugs into the RDMA stack (libibverbs), and a kernel module, rxe.ko, that attaches to the Linux network stack. RoCE traffic is carried over a UDP tunnel on an ordinary Ethernet NIC, which is exposed to applications as a virtual RDMA device.
Soft‑RoCE communication diagram
In performance‑sensitive virtualized scenarios, Soft‑RoCE enables VMs to access RDMA functionality without exposing physical NICs, offering a low‑cost way to build efficient RDMA networks in data centers that lack specialized hardware.
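On a reasonably recent distribution, attaching a Soft-RoCE device to an ordinary NIC takes only a couple of commands. A sketch follows; the interface name eth0 and device name rxe0 are examples, and exact module and tool availability varies by kernel and distribution:

```shell
# Load the Soft-RoCE kernel module (rdma_rxe on current kernels)
sudo modprobe rdma_rxe

# Attach a virtual RDMA device to an ordinary Ethernet interface
# ("rxe0" and "eth0" are example names)
sudo rdma link add rxe0 type rxe netdev eth0

# Verify that the new device is visible to the verbs stack
ibv_devices
rdma link show
```

Once the rxe device exists, unmodified libibverbs applications can use it exactly as they would a hardware RNIC.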
Network Requirements
RoCE can operate in both lossless and lossy network environments. In a lossy environment it is referred to as Resilient RoCE; in a lossless environment it is called Lossless RoCE.
Resilient RoCE – operates over lossy networks without PFC, relying on ECN-based congestion control to tolerate packet loss.
Lossless RoCE – requires PFC flow control to guarantee a lossless fabric.
Summary: Although RoCE imposes special dependencies on the link and physical layers, modern switches, NICs, and SoCs typically integrate DCB and RDMA support, making RoCE the optimal choice for new data-center or SAN deployments. For legacy expansions or cost-sensitive optimizations, iWARP RNICs or the hardware-independent Soft-RoCE are more appropriate.
References
https://www.cnblogs.com/echo1937/p/7018266.html
http://hustcat.github.io/roce-protocol/
RoCE: An Ethernet‑InfiniBand Love Story
InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE
InfiniBand™ Architecture Specification Release 1.2.1 Annex A17: RoCEv2
RoCEv2 CNP Packet Format Example
Architects' Tech Alliance