Why InfiniBand Still Beats Ethernet: Deep Dive into RDMA, Omni‑Path, and iWARP
This article provides a comprehensive technical analysis of InfiniBand’s protocol layers, topology, and performance advantages, compares Omni‑Path’s architecture, explains RDMA fundamentals, and details Ethernet‑based RDMA protocols such as RoCE and iWARP, highlighting their trade‑offs and use cases.
InfiniBand Protocol Stack
InfiniBand follows a strict layered architecture. Each layer provides services to the layer above while remaining independent.
Physical Layer: Defines electrical and mechanical characteristics, media (copper or optical), and the conversion of bits to symbols and frames.
Link Layer: Specifies packet format, flow control, switching within a subnet, and encoding/decoding. Two packet types exist: management packets and data packets.
Network Layer: Adds a 40‑byte Global Route Header (GRH) to packets that are routed between subnets. Routers check and regenerate only the variant CRC; the invariant CRC, which they leave untouched, preserves end‑to‑end integrity.
Transport Layer: Delivers packets to the destination Queue Pair (QP). If a message exceeds the path MTU, the transport layer segments it on send and reassembles it on reception.
Upper‑Layer Protocols: Expose a Verbs API (Memory Verbs for one‑sided RDMA: Read, Write, Atomic; Messaging Verbs for two‑sided RDMA: Send, Receive) and support protocols such as SDP, SRP, iSER, RDS, IPoIB, and uDAPL.
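The segmentation and reassembly done by the transport layer can be sketched in a few lines of Python. This is an illustrative model rather than InfiniBand code: the `Packet` class, its field names, and the `segment` and `reassemble` helpers are assumptions made for this example, and the headers and CRCs carried by real packets are omitted.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    psn: int        # packet sequence number (hypothetical field name)
    last: bool      # True on the final packet of a message
    payload: bytes


def segment(message: bytes, mtu: int, start_psn: int = 0) -> list[Packet]:
    """Split a message into packets carrying at most `mtu` payload bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty message still produces one (empty) packet.
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    return [
        Packet(psn=start_psn + n, last=(n == len(chunks) - 1), payload=chunk)
        for n, chunk in enumerate(chunks)
    ]


def reassemble(packets: list[Packet]) -> bytes:
    """Rebuild the message, checking that PSNs are consecutive and complete."""
    if not packets:
        raise ValueError("no packets")
    for prev, cur in zip(packets, packets[1:]):
        if cur.psn != prev.psn + 1:
            raise ValueError(f"gap after PSN {prev.psn}")
    if not packets[-1].last:
        raise ValueError("message is incomplete")
    return b"".join(p.payload for p in packets)


message = bytes(range(256)) * 40        # a 10,240-byte message
packets = segment(message, mtu=4096)    # 4096 bytes is one of the InfiniBand MTU sizes
print(len(packets), [len(p.payload) for p in packets])
print(reassemble(packets) == message)
```

With a 4096-byte MTU the 10,240-byte message becomes three packets, the last one partially filled, and the receiver recovers the original bytes only if no packet is missing.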
InfiniBand Topology
The fabric consists of four logical components:
Host Channel Adapter (HCA) – bridges the host memory controller to the fabric.
Target Channel Adapter (TCA) – packages I/O device signals for the HCA.
InfiniBand Link – copper or optical cabling, 1, 4, or 12 lanes wide (1x, 4x, 12x), connecting HCAs, TCAs, and switches.
Switches / Routers – provide arbitrary fabric connectivity.
Both HCA and TCA are programmable DMA engines with protection features.
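Link width translates into bandwidth by simple arithmetic: lanes times per-lane signaling rate times line-code efficiency. The sketch below works that out in Python. The per-lane figures for the SDR, QDR, and EDR generations are not taken from this article; they are added here as illustrative inputs, and the function name is hypothetical.

```python
from fractions import Fraction

# (per-lane signaling rate in Gb/s, line-code efficiency)
# Generation figures are illustrative inputs added for this example.
GENERATIONS = {
    "SDR": (Fraction(5, 2), Fraction(8, 10)),      # 2.5 Gb/s, 8b/10b encoding
    "QDR": (Fraction(10), Fraction(8, 10)),        # 10 Gb/s, 8b/10b encoding
    "EDR": (Fraction(825, 32), Fraction(64, 66)),  # 25.78125 Gb/s, 64b/66b encoding
}
VALID_WIDTHS = (1, 4, 12)  # lane counts named in the text


def link_data_rate_gbps(generation: str, lanes: int) -> float:
    """Usable data rate of a link: lanes x signaling rate x encoding efficiency."""
    if lanes not in VALID_WIDTHS:
        raise ValueError(f"unsupported link width: {lanes}x")
    signaling, efficiency = GENERATIONS[generation]
    return float(signaling * efficiency * lanes)


for gen in GENERATIONS:
    print(gen, [link_data_rate_gbps(gen, width) for width in VALID_WIDTHS])
```

Under these inputs a 4x EDR link works out to 100 Gb/s of data, the same class of link speed the Omni‑Path section refers to.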
Omni‑Path Architecture
Omni‑Path (derived from QLogic’s True Scale line) upgrades the physical layer to 100 Gbps and follows the open‑source OFED framework. Intel integrates Omni‑Path functions into its CPUs, creating a tightly coupled CPU‑fabric solution.
After acquiring Cray’s interconnect assets, Intel added a Link Transport Layer based on Cray’s Aries technology. This layer provides reliable Layer 2 packet delivery, flow control, and congestion management.
Key Omni‑Path Components
HFI – Host Fabric Interface, which connects each host to the fabric.
Switches – enable large‑scale, arbitrary topologies.
Fabric Manager – central provisioning and monitoring of fabric resources.
RDMA Overview
Remote Direct Memory Access (RDMA) moves data directly between the memory of two hosts without involving either host’s CPU in the data path, which eliminates intermediate memory copies and reduces latency. The RDMA‑capable NIC (RNIC) parses packets and delivers their payload straight to application buffers, offloading network, transport, and sometimes application processing.
Typical benefits:
Microsecond‑scale network latency, which suits latency‑sensitive storage traffic such as NVMe operations.
Reduced CPU utilization and higher effective bandwidth.
Zero‑copy data movement across the network.
iWARP (RDMA over TCP) Protocol Stack
iWARP implements RDMA on standard TCP/IP networks. The stack (top‑down) is:
RDMA Layer – translates RDMA Read/Write/Atomic operations into RDMA messages.
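Direct Data Placement (DDP) Layer – segments long RDMA messages into DDP segments, each carrying the placement information the receiver needs to write it directly into the destination buffer.
Marker PDU Aligned Framing (MPA) Layer – frames each DDP segment with a length field and CRC, and can insert periodic markers, so the receiver can find segment boundaries in the TCP byte stream.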
TCP Layer – provides reliable, ordered delivery of MPA packets.
IP Layer – supplies routing information.
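The DDP and MPA steps in the stack above can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, DDP headers and MPA markers are left out, and `zlib.crc32` stands in for the CRC32c checksum that MPA actually specifies, because the Python standard library has no CRC32c.

```python
import struct
import zlib


def ddp_segment(message: bytes, max_payload: int) -> list[bytes]:
    """Cut one RDMA message into segments of at most `max_payload` bytes."""
    return [message[i:i + max_payload] for i in range(0, len(message), max_payload)]


def mpa_frame(segment: bytes) -> bytes:
    """Frame a segment: 2-byte length, payload, pad to 4 bytes, 4-byte CRC."""
    header = struct.pack("!H", len(segment))
    pad = b"\x00" * (-(len(header) + len(segment)) % 4)
    body = header + segment + pad
    # Stand-in checksum: real MPA uses CRC32c, not the CRC-32 computed here.
    return body + struct.pack("!I", zlib.crc32(body))


def mpa_unframe(frame: bytes) -> bytes:
    """Verify the CRC and return the original segment."""
    body, (crc,) = frame[:-4], struct.unpack("!I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    (length,) = struct.unpack("!H", body[:2])
    return body[2:2 + length]


message = b"x" * 3000
frames = [mpa_frame(seg) for seg in ddp_segment(message, max_payload=1400)]
# TCP would carry these frames reliably and in order; here they stay in a list.
received = b"".join(mpa_unframe(f) for f in frames)
print(len(frames), received == message)
```

Framing is what lets the receiving RNIC recover segment boundaries from a byte stream that TCP itself treats as unstructured.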
The user‑space interface is the Verbs API, offering:
Memory Verbs (One‑Sided): RDMA Read, Write, Atomic. The remote CPU does not participate.
Messaging Verbs (Two‑Sided): RDMA Send, Receive. The remote application must post a matching receive before the send arrives.
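The difference between the two verb families is easier to see in a toy model. The Python below is not the Verbs API: the `Node` class and the `rdma_write`, `rdma_read`, and `send` functions are hypothetical stand-ins that only mimic the semantics, namely that one-sided operations touch registered remote memory without notifying the remote side, while a send consumes a receive the remote side posted in advance.

```python
class Node:
    """A toy host: registered memory regions plus a receive queue."""

    def __init__(self, name: str):
        self.name = name
        self.regions = {}       # rkey -> bytearray (registered memory)
        self.recv_queue = []    # receive buffers posted by the application
        self.completions = []   # events the local application gets to see
        self._next_rkey = 1

    def register(self, size: int) -> int:
        """Register a memory region and return the key a peer needs to access it."""
        rkey = self._next_rkey
        self._next_rkey += 1
        self.regions[rkey] = bytearray(size)
        return rkey

    def post_recv(self, size: int) -> None:
        self.recv_queue.append(bytearray(size))


def rdma_write(target: Node, rkey: int, offset: int, data: bytes) -> None:
    """One-sided: place data in the target's memory; the target sees no event."""
    region = target.regions[rkey]
    if offset < 0 or offset + len(data) > len(region):
        raise ValueError("write outside registered region")
    region[offset:offset + len(data)] = data


def rdma_read(target: Node, rkey: int, offset: int, length: int) -> bytes:
    """One-sided: fetch data from the target's memory."""
    return bytes(target.regions[rkey][offset:offset + length])


def send(target: Node, data: bytes) -> None:
    """Two-sided: needs a posted receive and raises a completion on the target."""
    if not target.recv_queue:
        # Real hardware signals a receiver-not-ready condition in this case.
        raise RuntimeError("no receive posted on " + target.name)
    buf = target.recv_queue.pop(0)
    if len(data) > len(buf):
        raise ValueError("message larger than posted receive buffer")
    buf[:len(data)] = data
    target.completions.append(bytes(buf[:len(data)]))


server = Node("server")
rkey = server.register(16)
rdma_write(server, rkey, 4, b"abc")
print(rdma_read(server, rkey, 4, 3), server.completions)

server.post_recv(64)
send(server, b"hello")
print(server.completions)
```

After the write, the server's memory has changed but its completion list is still empty; only the send produces an event for the server application to handle.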
Upper‑Layer Protocols for InfiniBand
SDP (Sockets Direct Protocol) – lets stream‑socket applications written for TCP run over InfiniBand.
SRP (SCSI RDMA Protocol) – transports SCSI commands via RDMA for storage.
iSER (iSCSI Extensions for RDMA) – RDMA‑accelerated iSCSI.
RDS (Reliable Datagram Sockets) – UDP‑like socket communication on InfiniBand.
IPoIB (IP over InfiniBand) – provides an IP layer on top of InfiniBand, making the fabric transparent to existing IP applications.
uDAPL (User Direct Access Programming Library) – standard API for RDMA‑enabled interconnects.
IPoIB Details and Limitations
An IPoIB interface has a 20‑byte hardware (MAC) address, rather than the 6 bytes of Ethernet, and requires a continuously available Subnet Manager (SM) and Subnet Administrator (SA). Limitations include:
Only IP‑based applications can use the interface.
MAC address is not user‑configurable and may change on module reload.
VLAN configuration depends on the p_key managed by the SM.
IPoIB cannot be instantiated on top of iWARP or RoCE devices, because those are Ethernet devices that already carry IP natively.
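The 20‑byte address is longer than an Ethernet MAC because it packs InfiniBand addressing into it: one byte of flags, a 3‑byte queue pair number, and a 16‑byte GID, which is the layout defined for IPoIB link-layer addresses in RFC 4391. The parser below is a minimal sketch; the function name and the sample address are made up for illustration.

```python
import ipaddress


def parse_ipoib_hwaddr(text: str) -> dict:
    """Split a colon-separated 20-byte IPoIB hardware address into its fields."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError(f"expected 20 bytes, got {len(raw)}")
    return {
        "flags": raw[0],                                # reserved/flags byte
        "qpn": int.from_bytes(raw[1:4], "big"),         # 24-bit queue pair number
        "gid": ipaddress.IPv6Address(raw[4:20]),        # 128-bit GID, IPv6-formatted
    }


# Hypothetical address, in the format that `ip link` shows for an ib interface.
sample = "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:12:34:56"
fields = parse_ipoib_hwaddr(sample)
print(hex(fields["flags"]), fields["qpn"], fields["gid"])
```

Because the queue pair number is part of the address, it is plausible for the address to differ after the driver is reloaded, which is consistent with the limitation listed above.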
Comparison of RDMA over InfiniBand, RoCE, and iWARP
All three expose the same Verbs API, but their transport characteristics differ:
InfiniBand uses cut‑through switching and credit‑based link‑level flow control, so packets are never dropped for lack of buffer space. It delivers the lowest latency of the three.
RoCE (RDMA over Converged Ethernet) maps InfiniBand‑style RDMA onto Ethernet. It depends on Data Center Bridging (DCB), in particular Priority Flow Control, to make the Ethernet path lossless; latency is slightly higher than native InfiniBand.
iWARP runs RDMA over standard TCP/IP. It inherits TCP’s reliability and works on ordinary routed networks, but performance degrades sharply under packet loss because of TCP retransmission and congestion control, which limits its suitability for latency‑sensitive workloads.
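Credit‑based flow control is the mechanism behind InfiniBand's lossless behavior: a sender transmits only while it holds credits, and each credit corresponds to a free buffer slot at the receiver. The simulation below is a toy model of that idea, not the InfiniBand link protocol; the class, function, and parameter names are assumptions made for the example.

```python
from collections import deque


class Receiver:
    def __init__(self, buffer_slots: int):
        self.capacity = buffer_slots
        self.buffer = deque()

    def accept(self, packet: int) -> None:
        if len(self.buffer) >= self.capacity:
            # With credits honored, this branch is never reached.
            raise RuntimeError("buffer overflow: packet would be dropped")
        self.buffer.append(packet)

    def drain(self, n: int) -> int:
        """Consume up to n buffered packets and return that many credits."""
        freed = min(n, len(self.buffer))
        for _ in range(freed):
            self.buffer.popleft()
        return freed


def simulate(total_packets: int, buffer_slots: int, drain_per_tick: int) -> dict:
    rx = Receiver(buffer_slots)
    credits = buffer_slots          # receiver advertises one credit per free slot
    sent = ticks = max_occupancy = 0
    while sent < total_packets:
        # The sender transmits only while it holds credits.
        while credits > 0 and sent < total_packets:
            rx.accept(sent)
            credits -= 1
            sent += 1
        max_occupancy = max(max_occupancy, len(rx.buffer))
        # The receiver frees slots and hands the credits back.
        credits += rx.drain(drain_per_tick)
        ticks += 1
    return {"sent": sent, "ticks": ticks, "max_occupancy": max_occupancy}


print(simulate(total_packets=100, buffer_slots=8, drain_per_tick=2))
```

The sender slows to the receiver's drain rate instead of overrunning it, so the buffer never exceeds its capacity and nothing has to be retransmitted. Ethernet needs DCB features to approximate this, and TCP instead recovers from loss after the fact.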
NVMe Write Example Using RDMA
1. The NVMe driver packages a write command and its data into a capsule.
2. The capsule is placed on the host RNIC’s send queue and transmitted via RDMA_SEND.
3. The remote RNIC receives the capsule, strips the RDMA envelope, and writes the command and data into the remote host’s memory.
4. The remote host processes the NVMe command and returns a completion capsule via RDMA back to the initiator.
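The four steps above can be traced in a small simulation. It follows the article's flow, with the write data carried inside the capsule. The sizes used are the standard 64‑byte NVMe submission entry and 16‑byte completion entry, but the field layout inside them is simplified and hypothetical, as are the `build_write_capsule` function and `Target` class; the RDMA transfer itself is reduced to a function call.

```python
import struct

SQE_SIZE = 64     # NVMe submission queue entry size
CQE_SIZE = 16     # NVMe completion queue entry size
OPC_WRITE = 0x01  # NVMe Write opcode

# Simplified, hypothetical layout: opcode, pad byte, command id, LBA, data length.
HEADER = "<BxHQI"


def build_write_capsule(cid: int, lba: int, data: bytes) -> bytes:
    """Step 1: package the write command and its data into one capsule."""
    sqe = struct.pack(HEADER, OPC_WRITE, cid, lba, len(data)).ljust(SQE_SIZE, b"\x00")
    return sqe + data


class Target:
    """Stands in for the remote host and its storage."""

    def __init__(self):
        self.blocks = {}  # lba -> data

    def handle(self, capsule: bytes) -> bytes:
        # Step 3: the capsule has landed in target memory; parse it.
        opcode, cid, lba, length = struct.unpack_from(HEADER, capsule)
        data = capsule[SQE_SIZE:SQE_SIZE + length]
        if opcode == OPC_WRITE and len(data) == length:
            self.blocks[lba] = data
            status = 0
        else:
            status = 1  # hypothetical generic error code
        # Step 4: return a completion capsule to the initiator.
        return struct.pack("<HH", cid, status).ljust(CQE_SIZE, b"\x00")


target = Target()
capsule = build_write_capsule(cid=7, lba=2048, data=b"payload")
completion = target.handle(capsule)   # step 2: stands in for the RDMA_SEND
cid, status = struct.unpack_from("<HH", completion)
print(cid, status, target.blocks[2048])
```

Matching the command identifier in the completion against the one in the request is how the initiator ties the response back to the outstanding write.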
Architects' Tech Alliance