Why InfiniBand Still Beats Ethernet: Deep Dive into RDMA, Omni‑Path, and Protocol Layers
This article provides a comprehensive technical analysis of InfiniBand architecture, its protocol stack, comparison with Ethernet‑based RDMA solutions like RoCE and iWARP, and an overview of Omni‑Path, highlighting performance advantages, design trade‑offs, and practical limitations.
InfiniBand Overview
InfiniBand is a high‑performance interconnect designed to overcome several inherent limitations of the traditional PCI bus:
PCI uses a shared transmission mode; only one device can use the bus at a time.
Higher bus frequencies (e.g., 66 MHz, 133 MHz) increase signal interference and make multi‑bus board layouts difficult.
PCI devices reserve large memory‑mapped I/O address ranges (50–100 MiB per device), reducing usable RAM, especially with hot‑plug PCI slots.
PCI lacks built‑in error correction; a transmission error raises a non‑maskable interrupt (NMI), and there is no automatic recovery.
Protocol Layer Stack
InfiniBand follows a strict layered model. Each layer provides services to the layer above while abstracting hardware details.
Physical layer : Defines electrical/mechanical characteristics of copper or optical cables, connectors, and backplane interfaces.
Link layer : Specifies packet format, flow control, and intra‑subnet routing. It includes link‑management packets and data packets.
Network layer : Adds a 40‑byte Global Route Header (GRH) for inter‑subnet routing, analogous to the IP header in a TCP/IP network.
Transport layer : Performs segmentation/reassembly when payload exceeds the MTU, distributes packets to Queue Pairs (QPs), and ensures reliable delivery.
Upper‑layer protocols : Provide application‑level services (e.g., SDP, SRP, iSER, RDS, IPoIB, uDAPL) via the Verbs API.
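The transport layer's segmentation and reassembly step can be sketched in a few lines of Python. This is a minimal illustration, not the wire format: the function names are hypothetical, the packet sequence number simply starts at 0, and the FIRST/MIDDLE/LAST/ONLY labels only mirror how InfiniBand opcodes mark a packet's position within a message.

```python
from typing import List, Tuple

Packet = Tuple[str, int, bytes]  # (position label, sequence number, payload chunk)


def segment_message(payload: bytes, mtu: int) -> List[Packet]:
    """Split a message into chunks of at most `mtu` bytes (hypothetical helper)."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty message still travels as one (empty) packet.
    chunks = [payload[i:i + mtu] for i in range(0, len(payload), mtu)] or [b""]
    packets = []
    for psn, chunk in enumerate(chunks):
        if len(chunks) == 1:
            position = "ONLY"
        elif psn == 0:
            position = "FIRST"
        elif psn == len(chunks) - 1:
            position = "LAST"
        else:
            position = "MIDDLE"
        packets.append((position, psn, chunk))
    return packets


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message by ordering chunks on their sequence number."""
    ordered = sorted(packets, key=lambda p: p[1])
    return b"".join(chunk for _, _, chunk in ordered)


if __name__ == "__main__":
    message = bytes(range(256)) * 40          # 10240 bytes
    pkts = segment_message(message, mtu=4096)
    print([(pos, psn, len(chunk)) for pos, psn, chunk in pkts])
    print(reassemble(pkts) == message)
```

A message that fits in a single MTU yields one ONLY packet, so short messages pay no segmentation overhead.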
Network Topology
A typical InfiniBand fabric consists of four component types:
Host Channel Adapter (HCA) : Connects a host’s memory controller to the fabric and acts as a programmable DMA engine.
Target Channel Adapter (TCA) : Provides DMA‑engine capabilities for I/O devices (e.g., NICs, SCSI controllers).
InfiniBand links : Copper or optical cables that interconnect channel adapters and switches. A link aggregates 1, 4, or 12 lanes (written 1x, 4x, 12x).
Switches / routers : Switches forward packets within a subnet and can be arranged into scalable, non‑blocking topologies; routers forward packets between subnets.
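As a rough illustration of how link width translates into bandwidth, the sketch below multiplies a per-lane signaling rate by the lane count and the line-encoding efficiency. It covers only the early SDR, DDR, and QDR generations, which use 8b/10b encoding; the table and function name are assumptions made for this example, not part of any InfiniBand API.

```python
# Per-lane signaling rates in Gb/s for the 8b/10b-encoded generations.
LANE_SIGNALING_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

VALID_WIDTHS = (1, 4, 12)  # 1x, 4x, 12x links


def link_data_rate_gbps(generation: str, width: int) -> float:
    """Usable data rate of a link after 8b/10b encoding overhead (illustrative)."""
    if width not in VALID_WIDTHS:
        raise ValueError(f"width must be one of {VALID_WIDTHS}")
    signaling = LANE_SIGNALING_GBPS[generation] * width
    # 8b/10b sends 10 bits on the wire for every 8 bits of data.
    return signaling * 8 / 10


if __name__ == "__main__":
    for gen in LANE_SIGNALING_GBPS:
        for width in VALID_WIDTHS:
            print(f"{width}x {gen}: {link_data_rate_gbps(gen, width)} Gb/s of data")
```

Under these assumptions a 4x QDR link signals at 40 Gb/s but carries 32 Gb/s of data.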
Omni‑Path Architecture
Intel’s Omni‑Path (OPA) builds on the former QLogic InfiniPath (True Scale) family. Key technical points:
Physical‑layer speed increased from 40 Gb/s to 100 Gb/s.
Implements the open‑source OFED stack, exposing the same Verbs API as InfiniBand.
CPU‑fabric integration reduces latency but ties the network to Intel CPUs.
Intel’s acquisition of Cray’s interconnect assets added a “Link Transport Layer” derived from Cray’s Aries technology, which provides reliable Layer 2 packet delivery, credit‑based flow control, and per‑link management.
Core components:
HFI (Host Fabric Interface) – fabric connectivity for hosts and management nodes.
Switches – support arbitrary large‑scale topologies.
Fabric Manager – centralized provisioning and monitoring of fabric resources.
RDMA Fundamentals
Remote Direct Memory Access (RDMA) lets the network adapter move data directly between the memories of two hosts, bypassing the operating‑system kernel and avoiding intermediate copies, which reduces both latency and CPU load. RDMA operations are issued via the Verbs API and fall into two categories:
Memory verbs (one‑sided) : RDMA Read, Write, and Atomic. The remote CPU does not participate.
Messaging verbs (two‑sided) : Send and Receive. The receiver must post a receive buffer before the sender’s Send arrives, so software on both endpoints participates.
RDMA over Ethernet: RoCE and iWARP
Two Ethernet‑based RDMA protocols extend the benefits of InfiniBand to data‑center Ethernet fabrics:
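RoCE (RDMA over Converged Ethernet) carries the InfiniBand transport protocol over Ethernet: RoCE v1 runs directly on the Ethernet link layer, while RoCE v2 runs over UDP/IP and can therefore be routed. Because the InfiniBand transport assumes a lossless fabric, RoCE deployments typically rely on Data Center Bridging (DCB), in particular Priority Flow Control, to avoid packet drops.
iWARP (RDMA over TCP) carries RDMA operations over TCP/IP, allowing deployment on standard IP networks. Performance degrades sharply under packet loss because TCP retransmission and congestion control add latency.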
iWARP Protocol Stack
The iWARP stack is built on top of the standard TCP/IP layers:
RDMA Protocol (RDMAP) layer : Converts RDMA Read/Write requests into RDMA messages.
Direct Data Placement (DDP) layer : Segments long messages and places incoming data directly into the target buffer.
Marker PDU Aligned Framing (MPA) layer : Frames each DDP segment with a length field, periodic markers, and a CRC so that segment boundaries can be recovered from the TCP byte stream.
TCP layer : Provides reliable, ordered delivery of MPA segments.
IP layer : Supplies routing information.
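The framing idea behind the MPA layer can be sketched as follows. This is a simplified model under stated assumptions: it adds a 2-byte length, pads to a 4-byte boundary, and appends a 4-byte checksum, but it omits the periodic markers and uses `zlib.crc32` as a stand-in for the CRC32C that real MPA uses. The function names are hypothetical.

```python
import struct
import zlib


def mpa_frame(ulpdu: bytes) -> bytes:
    """Wrap one DDP segment as length + payload + padding + CRC (simplified)."""
    if len(ulpdu) > 0xFFFF:
        raise ValueError("segment too long for a 16-bit length field")
    header = struct.pack("!H", len(ulpdu))
    # Pad so that header + payload ends on a 4-byte boundary.
    pad_len = -(len(header) + len(ulpdu)) % 4
    body = header + ulpdu + b"\x00" * pad_len
    # Assumption: CRC-32 from zlib stands in for CRC32C here.
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack("!I", crc)


def mpa_unframe(fpdu: bytes) -> bytes:
    """Verify the checksum and return the original segment."""
    if len(fpdu) < 8 or len(fpdu) % 4 != 0:
        raise ValueError("malformed frame")
    body = fpdu[:-4]
    (crc,) = struct.unpack("!I", fpdu[-4:])
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    (length,) = struct.unpack("!H", body[:2])
    if length > len(body) - 2:
        raise ValueError("length field exceeds frame size")
    return body[2:2 + length]


if __name__ == "__main__":
    frame = mpa_frame(b"hello")
    print(len(frame), mpa_unframe(frame))
```

Because every frame carries its own length and checksum, the receiver can validate a segment and hand it to the DDP layer for direct placement.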
Upper‑Layer Protocols on InfiniBand
InfiniBand supports several higher‑level protocols that enable familiar application models:
SDP (Sockets Direct Protocol) : Allows existing TCP/IP socket applications to run over InfiniBand without code changes.
SRP (SCSI RDMA Protocol) : Encapsulates SCSI commands for block‑storage access over RDMA.
iSER (iSCSI Extensions for RDMA) : Provides iSCSI‑style storage over RDMA, standardized by IETF.
RDS (Reliable Datagram Sockets) : UDP‑like datagram service built on InfiniBand, originally from Oracle.
IPoIB (IP over InfiniBand) : Encapsulates IP packets over InfiniBand and presents a standard network interface to the operating system.
uDAPL (User Direct Access Programming Library) : A vendor‑neutral API for RDMA‑capable fabrics.
IPoIB Limitations
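The IPoIB hardware (link‑layer) address is 20 bytes long, rather than Ethernet’s 6, and is not user‑configurable.
VLAN‑style segmentation is done with partitions, so configuring it requires knowing the partition key (P_Key) defined in the Subnet Manager (SM).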
SM/SA services must be continuously available for IPoIB to function.
IPoIB is unnecessary on RoCE or iWARP fabrics because those run on Ethernet, which already carries IP traffic natively.
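The 20-byte hardware address is easier to reason about once it is split into fields. The sketch below assumes the layout described in the IPoIB specification (RFC 4391): one byte of flags, a 24-bit queue pair number (QPN), and a 16-byte GID. The helper name and the sample address are hypothetical.

```python
import ipaddress
from typing import NamedTuple


class IPoIBAddress(NamedTuple):
    flags: int                    # first byte (reserved/flags)
    qpn: int                      # 24-bit queue pair number
    gid: ipaddress.IPv6Address    # 16-byte GID, formatted like an IPv6 address


def parse_ipoib_hwaddr(text: str) -> IPoIBAddress:
    """Parse a colon-separated 20-byte IPoIB hardware address."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError(f"expected 20 bytes, got {len(raw)}")
    return IPoIBAddress(
        flags=raw[0],
        qpn=int.from_bytes(raw[1:4], "big"),
        gid=ipaddress.IPv6Address(raw[4:]),
    )


if __name__ == "__main__":
    # Made-up address for illustration only.
    sample = "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:1e:5b:a1"
    print(parse_ipoib_hwaddr(sample))
```

Since the address embeds the QPN and the port GID, it is derived from the fabric state rather than chosen by the administrator, which is why it cannot be set by hand like an Ethernet MAC.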
Code example
The software transport interface between an RDMA application and the RNIC (RDMA‑aware Network Interface Controller) is called Verbs, or the RDMA API. There are two main kinds of Verbs:
Memory Verbs, also called one‑sided RDMA : RDMA Read, RDMA Write, and RDMA Atomic. These operations need no confirmation from software on the remote host.
Messaging Verbs, also called two‑sided RDMA : RDMA Send and RDMA Receive. These operations require the remote host’s CPU to participate.
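The difference between one-sided and two-sided operations can be illustrated with a toy in-process model. This is not the real Verbs API (which is the C library libibverbs); every class and function name below is hypothetical, and "memory registration" is reduced to a dictionary lookup. The point is only to show which side has to act.

```python
from collections import deque


class ToyNode:
    """Toy stand-in for a host with an RNIC; not the real Verbs API."""

    def __init__(self, name: str):
        self.name = name
        self.regions = {}          # rkey -> bytearray, models registered memory
        self.recv_queue = deque()  # receive buffers posted by local software
        self.cpu_events = []       # messages the local CPU has to process

    def register_memory(self, rkey: int, size: int) -> None:
        self.regions[rkey] = bytearray(size)

    def post_recv(self, size: int) -> None:
        self.recv_queue.append(bytearray(size))


def rdma_write(target: ToyNode, rkey: int, offset: int, data: bytes) -> None:
    """One-sided: data lands in the target's memory with no target-side event."""
    region = target.regions[rkey]
    if offset < 0 or offset + len(data) > len(region):
        raise ValueError("write outside registered region")
    region[offset:offset + len(data)] = data


def rdma_read(target: ToyNode, rkey: int, offset: int, length: int) -> bytes:
    """One-sided: fetch bytes from the target's registered memory."""
    return bytes(target.regions[rkey][offset:offset + length])


def send(target: ToyNode, data: bytes) -> None:
    """Two-sided: consumes a receive buffer the target posted beforehand."""
    if not target.recv_queue:
        raise RuntimeError("receiver not ready: no receive buffer posted")
    buf = target.recv_queue.popleft()
    if len(data) > len(buf):
        raise ValueError("message larger than posted receive buffer")
    buf[:len(data)] = data
    target.cpu_events.append(bytes(buf[:len(data)]))


if __name__ == "__main__":
    server = ToyNode("server")
    server.register_memory(rkey=1, size=16)
    rdma_write(server, 1, 0, b"one-sided")
    print(rdma_read(server, 1, 0, 9), server.cpu_events)
    server.post_recv(32)
    send(server, b"two-sided")
    print(server.cpu_events)
```

In the model, `rdma_write` and `rdma_read` never touch `cpu_events`, while `send` fails unless the target has already called `post_recv`, mirroring the requirement that the remote CPU takes part in messaging verbs.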
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
