Understanding InfiniBand RDMA: Architecture, Advantages, and NVIDIA Quantum-2
InfiniBand RDMA, designed to network server buses, offers high bandwidth and ultra‑low latency through zero‑copy, kernel‑bypass communication, with a layered architecture (L1‑L5) and hardware components like Quantum‑2 Switch, ConnectX‑7 RNIC, and SHARP acceleration, supported by the Verbs API and OFED stack.
InfiniBand RDMA
InfiniBand was created to network server buses, inheriting the high bandwidth and low latency of bus technology. The DMA technique used in buses is realized as RDMA (Remote Direct Memory Access) in InfiniBand.
InfiniBand is a network designed specifically for RDMA; devices based on InfiniBand typically implement RDMA and guarantee reliable transmission at the hardware level. Many TOP500 supercomputers use the InfiniBand Architecture (IBA). The earliest vendors were IBM and HP; today the technology is driven primarily by NVIDIA (through its Mellanox acquisition). InfiniBand requires proprietary hardware from L2 to L4, making it relatively costly.
The native RDMA specification was published by the InfiniBand Trade Association (IBTA) in 2000; it requires NICs and switches that support the technology.
The main features of RDMA are zero‑copy and OS bypass: data moves directly between external devices and application memory without CPU intervention or context switches. Combined with RDMA, InfiniBand enables an application‑centric communication model rather than the node‑centric model of TCP/IP.
Data transfer is handled entirely by InfiniBand devices, bypassing the operating system on network nodes (Kernel Bypass), which improves both transfer efficiency and CPU utilization.
RDMA Advantages
Large Bandwidth
Compared with TCP, in a 100 Gbps RDMA scenario CPU utilization drops from 100 % to around 10 %, so the CPU is no longer the bandwidth bottleneck; the NIC hardware becomes the limiting factor.
In TCP, a common rule of thumb is that packet processing costs about 1 Hz of CPU per 1 bit/s of network I/O (1 MHz per Mbps), so a 100 Gbps link would saturate roughly 40 cores running at 2.5 GHz.
In RDMA, the CPU no longer handles packet interrupt processing, reducing latency and saving CPU cycles.
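The rule of thumb above is easy to sanity-check. This sketch assumes the commonly quoted approximation of ~1 Hz of CPU per bit/s of kernel TCP I/O; the cost factor is illustrative, not a measurement.

```python
def cores_needed(link_bps: float, core_hz: float, hz_per_bps: float = 1.0) -> float:
    """Estimate CPU cores consumed by kernel TCP processing,
    using a rule-of-thumb cost of hz_per_bps cycles per bit/s."""
    return link_bps * hz_per_bps / core_hz

# A 100 Gbps link against 2.5 GHz cores:
print(cores_needed(100e9, 2.5e9))  # 40.0 cores consumed by the stack alone
```

RDMA removes this cost entirely by letting the RNIC move data without per-packet CPU involvement.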
Low Latency
Compared with TCP, network latency drops from the millisecond level to below 10 µs.
In TCP, each packet traverses the kernel stack, causing multiple memory copies, interrupt handling, and context switches that add tens of microseconds of fixed latency.
In RDMA, the application drives the RNIC directly through the Verbs API without a system‑call transition to kernel mode, eliminating kernel overhead. Packet headers are processed on the RNIC itself, which, combined with zero‑copy, cuts latency significantly.
IB Architecture
InfiniBand Architecture (IBA) consists of the following components:
Processor Node: CPU, GPU compute nodes.
Storage Node.
HCA (Host Channel Adapter): RNIC card in compute nodes that supports the IB RDMA protocol and connects to an IB Subnet.
TCA (Target Channel Adapter).
IB Switch: supports IB L2 connections.
IB Router: supports IB L3 connections.
IB Subnet: a large IBA network is divided into multiple subnets, each supporting up to 65 536 nodes.
IB Subnet Manager: management platform that configures switches/routers and partitions subnets.
IB Protocol Stack
InfiniBand protocol stack includes L1 Physical, L2 Link, L3 Network, L4 Transport, and L5 Application layers.
L1 Physical Layer
L1 defines electrical/optical signal characteristics and physical connections (cables, connectors). Its main functions are to:
Establish the physical connection.
Monitor link status and notify L2 when the link is valid.
Transfer control and data signals to and from L2.
IB L1 uses serial data streams and supports SDR, DDR, QDR, FDR, EDR, HDR, etc. Current CX7 RNICs support single‑card single‑port NDR (400 Gbps).
L2 Link Layer and LID Addressing
L2 handles data‑frame transmission within an IB Subnet, providing flow control, virtual lanes (VL), and QoS.
Flow Control
IB L2 uses Credit‑Based Flow Control: before sending a packet, the sender and receiver negotiate a credit amount; the receiver must have enough buffer space before the sender transmits, preventing packet loss and eliminating TCP retransmission delays.
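A toy model of the credit mechanism (the buffer counts and class API are made up for illustration): the sender can only transmit while it holds credits advertised by the receiver, so the receiver can never be overrun and nothing is ever dropped.

```python
class CreditLink:
    """Toy model of IB credit-based flow control."""
    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers   # credits advertised by the receiver
        self.delivered = 0

    def send(self) -> bool:
        if self.credits == 0:
            return False                  # sender stalls; no packet is lost
        self.credits -= 1
        self.delivered += 1
        return True

    def receiver_consumes(self, n: int):
        self.credits += n                 # freed buffers are returned as credits

link = CreditLink(receiver_buffers=4)
sent = sum(link.send() for _ in range(6))   # only 4 of 6 sends go through
link.receiver_consumes(2)                   # receiver frees buffers -> new credits
sent += sum(link.send() for _ in range(2))  # the remaining 2 now succeed
```

The key contrast with TCP is that backpressure is exerted before transmission rather than recovered after a loss.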
QoS
QoS is achieved via VLs. Each physical link supports up to 15 standard VLs (VL0‑VL14) and one management VL (VL15). SL (Service Level) defines VL priority.
L2 Addressing
Hosts and switches in an IB L2 Subnet use a Local Route Header (LRH) containing a Local Identifier (LID) for two‑layer addressing. Each IB port has a unique LID assigned by the Subnet Manager.
LID structure: a 16‑bit identifier (0x0001‑0xFFFE) dynamically allocated by the SM; a port may hold multiple LIDs for multipath (controlled by the LID Mask Control, LMC), and reserved LID ranges are used for multicast.
LRH frame structure:
Destination LID (DLID)
Source LID (SLID)
Service Level (SL) – maps to a VL
Flow Control – credit information
No GRH for intra‑subnet communication
Other control fields such as frame type and CRC
Exchange addressing process:
SM assigns LIDs to each port.
SM maintains a LID forwarding table in switches (similar to a MAC table).
Sender builds a frame using SLID and DLID.
Switch looks up the LFT using DLID to determine the output port.
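The forwarding steps above amount to a table lookup; the LIDs and port numbers here are hypothetical.

```python
# The SM programs each switch's Linear Forwarding Table (LFT): DLID -> egress port.
lft = {0x0001: 1, 0x0002: 2, 0x0003: 2}

def forward(frame: dict) -> int:
    """A switch forwards purely on the DLID; the SLID is carried but not consulted."""
    return lft[frame["dlid"]]

frame = {"slid": 0x0001, "dlid": 0x0003, "sl": 0}
print(forward(frame))  # frame leaves on egress port 2
```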
View LID address:
$ ibdev2netdev
mlx5_0 port 1 => ibs3 (Up)
$ ibstat
CA 'mlx5_0'
type: MT4123
Number of ports: 1
Firmware version: 20.35.4030
Hardware version: 0
Node GUID: 0x946dae03005a928c
System image GUID: 0x946dae03005a928c
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 1 # LID
LMC: 0
SM lid: 1
Capability mask: 0xa651e84a
Port GUID: 0x946dae03005a928c
Link layer: InfiniBand
$ ibv_devinfo -d mlx5_0
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.35.4030
Number of ports: 1
Port 1:
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1 # LID
port_lmc: 0x00
link_layer: InfiniBand

L2 connectivity test: use the ibping tool.
# Server
# -S: run in server mode (run on both ends for bidirectional test)
$ ibping -S
# Client
$ ibping -c 10 ${DLID}

L3 Network Layer and GID Addressing
Three‑Layer Addressing
L3 manages routing across subnets using a Global Route Header (GRH) that carries a 128‑bit Global Identifier (GID), similar to IP addressing. GID types include unicast (identifies a single port) and multicast (identifies a group).
GID can be manually configured or auto‑generated; the LRH field LNH indicates whether GRH is present.
GID structure: 128‑bit address in IPv6 format, split into two parts:
High 64 bits: subnet prefix (similar to IPv6 prefix).
Low 64 bits: GUID (globally unique identifier) burned by the vendor, similar to a MAC address, unique per port.
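The prefix/GUID split can be demonstrated directly, reusing the link‑local GID that appears in the ibv_devinfo output further below.

```python
import ipaddress

def split_gid(gid: str):
    """Split a 128-bit GID (written in IPv6 notation) into the
    64-bit subnet prefix and the 64-bit port GUID."""
    v = int(ipaddress.IPv6Address(gid))
    return v >> 64, v & ((1 << 64) - 1)

prefix, guid = split_gid("fe80:0000:0000:0000:9803:9b03:00f3:e0a2")
print(hex(prefix))  # 0xfe80000000000000 (link-local subnet prefix)
print(hex(guid))    # 0x98039b0300f3e0a2 (vendor-burned port GUID)
```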
GRH packet structure:
Source GID (SGID)
Destination GID (DGID)
Hop Limit (similar to IP TTL)
Routing process:
Sender checks whether the destination is in the same subnet.
Initial forwarding within the source subnet uses LID.
IB Router looks up DGID in its routing table (similar to IPv6) to decide the next hop.
Within the destination subnet, LID addressing is used again.
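Step 1, the same‑subnet check, is just a comparison of the high 64 bits; the GID values below are hypothetical.

```python
def same_subnet(gid_a: int, gid_b: int) -> bool:
    """Equal subnet prefixes mean plain LID forwarding suffices;
    different prefixes mean the packet needs a GRH and an IB Router."""
    return (gid_a >> 64) == (gid_b >> 64)

a = 0xfe80000000000000_98039b0300f3e0a2   # local port
b = 0xfe80000000000000_98039b0300f3e0a3   # neighbor in the same subnet
c = 0x2001db8000000001_98039b0300f3e0a2   # port behind an IB Router
print(same_subnet(a, b), same_subnet(a, c))  # True False
```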
View GID address:
$ ibv_devinfo -v
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.28.2006
node_guid: 9803:9b03:00f3:e0a2
sys_image_guid: 9803:9b03:00f3:e0a2
vendor_id: 0x02c9
vendor_part_id: 4115
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 12
port_lid: 24 # LID
port_lmc: 0x00
link_layer: InfiniBand
GID[0]: fe80:0000:0000:0000:9803:9b03:00f3:e0a2 # link‑local GID
GID[1]: 2001:db8::1:9803:9b03:e0a2 # global GID

Three‑layer connectivity test: use the rping tool.
# Server
$ rping -s -a <server IP address> -v # rping uses RDMA CM, which resolves the address to a GID
server: ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWX
server: ping data: rdma-ping-1: ABCDEFGHIJKLMNOPQRSTUVWX
# Client
$ rping -c -a <server IP address> -v
client: connected to 2001:db8::2:9999
client: ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWX
client: ping data: rdma-ping-1: ABCDEFGHIJKLMNOPQRSTUVWX

IPoIB layer test: if IP over InfiniBand is configured, IPv6 addresses can be used directly.
ping6 -I ib0 fe80::9803:9b03:f3e0:a2 # link‑local GID
ping6 2001:db8::1:9803:9b03:f3e0:a2 # global GID

L4 Transport Layer
IB L4 supports multiple end‑to‑end transport modes such as RC, UC, UD, and RDMA operations including Read, Write, Send/Recv.
L4 uses a Base Transport Header (BTH, 12 bytes) for packet handling, segmentation, QP establishment, and multiplexing. Depending on the operation, an Extended Transport Header (e.g., RETH, AETH; 4‑28 bytes) may follow the BTH to carry additional transport services.
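To make the 12‑byte figure concrete, here is a simplified BTH packer. The real header splits several of these bytes into smaller bit fields (SE/M/PadCnt/TVer, AckReq), which this sketch collapses to zero.

```python
import struct

def pack_bth(opcode: int, pkey: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified Base Transport Header:
    opcode (1B), flags/version (1B), P_Key (2B),
    reserved + 24-bit dest QP (4B), AckReq/reserved + 24-bit PSN (4B)."""
    flags = 0                       # SE/M/PadCnt/TVer bits left clear here
    return struct.pack(">BBHII", opcode, flags, pkey,
                       dest_qp & 0xFFFFFF, psn & 0xFFFFFF)

bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12, psn=1)
print(len(bth))  # 12 -> matches the BTH size stated above
```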
L5 Application Layer
Application data is encapsulated in a payload (0‑4096 bytes). Applications can directly access remote memory via RDMA interfaces like the Verbs API.
IB Hardware – NVIDIA Quantum‑2 InfiniBand Platform
NVIDIA Quantum‑2 is a next‑generation 400 Gbps InfiniBand platform. Core hardware modules include:
NVIDIA Quantum‑2 Switch
NVIDIA InfiniBand Router
ConnectX‑7 RNIC
BlueField‑3 DPU
Quantum‑2’s key innovation is In‑Network Computing, aiming to compute data where it resides.
SHARP – Accelerated AI Aggregation Communication Offload
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a network offload technology for aggregation communication such as ML gradient aggregation and FL model aggregation.
In HPC and AI scenarios, many aggregation protocols traverse the global network, causing significant overhead and potential congestion. Software optimizations still leave aggregation latency an order of magnitude higher than point‑to‑point communication.
Mellanox introduced SHARP starting with EDR InfiniBand switches, integrating a compute engine that supports 16‑, 32‑, and 64‑bit fixed‑point or floating‑point operations, offering sum, min, max, and logical operations, as well as Barrier, Reduce, and All‑Reduce.
SHARPv1: on EDR InfiniBand, up to 256 B aggregation offload.
SHARPv2: on HDR InfiniBand, up to 2 GB aggregation offload.
SHARPv3: on NDR InfiniBand, up to 64 GB aggregation offload.
SHARP enables each port in an IB switch to host an RDMA engine that receives packets, reconstructs data, and accelerates applications—most notably MPI aggregation operations.
In multi‑switch clusters, Mellanox defines a SHARP tree: an Aggregation Manager builds a logical SHARP tree over the physical topology. Hosts submit data to their connected switches; each switch aggregates data using its compute engine and forwards results up the tree, with the root switch performing the final reduction and distributing the result back to all hosts.
First‑level switch receives data, computes, and forwards to the next level.
Higher‑level switches aggregate incoming results and continue upward.
Root switch completes the final reduction and returns the result to all hosts.
This approach dramatically reduces aggregation latency, mitigates network congestion, and improves cluster scalability.
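The tree aggregation can be mimicked in a few lines; the two‑level topology and host counts are hypothetical, and the switch compute engines are reduced to a plain sum.

```python
def tree_allreduce(host_values, hosts_per_leaf=2):
    """SHARP-style reduction: leaf switches sum their attached hosts,
    the root switch sums the partial results, and the final value
    is fanned back out to every host."""
    leaf_sums = [sum(host_values[i:i + hosts_per_leaf])
                 for i in range(0, len(host_values), hosts_per_leaf)]
    total = sum(leaf_sums)                # final reduction at the root switch
    return [total] * len(host_values)     # distribute the result to all hosts

print(tree_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

Each host injects its value once and receives the result once, instead of exchanging data with every peer as in a flat all‑reduce.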
IB Software Stack
Verbs API
To exploit InfiniBand performance, applications need a complete software stack; its core programming interface is the Verbs API.
RDMAC and IBTA define RDMA transmission characteristics, while the Open Fabric Alliance (OFA) defines the Verbs interfaces and data structures. OFA also developed the OpenFabrics Enterprise Distribution (OFED) stack, supporting multiple RDMA transport protocols.
Verbs API software stack:
Application layer:
Native RDMA applications using the Verbs API directly.
Legacy applications via an Upper Layer Protocol (ULP) compatibility layer.
ULP layer: OFED libraries providing RDMA support for various protocols, enabling seamless migration to RDMA.
Verbs API layer: RNIC driver API encapsulation handling channel management, memory management, queue management, and data access.
RNIC driver layer: Configures RNIC hardware, manages queues and memory, and processes work requests.
OFED Kernel Modules
OFED appears as a kernel‑mode driver providing channel‑oriented RDMA send/receive operations, kernel bypass, and programming APIs for MPI in both kernel and user space.
Reference documents: MLNX_OFED official documentation: https://docs.nvidia.com/networking/display/ofedv522200/introduction; additional PDF: https://format.com.pl/site/wp-content/uploads/2015/09/pb_ofed.pdf
ULPs support the following legacy application types:
Block storage: SRP, iSER
AI: MPI
RDMA‑based: uDAPL
Socket: RDS, SDP, IPoIB
With OFED in the Linux ecosystem, applications using these ULP libraries can migrate directly from TCP to RDMA networks.
UFM Management Platform
The Quantum‑IB platform includes many Switch/Router devices that require a management platform: UFM.
UFM provides device registration, configuration, monitoring, and alerting.