Understanding InfiniBand RDMA: Architecture, Advantages, and NVIDIA Quantum-2
InfiniBand RDMA, designed to network server buses, offers high bandwidth and ultra‑low latency through zero‑copy, kernel‑bypass communication, with a layered architecture (L1‑L5) and hardware components like Quantum‑2 Switch, ConnectX‑7 RNIC, and SHARP acceleration, supported by the Verbs API and OFED stack.
InfiniBand RDMA
InfiniBand was created to network server buses, inheriting the high bandwidth and low latency of bus technology. The DMA technique used in buses is realized as RDMA (Remote Direct Memory Access) in InfiniBand.
InfiniBand is a network designed specifically for RDMA; devices based on InfiniBand typically implement RDMA and guarantee reliable transmission at the hardware level. Many TOP500 supercomputers use the InfiniBand Architecture (IBA). The earliest vendors were IBM and HP; today the technology is driven primarily by NVIDIA (through its Mellanox acquisition). InfiniBand requires proprietary hardware from L2 to L4, making it relatively costly.
The native RDMA specification was published by the InfiniBand Trade Association (IBTA) in 2000; it requires NICs and switches that support the technology.
The main features of RDMA are zero‑copy and OS bypass: data moves directly between external devices and application memory without CPU intervention or context switches. Combined with RDMA, InfiniBand enables an application‑centric communication model rather than the node‑centric model of TCP/IP.
Data transfer is handled entirely by InfiniBand devices, bypassing the operating system on network nodes (Kernel Bypass), which improves both transfer efficiency and CPU utilization.
RDMA Advantages
Large Bandwidth
Compared with TCP, in a 100 Gbps RDMA scenario CPU utilization drops from 100 % to around 10 %, so the CPU is no longer the bandwidth bottleneck; the NIC hardware becomes the limiting factor.
In TCP, a common rule of thumb is that packet processing costs about 1 Hz of CPU per 1 bit/s of network I/O (1 MHz per Mbps), so a 100 Gbps link would saturate roughly 40 cores running at 2.5 GHz.
In RDMA, the CPU no longer handles packet interrupt processing, reducing latency and saving CPU cycles.
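The rule of thumb above is easy to sanity-check. This sketch assumes the commonly quoted approximation of ~1 Hz of CPU per bit/s of kernel TCP I/O; the cost factor is illustrative, not a measurement.

```python
def cores_needed(link_bps: float, core_hz: float, hz_per_bps: float = 1.0) -> float:
    """Estimate CPU cores consumed by kernel TCP processing,
    using a rule-of-thumb cost of hz_per_bps cycles per bit/s."""
    return link_bps * hz_per_bps / core_hz

# A 100 Gbps link against 2.5 GHz cores:
print(cores_needed(100e9, 2.5e9))  # 40.0 cores consumed by the stack alone
```

RDMA removes this cost entirely by letting the RNIC move data without per-packet CPU involvement.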
Low Latency
Compared with TCP, network latency drops from the millisecond level to below 10 µs.
In TCP, each packet traverses the kernel stack, causing multiple memory copies, interrupt handling, and context switches that add tens of microseconds of fixed latency.
In RDMA, the application drives the RNIC directly through the Verbs API without a system‑call transition to kernel mode, eliminating kernel overhead. Packet headers are processed on the RNIC itself, which, combined with zero‑copy, cuts latency significantly.
IB Architecture
InfiniBand Architecture (IBA) consists of the following components:
Processor Node: CPU, GPU compute nodes.
Storage Node.
HCA (Host Channel Adapter): RNIC card in compute nodes that supports the IB RDMA protocol and connects to an IB Subnet.
TCA (Target Channel Adapter).
IB Switch: supports IB L2 connections.
IB Router: supports IB L3 connections.
IB Subnet: a large IBA network is divided into multiple subnets, each supporting up to 65 536 nodes.
IB Subnet Manager: management platform that configures switches/routers and partitions subnets.
IB Protocol Stack
InfiniBand protocol stack includes L1 Physical, L2 Link, L3 Network, L4 Transport, and L5 Application layers.
L1 Physical Layer
L1 defines electrical/optical signal characteristics and physical connections (cables, connectors). Its main functions are to:
Establish the physical connection.
Monitor link status and notify L2 when the link is valid.
Transfer control and data signals to and from L2.
IB L1 uses serial data streams and supports SDR, DDR, QDR, FDR, EDR, HDR, etc. Current CX7 RNICs support single‑card single‑port NDR (400 Gbps).
L2 Link Layer and LID Addressing
L2 handles data‑frame transmission within an IB Subnet, providing flow control, virtual lanes (VL), and QoS.
Flow Control
IB L2 uses Credit‑Based Flow Control: before sending a packet, the sender and receiver negotiate a credit amount; the receiver must have enough buffer space before the sender transmits, preventing packet loss and eliminating TCP retransmission delays.
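A toy model of the credit mechanism (the buffer counts and class API are made up for illustration): the sender can only transmit while it holds credits advertised by the receiver, so the receiver can never be overrun and nothing is ever dropped.

```python
class CreditLink:
    """Toy model of IB credit-based flow control."""
    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers   # credits advertised by the receiver
        self.delivered = 0

    def send(self) -> bool:
        if self.credits == 0:
            return False                  # sender stalls; no packet is lost
        self.credits -= 1
        self.delivered += 1
        return True

    def receiver_consumes(self, n: int):
        self.credits += n                 # freed buffers are returned as credits

link = CreditLink(receiver_buffers=4)
sent = sum(link.send() for _ in range(6))   # only 4 of 6 sends go through
link.receiver_consumes(2)                   # receiver frees buffers -> new credits
sent += sum(link.send() for _ in range(2))  # the remaining 2 now succeed
```

The key contrast with TCP is that backpressure is exerted before transmission rather than recovered after a loss.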
QoS
QoS is achieved via VLs. Each physical link supports up to 15 standard VLs (VL0‑VL14) and one management VL (VL15). SL (Service Level) defines VL priority.
L2 Addressing
Hosts and switches in an IB L2 Subnet use a Local Route Header (LRH) containing a Local Identifier (LID) for two‑layer addressing. Each IB port has a unique LID assigned by the Subnet Manager.
LID structure: a 16‑bit identifier (0x0001‑0xFFFE) dynamically allocated by the SM; a port may hold multiple LIDs for multipath (controlled by the LID Mask Control, LMC), and reserved LID ranges are used for multicast.
LRH frame structure:
Destination LID (DLID)
Source LID (SLID)
Service Level (SL) – maps to a VL
Flow Control – credit information
No GRH for intra‑subnet communication
Other control fields such as frame type and CRC
Exchange addressing process:
SM assigns LIDs to each port.
SM maintains a LID forwarding table in switches (similar to a MAC table).
Sender builds a frame using SLID and DLID.
Switch looks up the LFT using DLID to determine the output port.
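The forwarding steps above amount to a table lookup; the LIDs and port numbers here are hypothetical.

```python
# The SM programs each switch's Linear Forwarding Table (LFT): DLID -> egress port.
lft = {0x0001: 1, 0x0002: 2, 0x0003: 2}

def forward(frame: dict) -> int:
    """A switch forwards purely on the DLID; the SLID is carried but not consulted."""
    return lft[frame["dlid"]]

frame = {"slid": 0x0001, "dlid": 0x0003, "sl": 0}
print(forward(frame))  # frame leaves on egress port 2
```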
View LID address:
$ ibdev2netdev
mlx5_0 port 1 => ibs3 (Up)
$ ibstat
CA 'mlx5_0'
type: MT4123
Number of ports: 1
Firmware version: 20.35.4030
Hardware version: 0
Node GUID: 0x946dae03005a928c
System image GUID: 0x946dae03005a928c
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 1 # LID
LMC: 0
SM lid: 1
Capability mask: 0xa651e84a
Port GUID: 0x946dae03005a928c
Link layer: InfiniBand
$ ibv_devinfo -d mlx5_0
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 20.35.4030
Number of ports: 1
Port 1:
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 1 # LID
port_lmc: 0x00
link_layer: InfiniBand

L2 connectivity test: use the ibping tool.
# Server
# -S: run in server mode (run on both ends for bidirectional test)
$ ibping -S
# Client
$ ibping -c 10 ${DLID}

L3 Network Layer and GID Addressing
Three‑Layer Addressing
L3 manages routing across subnets using a Global Route Header (GRH) that carries a 128‑bit Global Identifier (GID), similar to IP addressing. GID types include unicast (identifies a single port) and multicast (identifies a group).
GID can be manually configured or auto‑generated; the LRH field LNH indicates whether GRH is present.
GID structure: 128‑bit address in IPv6 format, split into two parts:
High 64 bits: subnet prefix (similar to IPv6 prefix).
Low 64 bits: GUID (globally unique identifier) burned by the vendor, similar to a MAC address, unique per port.
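The prefix/GUID split can be demonstrated directly, reusing the link‑local GID that appears in the ibv_devinfo output further below.

```python
import ipaddress

def split_gid(gid: str):
    """Split a 128-bit GID (written in IPv6 notation) into the
    64-bit subnet prefix and the 64-bit port GUID."""
    v = int(ipaddress.IPv6Address(gid))
    return v >> 64, v & ((1 << 64) - 1)

prefix, guid = split_gid("fe80:0000:0000:0000:9803:9b03:00f3:e0a2")
print(hex(prefix))  # 0xfe80000000000000 (link-local subnet prefix)
print(hex(guid))    # 0x98039b0300f3e0a2 (vendor-burned port GUID)
```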
GRH packet structure:
Source GID (SGID)
Destination GID (DGID)
Hop Limit (similar to IP TTL)
Routing process:
Sender checks whether the destination is in the same subnet.
Initial forwarding within the source subnet uses LID.
IB Router looks up DGID in its routing table (similar to IPv6) to decide the next hop.
Within the destination subnet, LID addressing is used again.
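Step 1, the same‑subnet check, is just a comparison of the high 64 bits; the GID values below are hypothetical.

```python
def same_subnet(gid_a: int, gid_b: int) -> bool:
    """Equal subnet prefixes mean plain LID forwarding suffices;
    different prefixes mean the packet needs a GRH and an IB Router."""
    return (gid_a >> 64) == (gid_b >> 64)

a = 0xfe80000000000000_98039b0300f3e0a2   # local port
b = 0xfe80000000000000_98039b0300f3e0a3   # neighbor in the same subnet
c = 0x2001db8000000001_98039b0300f3e0a2   # port behind an IB Router
print(same_subnet(a, b), same_subnet(a, c))  # True False
```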
View GID address:
$ ibv_devinfo -v
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.28.2006
node_guid: 9803:9b03:00f3:e0a2
sys_image_guid: 9803:9b03:00f3:e0a2
vendor_id: 0x02c9
vendor_part_id: 4115
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 12
port_lid: 24 # LID
port_lmc: 0x00
link_layer: InfiniBand
GID[0]: fe80:0000:0000:0000:9803:9b03:00f3:e0a2 # link‑local GID
GID[1]: 2001:db8::1:9803:9b03:e0a2 # global GID

Three‑layer connectivity test: use the rping tool.
# Server
$ rping -s -a <server IP address> -v # rping uses RDMA CM, which resolves the address to a GID
server: ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWX
server: ping data: rdma-ping-1: ABCDEFGHIJKLMNOPQRSTUVWX
# Client
$ rping -c -a <server IP address> -v
client: connected to 2001:db8::2:9999
client: ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWX
client: ping data: rdma-ping-1: ABCDEFGHIJKLMNOPQRSTUVWX

IPoIB layer test: if IP over InfiniBand is configured, IPv6 addresses can be used directly.
ping6 -I ib0 fe80::9803:9b03:f3e0:a2 # link‑local GID
ping6 2001:db8::1:9803:9b03:f3e0:a2 # global GID

L4 Transport Layer
IB L4 supports multiple end‑to‑end transport modes such as RC, UC, UD, and RDMA operations including Read, Write, Send/Recv.
L4 uses a Base Transport Header (BTH, 12 bytes) for packet handling, segmentation, QP establishment, and multiplexing. Depending on the operation, an Extended Transport Header (e.g., RETH, AETH; 4‑28 bytes) may follow the BTH to carry additional transport services.
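To make the 12‑byte figure concrete, here is a simplified BTH packer. The real header splits several of these bytes into smaller bit fields (SE/M/PadCnt/TVer, AckReq), which this sketch collapses to zero.

```python
import struct

def pack_bth(opcode: int, pkey: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified Base Transport Header:
    opcode (1B), flags/version (1B), P_Key (2B),
    reserved + 24-bit dest QP (4B), AckReq/reserved + 24-bit PSN (4B)."""
    flags = 0                       # SE/M/PadCnt/TVer bits left clear here
    return struct.pack(">BBHII", opcode, flags, pkey,
                       dest_qp & 0xFFFFFF, psn & 0xFFFFFF)

bth = pack_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12, psn=1)
print(len(bth))  # 12 -> matches the BTH size stated above
```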
L5 Application Layer
Application data is encapsulated in a payload (0‑4096 bytes). Applications can directly access remote memory via RDMA interfaces like the Verbs API.
IB Hardware – NVIDIA Quantum‑2 InfiniBand Platform
NVIDIA Quantum‑2 is a next‑generation 400 Gbps InfiniBand platform. Core hardware modules include:
NVIDIA Quantum‑2 Switch
NVIDIA InfiniBand Router
ConnectX‑7 RNIC
BlueField‑3 DPU
Quantum‑2’s key innovation is In‑Network Computing, aiming to compute data where it resides.
SHARP – Accelerated AI Aggregation Communication Offload
SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) is a network offload technology for aggregation communication such as ML gradient aggregation and FL model aggregation.
In HPC and AI scenarios, many aggregation protocols traverse the global network, causing significant overhead and potential congestion. Software optimizations still leave aggregation latency an order of magnitude higher than point‑to‑point communication.
Mellanox introduced SHARP starting with EDR InfiniBand switches, integrating a compute engine that supports 16‑, 32‑, and 64‑bit fixed‑point or floating‑point operations, offering sum, min, max, and logical operations, as well as Barrier, Reduce, and All‑Reduce.
SHARPv1: on EDR InfiniBand, up to 256 B aggregation offload.
SHARPv2: on HDR InfiniBand, up to 2 GB aggregation offload.
SHARPv3: on NDR InfiniBand, up to 64 GB aggregation offload.
SHARP enables each port in an IB switch to host an RDMA engine that receives packets, reconstructs data, and accelerates applications—most notably MPI aggregation operations.
In multi‑switch clusters, Mellanox defines a SHARP tree: an Aggregation Manager builds a logical SHARP tree over the physical topology. Hosts submit data to their connected switches; each switch aggregates data using its compute engine and forwards results up the tree, with the root switch performing the final reduction and distributing the result back to all hosts.
First‑level switch receives data, computes, and forwards to the next level.
Higher‑level switches aggregate incoming results and continue upward.
Root switch completes the final reduction and returns the result to all hosts.
This approach dramatically reduces aggregation latency, mitigates network congestion, and improves cluster scalability.
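The tree aggregation can be mimicked in a few lines; the two‑level topology and host counts are hypothetical, and the switch compute engines are reduced to a plain sum.

```python
def tree_allreduce(host_values, hosts_per_leaf=2):
    """SHARP-style reduction: leaf switches sum their attached hosts,
    the root switch sums the partial results, and the final value
    is fanned back out to every host."""
    leaf_sums = [sum(host_values[i:i + hosts_per_leaf])
                 for i in range(0, len(host_values), hosts_per_leaf)]
    total = sum(leaf_sums)                # final reduction at the root switch
    return [total] * len(host_values)     # distribute the result to all hosts

print(tree_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

Each host injects its value once and receives the result once, instead of exchanging data with every peer as in a flat all‑reduce.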
IB Software Stack
Verbs API
To exploit InfiniBand performance, applications need a complete software stack; its core programming interface is the Verbs API.
RDMAC and IBTA define RDMA transmission characteristics, while the Open Fabric Alliance (OFA) defines the Verbs interfaces and data structures. OFA also developed the OpenFabrics Enterprise Distribution (OFED) stack, supporting multiple RDMA transport protocols.
Verbs API software stack:
Application layer:
Native RDMA applications using the Verbs API directly.
Legacy applications via an Upper Layer Protocol (ULP) compatibility layer.
ULP layer: OFED libraries providing RDMA support for various protocols, enabling seamless migration to RDMA.
Verbs API layer: RNIC driver API encapsulation handling channel management, memory management, queue management, and data access.
RNIC driver layer: Configures RNIC hardware, manages queues and memory, and processes work requests.
OFED Kernel Modules
OFED appears as a kernel‑mode driver providing channel‑oriented RDMA send/receive operations, kernel bypass, and programming APIs for MPI in both kernel and user space.
Reference documents: MLNX_OFED official documentation: https://docs.nvidia.com/networking/display/ofedv522200/introduction; additional PDF: https://format.com.pl/site/wp-content/uploads/2015/09/pb_ofed.pdf
ULPs support the following legacy application types:
Block storage: SRP, iSER
AI: MPI
RDMA‑based: uDAPL
Socket: RDS, SDP, IPoIB
With OFED in the Linux ecosystem, applications using these ULP libraries can migrate directly from TCP to RDMA networks.
UFM Management Platform
The Quantum‑IB platform includes many Switch/Router devices that require a management platform: UFM.
UFM provides device registration, configuration, monitoring, and alerting.