Intel Omni-Path Architecture and InfiniBand: Protocol Layers, Topology, and RDMA Overview
This article explains Intel Omni-Path Architecture and InfiniBand, describing their protocol stack, network topology, RDMA technology, and related protocols such as RoCE and iWARP, while also comparing their design goals and performance characteristics.
Intel Omni-Path Architecture (OPA) is a network architecture similar to InfiniBand. InfiniBand itself was designed to avoid several shortcomings of the PCI bus:
Because PCI uses a bus‑based shared transmission mode, only one PCI device can transmit at a time; other devices must wait.
When the bus frequency increases (33 MHz → 66 MHz → 133 MHz for PCI‑X), signal interference grows and routing multiple buses on a motherboard becomes increasingly difficult.
PCI devices use memory‑mapped I/O, reserving 50–100 MB per device in system memory; many hot‑plug PCI slots therefore waste a large amount of RAM.
The PCI bus has buffers but no error correction; any data loss or corruption raises a non‑maskable interrupt (NMI) to the OS.
InfiniBand Protocol Layers and Network Structure
The InfiniBand protocol uses a layered architecture where each layer is independent and provides services to the layer above it. The physical layer defines how bits are turned into symbols, frames, and packets. The link layer defines packet format, flow control, routing, encoding/decoding, etc. The network layer adds a 40‑byte Global Route Header (GRH) for routing and forwarding.
During forwarding, routers recompute only the Variant CRC (VCRC); the Invariant CRC (ICRC) covers the fields that do not change in transit and is what guarantees end‑to‑end data integrity. The transport layer delivers packets to a specific Queue Pair (QP) and handles segmentation/reassembly when the payload exceeds the MTU.
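The segmentation and reassembly step can be illustrated with a minimal Python sketch. The function names and the 4096-byte default MTU are assumptions chosen for illustration; a real HCA performs this in hardware and is considerably more involved.

```python
def segment(message: bytes, mtu: int = 4096) -> list[tuple[int, bytes]]:
    """Split a message into (sequence number, payload) packets no larger than mtu."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    return [(seq, message[off:off + mtu])
            for seq, off in enumerate(range(0, len(message), mtu))]


def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the message by ordering packets on their sequence number."""
    return b"".join(payload for _, payload in sorted(packets))
```

A 10,240-byte message with a 4096-byte MTU yields three packets, and reassemble restores the original bytes even if the packets are handed over out of order.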
InfiniBand Network Layer Architecture
Physical layer: defines electrical/mechanical characteristics, including fiber and copper cables, connectors, backplane, and thermal properties.
Link layer: describes packet format and operations such as flow control and intra‑subnet routing; it includes management and data packets.
Network layer: provides inter‑subnet packet forwarding similar to IP, but intra‑subnet traffic does not involve this layer.
Transport layer: responsible for message distribution, multiplexing, basic transport services, and segmentation/reassembly when the payload exceeds the MTU.
Upper‑layer network protocols: InfiniBand offers several upper‑layer protocols (SDP, SRP, iSER, RDS, IPoIB, uDAPL) via a Verbs interface for RDMA programming.
InfiniBand Network Topology
The topology consists of four main components:
Host Channel Adapter (HCA) – bridges the memory controller and the Target Channel Adapter.
Target Channel Adapter (TCA) – packages digital signals from I/O devices (e.g., NIC, SCSI controller) and sends them to the HCA.
InfiniBand link – the copper or fiber connection between HCA and TCA; manufacturers may provide links that are 1, 4, or 12 lanes wide (1x, 4x, 12x).
Switches and routers.
Both HCA and TCA are essentially programmable DMA engines with protection features.
Omni‑Path Evolution
Omni‑Path inherits the True Scale product line (acquired from QLogic) and raises the physical‑layer speed from 40 Gb/s to 100 Gb/s. It follows the open‑source OFED stack and exposes its API.
Intel integrates Omni‑Path functions into its CPUs, improving communication efficiency but tying the network to the CPU architecture.
After acquiring Cray’s interconnect assets, Intel introduced a “link transport layer” (Layer 1.5) based on Cray’s Aries technology, providing reliable Layer 2 packet delivery, flow control, and link‑level control.
OPA Components
HFI – Host Fabric Interface, provides fabric connectivity for host, service, and management nodes.
Switches – enable arbitrary large‑scale topologies.
Fabric Manager – central provisioning and monitoring of fabric resources.
Design Goals Compared with InfiniBand
CPU/Fabric integration to reduce cost and power consumption and to increase density.
Host‑side optimizations for high‑speed MPI messaging, low latency, and high scalability.
Enhanced Fabric Architecture delivering ultra‑low end‑to‑end latency, efficient error correction, QoS, and massive scalability.
RDMA Technology
RDMA moves the network and transport layers into hardware (the NIC), allowing packets to be parsed and delivered directly to the application without CPU involvement.
Remote Direct Memory Access (RDMA) enables zero‑copy data transfer, eliminating OS‑level memory copies and freeing CPU cycles.
RDMA originally existed only on InfiniBand, but has been extended to Ethernet via RoCE (RDMA over Converged Ethernet) and iWARP (RDMA over TCP/IP), making it widely usable in modern data‑center networks.
Typical RDMA workflow for an NVMe write:
The NVMe driver creates a command capsule and places it on the host RDMA NIC’s send queue.
The capsule is transmitted over the network.
The remote RDMA NIC receives the capsule, extracts the NVMe command, and writes the data into the remote host's memory.
The remote host processes the command and sends a completion capsule back via RDMA.
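The four steps above can be sketched as a toy Python simulation. Capsule, Target, and nvme_write are hypothetical names introduced only for illustration; no real NVMe or RDMA library is involved, and "transmission" is simply a function call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Capsule:
    kind: str          # "command" or "completion"
    command_id: int
    data: bytes = b""


class Target:
    """Hypothetical remote host: its NIC places the data, then the host completes the command."""

    def __init__(self) -> None:
        self.memory: dict[int, bytes] = {}

    def receive(self, capsule: Capsule) -> Capsule:
        # Step 3: the NIC extracts the command and writes the data into remote host memory.
        self.memory[capsule.command_id] = capsule.data
        # Step 4: the host processes the command and returns a completion capsule.
        return Capsule("completion", capsule.command_id)


def nvme_write(send_queue: deque, target: Target, command_id: int, data: bytes) -> Capsule:
    # Step 1: the driver builds a command capsule and places it on the send queue.
    send_queue.append(Capsule("command", command_id, data))
    # Step 2: the capsule is "transmitted" (here: dequeued and handed to the target).
    return target.receive(send_queue.popleft())
```

Calling nvme_write(deque(), Target(), 7, b"block") returns a completion capsule carrying the same command identifier, and the target's memory holds the written data.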
Two Ethernet‑Based RDMA Protocols
RoCE v2: Ethernet → IP → UDP → RoCE.
iWARP: Ethernet → IP → TCP → iWARP (the MPA, DDP, and RDMA layers); the TCP stack, including its flow and congestion control, is offloaded to the RNIC.
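As a small illustration of the RoCE v2 stack, the UDP layer identifies RoCE v2 traffic by the IANA-assigned destination port 4791. The sketch below builds only that 8-byte UDP header; the helper name udp_header is an assumption, and leaving the checksum at zero is a simplification for this example.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2


def udp_header(src_port: int, payload_len: int) -> bytes:
    """Build an 8-byte UDP header for a RoCE v2 packet (checksum left at 0)."""
    # Fields: source port, destination port, total length (header + payload), checksum.
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, 8 + payload_len, 0)
```

Because RoCE v2 rides on ordinary UDP/IP, such packets can cross IP routers, which is what distinguishes it from the original, Ethernet-only RoCE.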
InfiniBand vs. RoCE vs. iWARP
InfiniBand uses cut‑through forwarding and credit‑based flow control for zero‑loss, low‑latency transmission. RoCE achieves comparable performance when Data Center Bridging (DCB) is enabled, though its latency is slightly higher. iWARP relies on standard IP networks; packet loss severely degrades performance, limiting its practical use.
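Credit-based flow control can be pictured with a toy model: the sender transmits only while it holds credits, each of which corresponds to a free receive buffer, so the receiver is never overrun and nothing is dropped. CreditLink and its methods are hypothetical names for this sketch, not part of any InfiniBand API.

```python
from collections import deque


class CreditLink:
    """Toy link: the sender may transmit only while it holds credits from the receiver."""

    def __init__(self, receiver_buffers: int) -> None:
        self.credits = receiver_buffers   # one credit per free receive buffer
        self.rx_buffer = deque()

    def send(self, packet: bytes) -> bool:
        """Transmit if a credit is available; otherwise refuse so the sender can retry later."""
        if self.credits == 0:
            return False                  # sender waits instead of overrunning the receiver
        self.credits -= 1
        self.rx_buffer.append(packet)
        return True

    def consume(self) -> bytes:
        """Receiver drains one buffer and returns a credit to the sender."""
        packet = self.rx_buffer.popleft()
        self.credits += 1
        return packet
```

With two receive buffers, a third send is refused until the receiver consumes a packet; the packet is delayed rather than lost, which is the property the zero-loss claim rests on.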
RDMA over TCP (iWARP) Protocol Stack and Operation
Devices equipped with an RNIC (RDMA‑aware NIC) offload all data‑transfer operations; the host CPU is not involved in moving data.
RDMA defines four operations: Send, Write, Read, and Terminate. Each operation except Read generates exactly one RDMA message; a Read generates two (a Read Request and a Read Response).
The top three layers of the iWARP protocol stack (RDMA, DDP, and MPA) ensure high‑speed interoperability over Ethernet.
The RDMA layer converts RDMA reads/writes into RDMA messages and forwards them to the Direct Data Placement (DDP) layer.
DDP splits long RDMA messages into DDP segments and hands them to the Marker PDU Aligned Framing (MPA) layer.
MPA adds backward‑pointing markers, a length field, and a CRC to each DDP segment.
TCP schedules the resulting TCP segments for reliable delivery.
IP adds routing information.
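The DDP and MPA steps above can be sketched in Python under some loud assumptions: the function names are hypothetical, the periodic MPA markers are omitted entirely, and zlib.crc32 (plain CRC-32) stands in for the CRC32c that MPA actually specifies. The frame layout used here is a 2-byte length, the segment, padding to a 4-byte boundary, and a 4-byte CRC.

```python
import struct
import zlib


def ddp_segment(message: bytes, max_segment: int) -> list[bytes]:
    """Split one RDMA message into DDP segments of at most max_segment bytes."""
    return [message[i:i + max_segment] for i in range(0, len(message), max_segment)]


def mpa_frame(segment: bytes) -> bytes:
    """Wrap a DDP segment as: 2-byte length | segment | pad to 4 bytes | 4-byte CRC."""
    header = struct.pack("!H", len(segment))
    pad = b"\x00" * (-(2 + len(segment)) % 4)
    body = header + segment + pad
    # Assumption: CRC-32 from zlib replaces the CRC32c used by real MPA.
    return body + struct.pack("!I", zlib.crc32(body))


def mpa_unframe(frame: bytes) -> bytes:
    """Verify the CRC and return the DDP segment carried in the frame."""
    body, (crc,) = frame[:-4], struct.unpack("!I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    (length,) = struct.unpack("!H", body[:2])
    return body[2:2 + length]
```

Framing every segment and then unframing them in order recovers the original message, while a corrupted byte causes mpa_unframe to raise ValueError.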
The software transport interface between the application and RNIC is called Verbs (RDMA API). There are two types of Verbs:
Memory Verbs (One‑Sided RDMA): Reads, Writes, Atomics – no remote CPU involvement.
Messaging Verbs (Two‑Sided RDMA): Send, Receive – requires remote CPU participation.
InfiniBand, RoCE, and iWARP all share the same Verbs API.
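The difference between one-sided and two-sided verbs can be shown with a toy model. This is not a binding to any real Verbs library: Node, rdma_write, rdma_read, and send are hypothetical names, and memory registration keys, bounds checks, and completion queues are all left out.

```python
class Node:
    """Toy endpoint with registered memory and a receive queue (not a real Verbs binding)."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)   # stands in for a registered memory region
        self.recv_queue = []            # offsets of buffers posted by the application
        self.cpu_events = 0             # counts how often this node's CPU had to act

    def post_recv(self, offset: int) -> None:
        """Two-sided: the remote application must post a buffer in advance."""
        self.cpu_events += 1
        self.recv_queue.append(offset)


def rdma_write(remote: Node, offset: int, data: bytes) -> None:
    """One-sided: the initiator names the remote address; the remote CPU is not involved."""
    remote.memory[offset:offset + len(data)] = data


def rdma_read(remote: Node, offset: int, length: int) -> bytes:
    """One-sided: fetch remote memory without remote CPU participation."""
    return bytes(remote.memory[offset:offset + length])


def send(remote: Node, data: bytes) -> None:
    """Two-sided: data lands in whichever buffer the receiver posted."""
    if not remote.recv_queue:
        raise RuntimeError("receiver not ready: no receive posted")
    offset = remote.recv_queue.pop(0)
    remote.memory[offset:offset + len(data)] = data
```

In this model rdma_write and rdma_read leave the remote node's cpu_events counter untouched, whereas send fails unless the remote side has already called post_recv.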
Upper‑Layer Protocols
InfiniBand supports several upper‑layer protocols:
SDP (Sockets Direct Protocol) – runs TCP/IP applications over InfiniBand.
SRP (SCSI RDMA Protocol) – transports SCSI commands via RDMA.
iSER – iSCSI Extensions for RDMA, standardized by the IETF.
RDS – Reliable Datagram Sockets, a datagram interface similar to UDP but with reliable delivery, for InfiniBand.
IPoIB – IP‑over‑InfiniBand, provides IP compatibility on InfiniBand networks.
uDAPL – User Direct Access Programming Library, a standard API for RDMA‑enabled interconnects.
IPoIB Details
IPoIB creates an IP‑like layer on top of the InfiniBand RDMA network, allowing unmodified IP applications to run with higher bandwidth.
Because iWARP and RoCE/IBoE already provide IP‑based RDMA, IPoIB devices are not created on those platforms.
Limitations of IPoIB include: only IP‑based applications are supported; the hardware (link‑layer) address is 20 bytes rather than Ethernet's 6 and is not user‑configurable; and VLAN configuration requires knowledge of the corresponding p_key.
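The 20-byte hardware address is built from a 1-byte flags field, a 3-byte queue pair number (QPN), and the 16-byte port GID, which is why it cannot be set freely like an Ethernet MAC. A minimal parsing sketch follows; the function name parse_ipoib_hwaddr is an assumption, and any address passed to it in examples is made up.

```python
import ipaddress


def parse_ipoib_hwaddr(text: str) -> dict:
    """Split a 20-byte IPoIB hardware address into flags, QPN, and port GID."""
    raw = bytes.fromhex(text.replace(":", ""))
    if len(raw) != 20:
        raise ValueError("IPoIB hardware address must be 20 bytes")
    return {
        "flags": raw[0],
        "qpn": int.from_bytes(raw[1:4], "big"),
        "gid": str(ipaddress.IPv6Address(raw[4:])),  # GIDs share the IPv6 address format
    }
```

The input is the colon-separated hex string that tools typically display for an IPoIB interface's link-layer address.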
Architects' Tech Alliance