
Mellanox InfiniBand Technology Overview: Architecture, Protocol Stack, and Product Portfolio

This article provides a comprehensive overview of Mellanox's InfiniBand solutions, covering the company's background, network architecture, routing algorithms, Fat‑Tree topology, the OFED software stack, management tools, MPI support, adapters, switches, routers, cables, and related products for high‑performance computing and cloud data centers.


Mellanox was founded in 1999 with headquarters in California, USA, and Israel, and is a leading supplier of end‑to‑end InfiniBand solutions for servers and storage. By the end of 2010, Mellanox completed the acquisition of Voltaire, a well‑known InfiniBand switch vendor, expanding its capabilities in HPC, cloud computing, data centers, enterprise computing, and storage markets.

Another major InfiniBand vendor is Intel, which invested US$125 million to acquire QLogic's InfiniBand switch and adapter product lines to strengthen its presence in high‑performance computing, but this article focuses on Mellanox's products, technologies, and trends.

IB Network and Topology Composition

InfiniBand replaces a shared bus with a channel-based serial architecture, separating the I/O subsystem from CPU and memory. Systems and nodes attach to the fabric through channel adapters: hosts via HCAs (Host Channel Adapters) and I/O targets via TCAs (Target Channel Adapters), with InfiniBand switches and routers interconnecting them to meet growing bandwidth demands.

InfiniBand is also a layered protocol (similar to TCP/IP), where each layer provides distinct functions and serves the layers above it. The protocol suite supports multicast, partitioning, IP compatibility, flow control, and rate control, among other features.

InfiniBand subnet routing algorithms include the shortest-path Min-Hop algorithm, the UPDN (up-down) algorithm built on Min-Hop, and the Fat-Tree algorithm designed for fat-tree topologies.

The choice of routing algorithm is tied to the network topology, especially in high-performance computing and large clusters, where topology and link congestion directly affect overall performance. Because tree topologies are clear, easy to build, and easy to manage, Fat-Tree designs are often adopted to exploit InfiniBand's advantages in low-blocking or near-zero-blocking scenarios.

In a traditional tree architecture, whether two-tier or three-tier, the large number of access-layer nodes requires the aggregation and core layers to provide matching bandwidth and switching capacity; otherwise the topology suffers congestion.

To address this, the aggregation and core layers must use "fat" nodes, hence the name fat-tree: a fat node provides enough ports and bandwidth to match the leaf nodes below it, avoiding the congestion that "thin" nodes would cause.

Fat-Tree topologies consist of leaf and spine switches. Leaf switches connect to server or storage channel adapters, allocating some ports to nodes and the rest to uplinks into the network. In InfiniBand, Fat-Tree networks have the following characteristics:

Ports connected to the same downstream switch form a port group; switches at the same Rank must have identical upstream port groups, and the root Rank has no upstream group. Apart from leaf switches, switches at the same Rank must also have identical downstream port groups.

Each upstream port group within a Rank contains the same number of ports, and each downstream port group within a Rank also contains the same number of ports.

All endpoint HCAs reside at the same Rank level.

The diagram above shows a two-tier, non-blocking Fat-Tree example: the access layer provides 1,296 IB ports to servers or storage adapters, and its uplinks provide matching bandwidth to the aggregation layer. From the perspective of each 36-port access IB switch, 18 ports are allocated to downlinks and 18 to uplinks, yielding a non-blocking network. Fat-Tree topology thus delivers low-blocking data transfer and enhanced redundancy.
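To make that sizing concrete, here is a minimal sketch that derives the leaf and spine counts for such a fabric. The 36-port leaf split and the 1,296-port target come from the example above; the 648-port director-class spine and the helper arithmetic are our own assumptions for illustration only.

```c
#include <stdio.h>

/*
 * Sketch: size a two-tier non-blocking Fat-Tree.
 * Assumption (not from the article): each leaf splits its ports evenly
 * between downlinks (to HCAs) and uplinks (to spines), and total spine
 * capacity must cover the total uplink count.
 */
int main(void)
{
    int leaf_ports  = 36;    /* ports per leaf (edge) switch      */
    int spine_ports = 648;   /* ports per spine/director chassis  */
    int endpoints   = 1296;  /* HCA ports the fabric must provide */

    int down_per_leaf = leaf_ports / 2;              /* 18 downlinks */
    int up_per_leaf   = leaf_ports - down_per_leaf;  /* 18 uplinks   */

    int leaves        = (endpoints + down_per_leaf - 1) / down_per_leaf;
    int total_uplinks = leaves * up_per_leaf;
    int spines        = (total_uplinks + spine_ports - 1) / spine_ports;

    printf("leaves: %d, uplinks: %d, spines: %d\n",
           leaves, total_uplinks, spines);
    return 0;
}
```

With these inputs it reports 72 leaf switches, 1,296 uplinks, and 2 director-class spines: every endpoint port is matched by an uplink port, which is the non-blocking condition.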

Software Protocol Stack OFED

OFED (OpenFabrics Enterprise Distribution) provides low latency and high bandwidth for enterprise data centers (EDC), high‑performance computing (HPC), and embedded applications. All Mellanox adapters are compatible with the RDMA protocols of the OpenFabrics software stack. The OpenFabrics Alliance, founded in 2004, released the first OFED version in 2005.

Mellanox OFED is a unified software stack that includes drivers, middleware, user interfaces, and a suite of standard protocols such as IPoIB, SDP, SRP, iSER, RDS, and DAPL (Direct Access Programming Library). It supports MPI, Lustre/NFS over RDMA, provides the Verbs programming interface, and is maintained by the open‑source OpenFabrics community.

If the previous logical diagram appears complex, refer to the simplified illustration above. Mellanox OFED for Linux (MLNX_OFED_LINUX) is distributed as an ISO image containing source code, binary RPM packages, firmware, utilities, installation scripts, and documentation for each Linux distribution.
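As a small taste of the Verbs interface the stack exposes, the sketch below lists the local HCAs and prints the state and LID of port 1. It assumes a build against libibverbs (for example, gcc query_ib.c -libverbs) and is purely illustrative, not part of MLNX_OFED itself.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        /* Query the first port: state and the LID assigned by the SM. */
        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port 1 state=%d lid=%u\n",
                   ibv_get_device_name(devs[i]),
                   port.state, (unsigned)port.lid);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}
```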

InfiniBand Network Management

OpenSM is an InfiniBand Subnet Manager (SM) that runs on the Mellanox OFED stack to manage IB networks. It provides in-band management: it discovers the fabric topology and configures the forwarding that steers traffic through the data plane.

OpenSM comprises three components: the Subnet Manager (SM), the Subnet Administrator (SA), and the Performance Manager (PM). Running on a management host or embedded in managed switches, it offers management and monitoring capabilities such as automatic device discovery, configuration, fabric visualization, intelligent analysis, and health monitoring.

Parallel Computing Cluster Capability

MPI (Message Passing Interface) is a standard for parallel programming that lets many processors cooperate on a single problem, boosting computational power. Mellanox OFED for Linux ships InfiniBand-enabled MPI through Open MPI and OSU MVAPICH.

Open MPI is an open-source MPI-2 implementation from the Open MPI Project, while OSU MVAPICH is an MPI-1 implementation from Ohio State University.
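For context, the canonical MPI program below runs unchanged over InfiniBand once launched with either implementation's mpirun (for example, mpirun -np 4 ./hello, where the program name is ours); transport selection is handled by the MPI library and the OFED verbs layer underneath.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    /* Each rank reports where it runs; messages between ranks ride on
     * the InfiniBand fabric via the OFED verbs layer. */
    printf("rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```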

RDS (Reliable Datagram Sockets) is a socket API that provides reliable, in-order datagram delivery over InfiniBand RC connections or TCP/IP, and is used by Oracle RAC 11g.

Socket‑Based Network Application Capability

IPoIB/EoIB (IP/Ethernet over InfiniBand) implements network interfaces over InfiniBand; IPoIB encapsulates IP packets for transmission over InfiniBand links.

SDP (Socket Direct Protocol) offers TCP‑like byte‑stream semantics over InfiniBand, leveraging its offload capabilities to achieve lower latency and higher bandwidth.
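The point of both IPoIB and SDP is that existing socket applications keep working unchanged. The sketch below is an ordinary TCP client with nothing InfiniBand-specific in it; carrying it over the fabric is purely a matter of the peer address living on an IPoIB interface (or of transparently substituting SDP underneath). The address and port here are placeholders.

```c
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    /* Plain TCP socket: if 192.168.10.1 (placeholder) is configured on
     * an IPoIB interface such as ib0, this traffic crosses the IB
     * fabric with no code changes. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(5001);
    inet_pton(AF_INET, "192.168.10.1", &srv.sin_addr);

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    const char msg[] = "hello over IPoIB\n";
    send(fd, msg, sizeof(msg) - 1, 0);
    close(fd);
    return 0;
}
```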

Storage Support Capability

InfiniBand supports iSER (iSCSI Extensions for RDMA), NFSoRDMA (NFS over RDMA), and SRP (SCSI RDMA Protocol). SRP encapsulates SCSI commands for transport over RDMA, letting hosts reach shared storage devices through RDMA-based communication.

RDMA (Remote Direct Memory Access) moves data directly between the memory of remote systems across the network, bypassing the operating-system kernel on the data path. This cuts server-side processing latency and memory-copy overhead, freeing CPU cycles for the application.
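Below is a minimal sketch of the first step every RDMA application takes: registering a memory region so the HCA can read and write it directly. Device selection, queue-pair setup, and error handling are trimmed, and the 1 MB buffer size is an arbitrary choice for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer so the adapter can DMA into it directly,
     * bypassing the kernel on the data path. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* lkey/rkey are the handles local and remote peers place in
     * work requests to address this region. */
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```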

Mellanox Product Introduction

Mellanox is a leading provider of end‑to‑end connectivity solutions for servers and storage, dedicated to InfiniBand and Ethernet interconnect products, and is recognized as a benchmark for ultra‑high‑speed networking. Below we focus on InfiniBand and related product offerings.

InfiniBand products, combined with advanced VPI technology, meet diverse port‑level requirements. The portfolio includes VPI series NICs and switches, chipset families that ensure reliability, a wide range of cables for high‑speed interconnects, and complementary acceleration and management software.

InfiniBand Switches

InfiniBand switches provide point-to-point high-speed communication, forwarding packets between ports based on the destination LID (Local Identifier). Current switches support 18 to 864 nodes and a range of speeds: SDR (10 Gbps), DDR (20 Gbps), QDR (40 Gbps), FDR10 (40 Gbps), and FDR (56 Gbps).

SwitchX and SwitchX-2 silicon supports the 10, 20, 40, and 56 Gbps IB speeds; the next-generation Switch-IB adds 100 Gbps EDR support while remaining backward compatible with earlier generations.

Using ConnectX NICs and SwitchX switches, Mellanox enables Virtual Protocol Interconnect (VPI) to bridge Ethernet and InfiniBand. VPI supports whole‑system, port‑level, and bridge modes, allowing a single physical switch to operate in either InfiniBand or Ethernet mode, or both simultaneously.

Edge (rack) InfiniBand switches support 8 to 36 ports, offering non-blocking 40-100 Gbps links. In a 1U form factor they deliver up to 7.2 Tbps of switching capacity (36 ports at 100 Gbps, counting both directions), making them ideal leaf nodes for small-to-medium non-blocking clusters. They support advanced InfiniBand features such as adaptive routing, congestion control, and QoS.

Core InfiniBand switches support 108 to 648 ports with full‑duplex 40‑100 Gbps links, delivering 8.4 Tbps to 130 Tbps within a single chassis and scaling to thousands of ports. Redundant design ensures carrier‑grade availability for mission‑critical workloads.

InfiniBand Adapters

Host Channel Adapters (HCAs) are PCIe cards (typically PCIe x8, single- or dual-port) that connect servers to the InfiniBand fabric. Current mainstream adapters support QDR and FDR speeds based on the ConnectX-3 chip.

Target Channel Adapters (TCAs) provide InfiniBand connectivity to I/O devices such as storage or gateway equipment.

InfiniBand Routers and Gateway Devices

InfiniBand routers forward packets between different subnets. Mellanox’s SB7780, based on the Switch‑IB ASIC, offers 100 Gbps EDR ports and can interconnect heterogeneous topologies, allowing storage subnets to use Fat‑Tree while compute subnets employ ring topologies optimized for their workloads.

The SX6036G gateway, built on Mellanox’s sixth‑generation SwitchX‑2, provides high‑performance, low‑latency 56 Gbps FDR InfiniBand to 40 Gbps Ethernet conversion and supports Virtual Protocol Interconnect (VPI) for simultaneous InfiniBand and Ethernet operation.

InfiniBand Cables and Transceivers

Mellanox LinkX interconnect products include copper, active optical, and transceiver modules for single‑mode and multimode fiber, supporting speeds of 10, 25, 40, 50, and 100 Gbps.

LinkX series also offers 200 Gbps and 400 Gbps cables and transceivers, enabling end‑to‑end 200 Gbps solutions for InfiniBand infrastructures.



Tags: cloud computing, High Performance Computing, InfiniBand, Mellanox, Fat Tree, data center networking, OFED
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
