Why InfiniBand Beats TCP/IP: Deep Dive into Architecture and Socket Direct
This article explains how InfiniBand's RDMA-based architecture, layered protocol stack, and Mellanox Socket Direct technology deliver higher bandwidth, lower latency, and better CPU efficiency than traditional TCP/IP networks. It also summarizes test results in which a Socket Direct adapter reduced latency by about 80% compared with a standard adapter.
Background and Motivation
Traditional TCP/IP's layered software stack incurs significant buffering, latency, and operating-system overhead, which limits performance in large-scale clusters. As demand grew for an open, high-bandwidth, low-latency, and highly reliable interconnect, InfiniBand (IB) emerged as a switched-fabric architecture that addresses these challenges.
Key Features of InfiniBand
InfiniBand uses RDMA (Remote Direct Memory Access) to let a server read or write a remote server's memory without involving the kernel on the data path. The adapter moves the data itself, which preserves bus-like bandwidth and latency while reducing CPU load. This makes IB especially suitable for storage-heavy clusters and high-performance computing.
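To make the one-sided nature of an RDMA write concrete, here is a minimal toy model in Python. It is not the verbs API and moves no data over a network: `ToyHCA`, `MemoryRegion`, and the `rkey` value are hypothetical names chosen for illustration. The point it sketches is that the target registers a buffer once, and later writes are validated and placed by the adapter without any handler code running on the target host.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A registered buffer that a remote peer may access with the right key."""
    buf: bytearray
    rkey: int
    remote_write: bool = True


@dataclass
class ToyHCA:
    """Stands in for the adapter, which owns the table of registered regions."""
    regions: dict = field(default_factory=dict)

    def register(self, size: int, rkey: int) -> MemoryRegion:
        """Setup step: the host registers memory once, before any transfer."""
        mr = MemoryRegion(bytearray(size), rkey)
        self.regions[rkey] = mr
        return mr

    def rdma_write(self, rkey: int, offset: int, data: bytes) -> None:
        """One-sided write: checked and placed by the adapter alone."""
        mr = self.regions.get(rkey)
        if mr is None or not mr.remote_write:
            raise PermissionError("invalid rkey or remote write not permitted")
        if offset < 0 or offset + len(data) > len(mr.buf):
            raise ValueError("write falls outside the registered region")
        mr.buf[offset:offset + len(data)] = data


# The target registers 16 bytes; the initiator later writes into them.
target = ToyHCA()
region = target.register(size=16, rkey=0x1234)
target.rdma_write(rkey=0x1234, offset=4, data=b"hello")
```

In a real deployment the registration, key exchange, and write are done through an RDMA library such as libibverbs, and the bounds and permission checks happen in adapter hardware.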
InfiniBand Protocol Stack
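Physical Layer: Serial links built from 1, 4, 8, or 12 lanes (1x, 4x, 8x, 12x). The per-lane rate depends on the generation; an FDR 4x link, for example, runs at about 14 Gb/s per lane for 56 Gb/s in total.
Link Layer: Credit-based flow control lets the sender transmit only when the receiver has advertised enough free buffer space, so packets are not dropped for lack of buffering. QoS is provided by Virtual Lanes (VL0-VL15, with VL15 reserved for subnet management traffic) and by Service Levels (SL) that are mapped onto VLs for priority scheduling.
Network Layer: Adds a Global Route Header (GRH) carrying 128-bit, IPv6-format addresses (GIDs) so that packets can be routed across subnets.
Transport Layer: Delivers packets to the correct Queue Pair (QP) and handles channel multiplexing. When a message exceeds the MTU, this layer segments it into multiple packets on the sender and reassembles them on the receiver.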
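The credit-based flow control described for the link layer can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class `CreditLink` is hypothetical, one credit stands for one whole packet, and there is a single lane. Real InfiniBand links track credits per Virtual Lane in units of buffer blocks. The sketch shows the key property: a sender without credits stalls instead of transmitting, so the receiver never has to drop a packet.

```python
from collections import deque


class CreditLink:
    """Toy link: the sender may transmit only while it holds credits."""

    def __init__(self, receiver_buffers: int):
        self.credits = receiver_buffers   # free buffers advertised by the receiver
        self.rx_queue = deque()           # packets waiting in the receiver's buffers

    def send(self, packet: bytes) -> bool:
        """Transmit if a credit is available; otherwise hold the packet back."""
        if self.credits == 0:
            return False                  # sender stalls; nothing is dropped
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def receiver_drain(self) -> bytes:
        """Receiver consumes one packet and hands a credit back to the sender."""
        packet = self.rx_queue.popleft()
        self.credits += 1
        return packet


link = CreditLink(receiver_buffers=2)
first_burst = [link.send(b"p%d" % i) for i in range(3)]  # third send must wait
drained = link.receiver_drain()                           # frees one buffer
retry = link.send(b"p2")                                  # now succeeds
```

Contrast this with a lossy network, where the sender transmits regardless of receiver state and relies on retransmission after a drop.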
Fabric Architecture
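IB devices include Channel Adapters (CA), Switches, and Routers. A CA can be a Host Channel Adapter (HCA) for compute nodes or a Target Channel Adapter (TCA) for storage and I/O devices. Local Identifiers (LIDs) are 16 bits wide, so a single subnet can address tens of thousands of endpoints (roughly 48K unicast LIDs once the multicast and reserved ranges are excluded). Each subnet is managed by a Subnet Manager, which assigns LIDs and configures devices through the Subnet Management Agent on each node.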
Switches forward traffic within a subnet based on the destination LID in the Local Route Header (LRH), while Routers connect different subnets using the IPv6-format address in the GRH. The overall fabric follows a switched-fabric topology, enabling direct, high-speed paths between endpoints.
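The difference between switch forwarding and router forwarding can be illustrated with a small Python sketch. The table contents, port numbers, subnet names, and function names below are all made up for illustration. The sketch assumes that the upper 64 bits of a 128-bit GID identify the subnet, that a switch only looks up the destination LID, and that a router only looks at the destination GID's subnet prefix.

```python
import ipaddress


def gid_subnet_prefix(gid: str) -> int:
    """Return the upper 64 bits of a 128-bit, IPv6-format GID."""
    return int(ipaddress.IPv6Address(gid)) >> 64


def switch_forward(forwarding_table: dict, dlid: int) -> int:
    """Within a subnet: map the destination LID to an output port."""
    return forwarding_table[dlid]


def router_forward(routes: dict, dgid: str) -> str:
    """Between subnets: map the destination GID's subnet prefix to a next hop."""
    return routes[gid_subnet_prefix(dgid)]


# Hypothetical tables: two LIDs behind ports 3 and 7, and two remote subnets.
forwarding_table = {0x0011: 3, 0x0012: 7}
routes = {
    0xFEC0000000000001: "subnet-A",
    0xFEC0000000000002: "subnet-B",
}

port = switch_forward(forwarding_table, 0x0012)
next_hop = router_forward(routes, "fec0:0:0:2::1")
```

Keeping the switch lookup to a small LID table is part of what keeps per-hop latency low; the larger GID is only examined when a packet leaves its subnet.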
Mellanox Socket Direct Technology
Mellanox's Socket Direct splits a PCIe x16 HCA into two PCIe x8 cards (Main and Auxiliary) and attaches each to a separate CPU socket in a dual-socket server. Each CPU then reaches the network through its own PCIe x8 interface instead of crossing the inter-processor bus, which reduces inter-CPU traffic, lowers latency, and improves overall system throughput.
The solution also includes a dedicated SAS cable that links the two PCIe cards, so that both sockets share the same adapter while network traffic stays off the CPU interconnect.
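Whether an adapter is local to a given CPU socket can be checked on Linux, which exposes the NUMA node of each PCIe device through sysfs. The helper below is a minimal sketch: the function names are hypothetical, and the device name `mlx5_0` is only an example that will differ between systems. It reads the `numa_node` attribute for an RDMA device and reports whether it matches the node a workload runs on.

```python
from pathlib import Path
from typing import Optional


def hca_numa_node(device: str,
                  sysfs_root: str = "/sys/class/infiniband") -> Optional[int]:
    """Return the NUMA node the adapter's PCIe function is attached to.

    Returns None if the device does not exist or the kernel does not know.
    """
    path = Path(sysfs_root) / device / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None  # the kernel reports -1 when unknown


def is_local_to(device: str, cpu_node: int,
                sysfs_root: str = "/sys/class/infiniband") -> bool:
    """True if the adapter sits on the same NUMA node as the given CPU node."""
    return hca_numa_node(device, sysfs_root) == cpu_node


if __name__ == "__main__":
    # Example device name; list /sys/class/infiniband to see the real ones.
    print(hca_numa_node("mlx5_0"))
```

On a dual-socket server with a single conventional adapter, threads pinned to the other NUMA node must cross the CPU interconnect to reach it, which is the traffic Socket Direct is designed to avoid.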
Performance Evaluation
The tests compared a ConnectX-based Socket Direct adapter in a dual-socket server with a standard PCIe x16 100 Gb/s adapter attached to a single socket. They measured TCP throughput, latency, CPU utilization, and RDMA benchmarks. The Socket Direct configuration showed an average latency reduction of about 80%, along with higher throughput and lower CPU usage, which is attributed to the direct PCIe-to-CPU path.
OpenFabrics Software Stack
The OpenFabrics Enterprise Distribution (OFED) provides kernel drivers, RDMA APIs, and support for various transports (iWARP, RoCE, InfiniBand). It enables high‑performance messaging (MPI), storage protocols (iSER, NFS‑RDMA, SRP), and integrates with Ethernet‑based fabrics, making it a versatile foundation for modern data‑center and HPC environments.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.