Why RDMA Is Overtaking TCP/IP in HPC: OSI, Leaf‑Spine, and NVIDIA SuperPOD Explained

This article analyzes how the traditional OSI/TCP‑IP model is giving way to RDMA in high‑performance computing, compares Ethernet, InfiniBand and RoCE, evaluates leaf‑spine versus three‑tier data‑center designs, and examines NVIDIA SuperPOD architectures with detailed technical metrics.

Architects' Tech Alliance

Apr 21, 2025

OSI Protocol and Transition to RDMA in HPC

Protocols define the rules for data exchange in computer networks; the OSI seven‑layer model, introduced in the 1980s, standardizes communication from the physical layer up to the application layer.

From the bottom up: the physical layer specifies hardware signaling; the data link layer handles framing and error control; the network layer manages logical addressing and routing; the transport layer provides end-to-end delivery, including reliability; the session layer coordinates connections; the presentation layer formats data; and the application layer provides services such as email and file transfer.

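In practice, the TCP/IP suite collapses these seven layers into four: application (absorbing the session and presentation layers), transport, network, and data link (absorbing the physical layer). This simpler model is the one real-world implementations follow.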
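The layer collapse described above can be written down as a simple lookup table. This is only an illustrative sketch: the dictionary name and helper function are made up for this example, and the layer names follow the four-layer naming used in this article.

```python
# Illustrative mapping of the seven OSI layers onto the four TCP/IP layers
# named in this article. OSI_TO_TCPIP and tcpip_layers are hypothetical names.
OSI_TO_TCPIP = {
    "application": "application",
    "presentation": "application",
    "session": "application",
    "transport": "transport",
    "network": "network",
    "data link": "data link",
    "physical": "data link",
}


def tcpip_layers():
    """Return the distinct TCP/IP layers, top of the stack first."""
    seen = []
    for layer in OSI_TO_TCPIP.values():
        if layer not in seen:
            seen.append(layer)
    return seen


print(tcpip_layers())
# ['application', 'transport', 'network', 'data link']
```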

High-performance computing (HPC) demands high throughput and low latency, which has driven the adoption of Remote Direct Memory Access (RDMA). RDMA lets the network adapter read and write application memory directly, bypassing the operating system kernel and sparing the CPU most of the per-packet work. RDMA does not prescribe a full protocol stack, but it places strict demands on the underlying transport: minimal packet loss, high bandwidth, and low latency. RDMA implementations include InfiniBand, which uses its own dedicated fabric, and the Ethernet-based RoCE and iWARP, each with distinct technical and cost considerations.

Leaf‑Spine vs. Traditional Three‑Tier Architecture

Switches and gateways operate at different OSI layers: Layer 2 switches forward frames at the data-link layer using MAC addresses, while gateways (routers) forward packets at the network layer using IP addresses.

Traditional three-tier data-center designs consist of access, aggregation, and core layers. Access switches, usually top-of-rack (ToR), connect directly to servers; aggregation switches sit between the access and core layers; and core switches provide high-speed forwarding for traffic entering and leaving the data center.

These designs have three main weaknesses: bandwidth is wasted because the Spanning Tree Protocol (STP) blocks redundant links to prevent loops, failure domains are large, and latency grows as traffic traverses multiple switch tiers.

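Leaf-spine architecture flattens the network into two tiers: leaf switches take the role of the traditional access switches, and spine switches take the role of the core, forming a high-speed backbone to which every leaf connects. Equal-cost multi-path (ECMP) routing spreads traffic dynamically across all leaf-to-spine paths, so the failure of a single spine removes only that spine's share of the bandwidth rather than cutting connectivity.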
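A minimal sketch of the ECMP idea follows, assuming a leaf chooses a spine by hashing a flow's 5-tuple and that every spine carries an equal share of the uplink bandwidth. The function names, addresses, and the use of CRC32 are illustrative assumptions; real switches use vendor-specific hash functions.

```python
import zlib


def pick_spine(flow, spines):
    """Hash a flow's 5-tuple onto one of the currently available spines."""
    key = "|".join(str(field) for field in flow).encode()
    return spines[zlib.crc32(key) % len(spines)]


def remaining_capacity(total_spines, failed_spines):
    """Fraction of leaf-to-spine bandwidth left, assuming equal uplinks."""
    return (total_spines - failed_spines) / total_spines


spines = ["spine-1", "spine-2", "spine-3", "spine-4", "spine-5"]
# Hypothetical flow: (source IP, destination IP, protocol, source port, destination port)
flow = ("10.0.0.1", "10.0.1.7", 17, 49152, 4791)

print(pick_spine(flow, spines))          # the same flow always maps to the same spine
survivors = [s for s in spines if s != "spine-3"]
print(pick_spine(flow, survivors))       # after a failure, flows are rehashed onto survivors
print(remaining_capacity(5, 1))          # 0.8
```

Under these assumptions, losing one of N spines costs 1/N of the fabric bandwidth, so the impact shrinks as the number of spines grows.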

NVIDIA SuperPOD Architecture Deep Dive

SuperPOD clusters connect multiple compute nodes for extreme throughput. The NVIDIA DGX A100 SuperPOD, for example, uses QM8790 switches with 40 × 200 Gbps ports.

Each DGX A100 node has eight NICs, each linking to a leaf switch. Twenty servers form one SU (Scalable Unit), which requires eight leaf switches and five spine switches. As clusters scale, the server-to-switch ratio is about 1:1.17 for DGX A100, 1:1.34 for DGX H100, and 1:0.50 for larger H100 deployments.
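The figures above can be checked with some quick port arithmetic. The sketch below takes the node, NIC, and switch counts from the text, and additionally assumes that node links are spread evenly across the leaves and that each leaf uses its remaining ports as uplinks. The function name and these wiring assumptions are illustrative, not NVIDIA's published cabling plan.

```python
def su_port_budget(nodes=20, nics_per_node=8, leaves=8, spines=5, switch_ports=40):
    """Count the links in one SU, assuming node links are spread evenly over the leaves."""
    node_links = nodes * nics_per_node                  # server-facing links
    downlinks_per_leaf = node_links // leaves
    # Assumption: each leaf uses all of its remaining ports as uplinks to the spines.
    uplinks_per_leaf = switch_ports - downlinks_per_leaf
    spine_ports_needed = leaves * uplinks_per_leaf
    return {
        "node_links": node_links,
        "downlinks_per_leaf": downlinks_per_leaf,
        "uplinks_per_leaf": uplinks_per_leaf,
        "ports_used_per_spine": spine_ports_needed / spines,
    }


budget = su_port_budget()
print(budget)
# {'node_links': 160, 'downlinks_per_leaf': 20, 'uplinks_per_leaf': 20, 'ports_used_per_spine': 32.0}
```

Under these assumptions, each leaf splits its 40 ports evenly between servers and spines, and each spine uses 32 of its 40 ports.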

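The DGX H100 SuperPOD recommends QM9700 switches with 64 × 400 Gbps ports. NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology builds Streaming Aggregation Trees (SATs) that aggregate data inside the switches as it crosses the fabric, reducing latency and improving the performance of collective operations. QM8700/QM8790 switches paired with ConnectX-6 (CX6) adapters support up to two SATs, while QM9700/QM9790 switches paired with ConnectX-7 (CX7) adapters support up to 64. The higher port count also means fewer switches are needed for a given number of nodes.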
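To illustrate why aggregating inside the fabric helps, here is a toy model of a two-level aggregation tree: each leaf sums the vectors from its own hosts, and the spine sums the leaf partials. It is a conceptual sketch only, not SHARP's actual protocol or API; all names and data values are invented for the example.

```python
def reduce_vectors(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def tree_reduce(hosts_by_leaf):
    """Sum host vectors at each leaf, then sum the leaf partials at the spine."""
    leaf_partials = [reduce_vectors(hosts) for hosts in hosts_by_leaf.values()]
    uplink_messages = len(leaf_partials)  # one partial result per leaf crosses an uplink
    return reduce_vectors(leaf_partials), uplink_messages


# Hypothetical data: two leaves, two hosts each, one 2-element vector per host.
hosts_by_leaf = {
    "leaf-1": [[1, 2], [3, 4]],
    "leaf-2": [[5, 6], [7, 8]],
}

total, uplink_messages = tree_reduce(hosts_by_leaf)
host_count = sum(len(hosts) for hosts in hosts_by_leaf.values())
print(total, uplink_messages, host_count)
# [16, 20] 2 4
```

In this toy model only two partial results travel over the uplinks instead of four host vectors, which is the kind of traffic reduction in-network aggregation aims for.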

Switch Selection: Ethernet, InfiniBand, and RoCE

Ethernet switches rely on TCP/IP, while InfiniBand and RoCE leverage RDMA for lower latency and higher bandwidth. All three can achieve up to 400 Gbps, but InfiniBand offers the greatest scalability, supporting tens of thousands of nodes in a single subnet.

Performance-wise, TCP/IP incurs CPU overhead and higher latency; RoCE improves efficiency on existing Ethernet infrastructure, whereas InfiniBand delivers superior raw throughput with dedicated hardware.

Management favors TCP/IP for its simplicity and widespread tooling, while InfiniBand and RoCE require specialized expertise.

Cost considerations show InfiniBand’s high‑end ports are expensive, making Ethernet‑based RoCE a more budget‑friendly alternative for many enterprises.

Device compatibility differs: RoCE and TCP/IP operate over standard Ethernet switches, whereas InfiniBand requires dedicated IB switches and compatible NICs, limiting flexibility.
