Why RDMA Is Essential for Future HPC Data Centers: From TCP Limits to RoCEv2

This article analyzes how the shift of data centers toward compute-centric architectures drives the need for high-performance networking, explains the shortcomings of TCP/IP, compares the InfiniBand, iWARP and RoCE protocols, and shows why lossless Ethernet with RDMA is critical for modern HPC workloads.

Architects' Tech Alliance

Jul 18, 2022

As 5G, big data, IoT and AI become pervasive, society is moving toward an intelligent era where data centers evolve from resource‑scale to compute‑scale, making network performance a key factor in overall compute efficiency.

With process technology now at 3 nm, single-core performance gains have largely stalled, and adding more cores raises power consumption, so the industry is turning to high-performance computing (HPC) to meet growing compute demands. HPC workloads are expanding from P-class (petascale, on the order of 10^15 FLOPS) to E-class (exascale, on the order of 10^18 FLOPS), which requires larger clusters and tighter integration between compute and network.

What Is HPC?

High‑Performance Computing (HPC) aggregates massive compute power to solve scientific, industrial, and simulation problems that ordinary workstations cannot handle. It distributes data and computation across many nodes to overcome the limits of a single machine.

Network Requirements for HPC

Loosely coupled scenarios: Low inter-node dependency and modest network performance needs (e.g., financial risk assessment, remote sensing, molecular dynamics).

Tightly coupled scenarios: Heavy synchronization and data exchange between nodes, requiring ultra-low latency (e.g., electromagnetic simulation, fluid dynamics, automotive crash analysis).

Data-intensive scenarios: Nodes process large data volumes and generate substantial intermediate data, needing high throughput and reasonable latency (e.g., weather forecasting, genome sequencing, graphics rendering, energy exploration).

Both high throughput and low latency are essential, and the industry typically adopts RDMA to achieve these goals.

Why RDMA?

Traditional TCP/IP stacks add tens of microseconds of latency per packet because of multiple context switches, data copies, and protocol processing on the CPU. This fixed latency becomes a bottleneck for microsecond-scale AI computations and SSD-based distributed storage. TCP/IP also consumes significant CPU resources: as a rule of thumb, sustaining 1 bit/s of TCP throughput takes roughly 1 Hz of CPU clock, so CPU utilization can exceed 50 % at a 25 Gbps line rate.
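The CPU-cost rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function name and the server configuration (16 cores at 2.5 GHz) are assumptions chosen for the example, not figures from the article, and real utilization depends heavily on NIC offloads and kernel tuning.

```python
def tcp_cpu_utilization(link_gbps: float, cores: int, core_ghz: float,
                        hz_per_bps: float = 1.0) -> float:
    """Estimate the fraction of a server's CPU cycles spent on TCP/IP.

    Applies the "about 1 Hz per 1 bit/s" rule of thumb; hz_per_bps lets
    you try other ratios. Returns cycles needed / cycles available.
    """
    cycles_needed = link_gbps * 1e9 * hz_per_bps   # cycles per second for the stack
    cycles_available = cores * core_ghz * 1e9      # aggregate cycles per second
    return cycles_needed / cycles_available


if __name__ == "__main__":
    # Hypothetical server: 16 cores at 2.5 GHz (40 GHz aggregate).
    for gbps in (10, 25, 40):
        util = tcp_cpu_utilization(gbps, cores=16, core_ghz=2.5)
        print(f"{gbps:>3} Gbps -> {util:.1%} of CPU cycles")
```

Under these assumed numbers, a 25 Gbps flow takes well over half of the machine's cycles and a 40 Gbps flow takes all of them, which is the kind of pressure that motivates offloading the transport to the NIC.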

RDMA (Remote Direct Memory Access) bypasses the kernel network stack: the NIC moves data directly between the memories of two machines without involving either host's CPU in the data path. The result is high bandwidth, latency as low as about 1 µs, and minimal CPU load. For example, a 40 Gbps TCP/IP flow can saturate a server's CPU, whereas carrying the same flow over RDMA reduces CPU usage from 100 % to about 5 % and cuts latency from the millisecond range to around 10 µs.

RDMA Network Protocol Options

InfiniBand: Designed specifically for RDMA, offering extremely high throughput and ultra-low latency with lossless guarantees. However, it requires dedicated InfiniBand switches, does not interoperate directly with IP networks, and suffers from vendor lock-in and limited market share (<1 % of Ethernet deployments).

iWARP: Enables RDMA over TCP, which allows the use of standard Ethernet switches but inherits TCP's performance penalties and loses most of RDMA's advantages.

RoCE (RDMA over Converged Ethernet): Extends RDMA to Ethernet. RoCEv1 operates at the link layer and is confined to a single broadcast domain, while RoCEv2 encapsulates the RDMA transport in UDP/IP, which makes it routable across layer-3 networks. Both require NICs that support RoCE and Ethernet switches that provide lossless Ethernet to maintain RDMA performance.
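To make the difference between the two RoCE encapsulations concrete, here is a minimal sketch that lists the per-packet headers of each version and computes the resulting overhead. The header sizes are the standard ones (untagged Ethernet, IPv4 without options, InfiniBand GRH and BTH); the layer lists and helper names are illustrative, and the model ignores VLAN tags, preamble and inter-frame gap, and extended transport headers such as RETH.

```python
# (layer name, size in bytes) for each encapsulation, outermost first.
ROCE_V1_HEADERS = [
    ("Ethernet (EtherType 0x8915)", 14),
    ("InfiniBand GRH", 40),          # global route header, not IP-routable
    ("InfiniBand BTH", 12),          # base transport header
]
ROCE_V2_HEADERS = [
    ("Ethernet", 14),
    ("IPv4", 20),                    # assumes no IP options
    ("UDP (destination port 4791)", 8),
    ("InfiniBand BTH", 12),
]
TRAILERS = [("ICRC", 4), ("Ethernet FCS", 4)]


def overhead_bytes(headers):
    """Total header plus trailer bytes added to each RDMA payload."""
    return sum(size for _, size in headers + TRAILERS)


def wire_efficiency(headers, payload_bytes):
    """Payload share of the frame under this simplified model."""
    return payload_bytes / (payload_bytes + overhead_bytes(headers))


if __name__ == "__main__":
    for name, headers in (("RoCEv1", ROCE_V1_HEADERS), ("RoCEv2", ROCE_V2_HEADERS)):
        print(f"{name}: {overhead_bytes(headers)} bytes overhead, "
              f"{wire_efficiency(headers, 4096):.2%} efficient at 4096-byte payload")
```

The point of the comparison is less the byte count than the layering: RoCEv2 replaces the InfiniBand GRH with ordinary IP and UDP headers, so standard layer-3 routers can forward it.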

InfiniBand delivers the best raw performance, but its proprietary nature limits adoption. iWARP sacrifices performance for compatibility. RoCE, especially RoCEv2, offers a practical compromise by leveraging existing Ethernet infrastructure while still requiring lossless Ethernet support.

Loss Sensitivity and the Need for Lossless Ethernet

RDMA is highly sensitive to packet loss: once the loss rate rises above 0.001 % (1e-5), throughput drops sharply, and at a loss rate of 0.01 % it can fall to zero. Keeping RDMA throughput unaffected therefore means holding the loss rate below 1e-5. Because traditional Ethernet is best-effort and drops packets under congestion, switches must implement lossless Ethernet mechanisms such as PFC (Priority-based Flow Control) for RDMA to deliver its promised high throughput and low latency.
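A loss rate of 1e-5 sounds negligible until it is multiplied by the packet rate of a fast link. The sketch below does that arithmetic; the 100 Gbps link speed and 4096-byte packet size are assumptions picked for illustration, and header overhead is ignored.

```python
def packets_per_second(link_gbps: float, packet_bytes: int) -> float:
    """Packets per second at line rate, ignoring header overhead."""
    return link_gbps * 1e9 / (packet_bytes * 8)


def expected_drops_per_second(link_gbps: float, packet_bytes: int,
                              loss_rate: float) -> float:
    """Average number of packets lost per second at a given loss rate."""
    return packets_per_second(link_gbps, packet_bytes) * loss_rate


if __name__ == "__main__":
    # Assumed link: 100 Gbps carrying 4096-byte packets.
    for loss_rate in (1e-6, 1e-5, 1e-4):
        drops = expected_drops_per_second(100, 4096, loss_rate)
        print(f"loss rate {loss_rate:.0e}: about {drops:.1f} drops per second")
```

Even at the 1e-5 threshold, such a link would see tens of drops every second, each one triggering recovery in the RDMA transport, which is why the network is engineered to avoid congestion drops rather than to tolerate them.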

Consequently, deploying RDMA over RoCEv2 on lossless Ethernet has become the prevailing strategy for scaling high‑performance, distributed applications in modern data centers.
