Tagged articles
13 articles
Architects' Tech Alliance
Oct 12, 2025 · Artificial Intelligence

How InfiniBand Powers AI Training: Deep Dive into RDMA, RoCEv2, and High‑Speed Interconnects

This article explains how InfiniBand’s architecture, native RDMA, GPUDirect, and evolving bandwidth enable ultra‑low‑latency, high‑throughput communication for AI model training, compares it with Ethernet, and details the role of RoCEv2 and other high‑performance interconnect technologies.

AI training · GPU interconnect · High‑Performance Networking
0 likes · 9 min read
Architects' Tech Alliance
May 26, 2025 · Fundamentals

Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training

The article explains how distributed AI training performance depends on reducing inter‑GPU communication latency, introduces RDMA and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability with traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.

Distributed Training · InfiniBand · RDMA
0 likes · 12 min read
AI Cyberspace
Feb 24, 2025 · Cloud Computing

Scaling AI Training: Inside Large-Scale RDMA Networks and Modern Congestion Controls

This article explores the hardware and networking foundations for training massive AI models, detailing the challenges of large‑scale RDMA deployment, the evolution of congestion‑control algorithms like DCQCN, TIMELY, HPCC, and AWS's SRD, and how hardware offload and programmable switches enable scalable, low‑latency AI infrastructure.

AWS SRD · DCQCN · HPCC
0 likes · 14 min read
AI Cyberspace
Feb 22, 2025 · Cloud Computing

Why RoCEv2 Needs a Lossless Network and How to Achieve It

RDMA, originally built for InfiniBand, was adapted to Ethernet as RoCE. RoCEv2 adds IP/UDP headers to enable L3 routing but is highly sensitive to packet loss, so it requires a lossless network and relies on technologies such as PFC, ECN, DCQCN, and multi‑path transmission to maintain high RDMA performance.

DCQCN · ECN · PFC
0 likes · 17 min read
Architects' Tech Alliance
May 19, 2024 · Industry Insights

InfiniBand vs RoCEv2: Which High‑Performance Network Wins AI Compute?

With AI models growing to billions of parameters, the choice of high‑performance interconnect—InfiniBand or RoCEv2—directly impacts training speed, scalability, latency, and operational complexity, and this article analyzes their architectures, performance metrics, vendor ecosystems, and suitability for large‑scale AI clusters.

AI · Distributed Training · High‑performance computing
0 likes · 13 min read
Architects' Tech Alliance
May 11, 2024 · Industry Insights

Why Network Interconnects Are the New Bottleneck for Large‑Model AI Training

The rapid growth of AI large‑model training and inference is driving unprecedented demand for compute and high‑speed networking, prompting a shift from traditional GPU clusters to super‑pooled intelligent computing centers that must balance multiple intra‑ and inter‑node interconnect solutions such as NVLink, OAM/UBB, InfiniBand and RoCEv2.

AI · Data center · InfiniBand
0 likes · 6 min read
Linux Code Review Hub
Mar 6, 2024 · Operations

Advanced Congestion Management Techniques for Lossless Ethernet Storage Networks

The article examines high‑level strategies for preventing and recovering from congestion in lossless Ethernet storage networks, including disconnecting faulty devices, early frame dropping, traffic isolation, endpoint notifications, rate limiting, pause‑timeout, PFC watchdog mechanisms, detailed Cisco configuration commands, and the benefits and limitations of each approach.

Cisco Nexus · Congestion Management · ECN
0 likes · 33 min read
Architects' Tech Alliance
Aug 10, 2023 · Industry Insights

InfiniBand vs RoCEv2: Which Network Powers AI Model Training?

This article examines the architecture of AI compute clusters, explaining offline training and inference pipelines, the role of RDMA, and the technical differences between InfiniBand and RoCEv2—including latency, bandwidth, scalability, cost, and vendor considerations—to help engineers choose the optimal high‑performance network for large‑model training.

AI compute · Distributed Training · High‑Performance Networking
0 likes · 13 min read
Architects' Tech Alliance
Mar 31, 2021 · Operations

NVMe over RoCEv2 Network Architecture, Control Optimization Requirements, and Test Specification

This article details the NVMe‑over‑RoCEv2 network architecture, defines plug‑and‑play and fast‑fault detection mechanisms, outlines IP domain management, LLDP and state‑notification requirements, security considerations, and provides test scenarios and tools for validating high‑performance storage networking.

LLDP · NVMe · RoCEv2
0 likes · 14 min read