Why InfiniBand Is the Secret Weapon for AIGC Training Performance
The article examines how InfiniBand’s specialized features—collective communication, in‑network computing, adaptive routing, congestion control, cut‑through forwarding, shallow buffering, and self‑healing—are optimized for large‑scale AI‑generated content (AIGC) training, delivering higher bandwidth, lower latency, and greater fault tolerance than Ethernet alternatives.
Introduction
In AIGC training scenarios, customers with large budgets often select InfiniBand as the inter‑node networking solution for AI servers because it provides low‑latency, high‑bandwidth communication.
Collective Computational Power
Collective communication algorithms coordinate distributed nodes during model training, reducing communication overhead and accelerating convergence. NVIDIA’s NCCL (NVIDIA Collective Communication Library) implements all‑reduce, all‑gather, reduce, broadcast, reduce‑scatter, and point‑to‑point patterns. NCCL is optimized for PCIe and NVLink and can scale across multiple machines using NVSwitch, InfiniBand, or Ethernet.
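To make concrete what one of these collectives computes, here is a minimal pure-Python sketch of the ring all-reduce pattern. This is an illustration of the algorithm only, not NCCL's actual implementation, and the chunking assumes the vector length is divisible by the number of ranks.

```python
# Ring all-reduce sketch: after the call, every rank holds the element-wise
# sum of all ranks' vectors. The ring variant runs in 2*(N-1) steps, moving
# one chunk per rank per step, so each rank transfers ~2*(N-1)/N of the data.
def ring_all_reduce(vectors):
    n = len(vectors)
    c = len(vectors[0]) // n  # chunk size (assumes divisibility for clarity)
    bufs = [[list(v[i * c:(i + 1) * c]) for i in range(n)] for v in vectors]

    # Phase 1: reduce-scatter. After n-1 steps, rank r owns the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot sends first to model the simultaneous exchange.
        sends = [(r, (r - step) % n, bufs[r][(r - step) % n][:]) for r in range(n)]
        for r, idx, data in sends:
            dst = bufs[(r + 1) % n][idx]
            for i, x in enumerate(data):
                dst[i] += x

    # Phase 2: all-gather. Circulate each reduced chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, bufs[r][(r + 1 - step) % n][:]) for r in range(n)]
        for r, idx, data in sends:
            bufs[(r + 1) % n][idx] = data

    return [[x for chunk in b for x in chunk] for b in bufs]
```

The same exchange pattern underlies NCCL's ring algorithm on real hardware, where the per-step transfers map onto NVLink or InfiniBand sends.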
In‑Network Computing (SHARP)
InfiniBand switches based on NVIDIA Quantum ASICs support the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). SHARP builds an aggregation tree inside the fabric: the switches themselves act as aggregation nodes, so reduction data traverses each link only once rather than being shuttled repeatedly between endpoints. In NCCL benchmarks this can nearly double the effective all-reduce bandwidth compared with the same network running without SHARP.
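A back-of-the-envelope data-movement model (my own illustration with assumed formulas, not vendor figures) shows where the roughly 2x gain comes from:

```python
# Bytes crossing each host's link for an all-reduce of d_bytes across n_hosts.
def bytes_per_host_link(d_bytes, n_hosts, sharp=False):
    if sharp:
        # In-switch aggregation: each host injects its data once and
        # receives the reduced result once -- D up plus D down.
        return 2 * d_bytes
    # Host-based ring all-reduce: each host both sends and receives
    # 2 * (N - 1) / N * D, which approaches 2D each way for large N --
    # roughly twice the traffic of the SHARP case.
    return 2 * (2 * (n_hosts - 1) / n_hosts) * d_bytes
```

As the host count grows, the ratio between the two cases approaches 2, which is the intuition behind the "doubled effective bandwidth" claim.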
Adaptive Routing
InfiniBand operates as a software‑defined network managed by a Subnet Manager (SM). The SM configures switches to select the least‑loaded output port based on queue depth and path priority, distributing traffic across all links. Although this can cause out‑of‑order packet delivery, InfiniBand hardware includes mechanisms to reorder packets correctly.
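The two halves of that mechanism can be sketched in a few lines of Python. This is a toy model of the decision, not the Subnet Manager's actual algorithm; the tuple layout and tie-breaking rule are my own assumptions for illustration.

```python
# Toy adaptive-routing decision: among candidate egress ports for a
# destination, pick the shallowest queue, breaking ties by path priority.
def pick_egress_port(candidates):
    """candidates: list of (port_id, queue_depth, priority); lower wins."""
    return min(candidates, key=lambda p: (p[1], p[2]))[0]

# Receiver-side reordering by sequence number. Real adapters do this in
# hardware; shown here only to illustrate why out-of-order arrival is
# harmless to the application.
def reorder(packets):
    """packets: list of (sequence_number, payload) in arrival order."""
    return [payload for _, payload in sorted(packets)]
```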
Congestion Control
InfiniBand uses credit‑based flow control and a three‑stage Congestion Control Architecture (CCA). When a switch detects congestion, it sets the Forward Explicit Congestion Notification (FECN) bit. The destination adapter responds with a Backward Explicit Congestion Notification (BECN), prompting the source adapter to throttle packet injection.
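The three-stage loop can be simulated with a toy model. The drain rate, threshold, and rate-adjustment steps below are invented numbers for illustration, not InfiniBand's actual CCA parameters or wire protocol:

```python
# Toy FECN/BECN congestion-control loop: a switch marks FECN when its
# queue exceeds a threshold, the destination echoes BECN, and the source
# throttles its injection rate; when congestion clears, the source ramps up.
def run_cca(queue_threshold=8, drain_per_tick=9, steps=30):
    rate, queue, history = 10, 0, []
    for _ in range(steps):
        queue += rate                          # source injects at current rate
        queue = max(0, queue - drain_per_tick) # switch drains the queue
        fecn = queue > queue_threshold         # switch sets FECN on congestion
        if fecn:                               # BECN arrives: throttle injection
            rate = max(1, rate - 2)
        else:                                  # no congestion: ramp back up
            rate = min(10, rate + 1)
        history.append((rate, queue))
    return history
```

Running it shows the intended steady state: the injection rate oscillates around the drain rate and the queue stays bounded instead of growing without limit.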
Cut‑Through Forwarding
InfiniBand switches employ cut‑through forwarding: the switch reads only the packet header, determines the egress port, and begins forwarding before the rest of the packet has arrived. This keeps per‑hop latency on the order of 100 ns, which matters for AI workloads because per‑hop delays accumulate across every hop of every synchronization step.
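The latency difference versus store-and-forward is just serialization arithmetic. The packet, header, and link-speed values below are assumptions chosen for illustration, not measured switch latencies:

```python
# Serialization delay before a switch can start forwarding.
# At G Gb/s, one bit takes 1/G ns, so bytes * 8 / G gives nanoseconds.
def serialization_ns(bytes_on_wire, link_gbps):
    return bytes_on_wire * 8 / link_gbps

packet, header, link = 4096, 64, 400  # bytes, bytes, Gb/s (assumed values)

# Store-and-forward must receive the whole packet before forwarding;
# cut-through needs only the header to pick the egress port.
store_and_forward = serialization_ns(packet, link)  # ~81.9 ns per hop
cut_through = serialization_ns(header, link)        # ~1.3 ns per hop
```

Multiplied across several hops and millions of synchronization messages per training step, that per-hop difference is why cut-through matters.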
Shallow Buffer Architecture
InfiniBand switches are designed with shallow buffers (tens of megabytes) rather than the gigabytes found in deep‑buffer Ethernet switches. Because worst‑case queueing delay scales with buffer size, shallow buffers bound tail latency and jitter, which AI training workloads are especially sensitive to.
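The worst-case delay a full buffer can add is again simple arithmetic. The buffer sizes below are assumed round numbers for illustration, not specifications of any particular switch:

```python
# Worst-case queueing delay added by a completely full buffer:
# delay = buffer bits / link rate. With link_gbps in Gb/s, bits / Gb/s
# gives nanoseconds; dividing by 1000 converts to microseconds.
def max_buffering_delay_us(buffer_bytes, link_gbps):
    return buffer_bytes * 8 / (link_gbps * 1000)

shallow = max_buffering_delay_us(64 * 2**20, 400)  # 64 MiB buffer: ~1.3 ms
deep = max_buffering_delay_us(8 * 2**30, 400)      # 8 GiB buffer: ~172 ms
```

A full deep buffer can hide two orders of magnitude more queueing delay than a shallow one, which is exactly the tail latency a synchronous training step cannot tolerate.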
Fault Recovery
InfiniBand switches feature self‑healing capabilities that quickly detect and correct link failures, preventing costly retransmissions. This rapid recovery is especially beneficial for AI traffic, which is bursty and highly sensitive to network faults.
Source: NVIDIA InfiniBand: Advantages for AIGC (WeChat article)
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.