
NVIDIA Quantum‑2 InfiniBand Platform: Technical Overview, Q&A, and Deployment Guidance

This article explains the growing demand for high‑performance computing, introduces NVIDIA's Quantum‑2 InfiniBand platform with its high‑speed, low‑latency capabilities, provides a curated list of related technical articles, and offers an extensive Q&A covering compatibility, cabling, UFM, PCIe limits, and best‑practice deployment for AI and HPC workloads.


With the rapid advancement of big data and artificial intelligence, the need for high‑performance computing continues to rise. NVIDIA's Quantum‑2 InfiniBand platform delivers exceptional distributed‑compute performance, enabling high‑speed, low‑latency data transfer and processing.

Related technical articles include analyses of NVIDIA InfiniBand for AIGC, the evolution of switch technology in the era of large models, network configuration of the NVIDIA Blackwell platform, and CXL as a solution to the AI‑era memory wall.

The article also compiles a series of frequently asked questions (FAQ) about InfiniBand (IB) technology, providing concise answers:

Compatibility: The CX7 NDR200 QSFP112 port is compatible with HDR/EDR cables.

Connection to Quantum‑2 switches: CX7 NDR NICs use NVIDIA 400GBASE‑SR4 or 400GBASE‑DR4 optical modules, while QM97XX series switches use 800GBASE‑SR8 or 800GBASE‑DR8 modules, linked via 12‑fiber (MPO‑12) APC cables (multimode for the SR modules, single‑mode for the DR modules).

Port aggregation: A dual‑port 400G NIC cannot be bonded to achieve 800G due to PCIe 5.0 x16 bandwidth limits (max 512 Gbps).

Branch cable usage: The two 400G branches of an 800G‑to‑2×400G breakout cable must land on two separate servers, so that a single NIC is not asked to carry both branches.

Split‑cable configurations: Two types exist – optical modules that split 400G into 2×200G, and high‑speed branch cables that split 800G into 2×400G.

SuperPOD network wiring: With four NDR200 cards per server, do not link all four to one switch with a single 1×4 breakout cable; instead, use two 1×2 cables to two different switches, which complies with the SuperPOD wiring rules and preserves NCCL/SHARP performance (an illustrative wiring sketch appears after this list).

UFM deployment: A separate UFM‑enabled IB switch is recommended; running the UFM software on a management node is also possible, but that node should not double as a GPU compute node. Storage networks remain managed independently.

UFM vs. OFED: OpenSM, shipped with OFED, provides basic subnet management; UFM adds a graphical UI plus monitoring and other advanced fabric‑management features (a hedged example of querying UFM's REST API follows this list).

PCIe bandwidth limits: A PCIe Gen5 x16 slot offers up to 512 Gbps of raw bandwidth per direction (32 GT/s × 16 lanes), whereas PCIe Gen4 x16 caps at 256 Gbps (a worked calculation appears after this list).

Duplex mode: All IB NICs operate in full‑duplex; the concept of half‑duplex does not apply.

Cable distance: Optical modules with jumpers reach ~500 m; passive high‑speed cables ~3 m; active ACC cables ~5 m.

RoCE support: IB NICs can run RoCE for 400G Ethernet connections, though performance may vary; NVIDIA Spectrum‑X is recommended for such deployments (a sketch for checking whether a port is running InfiniBand or Ethernet appears after this list).

Module form factors: OSFP modules can be used with or without heatsinks; thick modules are acceptable on NICs with built‑in cooling.

UFM role: UFM runs on an independent node and can be made highly available across two servers; it should not be placed on compute nodes.
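
To make the PCIe ceiling in the items above concrete, here is a small Python sketch of the arithmetic: raw per‑direction bandwidth is simply the per‑lane transfer rate multiplied by the lane count, with 128b/130b encoding shaving roughly 1.5% off the usable payload.

```python
# Raw per-direction bandwidth of a PCIe x16 link, illustrating why a
# dual-port 400G NIC cannot be bonded up to a full 800G on PCIe Gen5.

ENCODING_EFFICIENCY = 128 / 130  # PCIe Gen3 and later use 128b/130b line coding

def pcie_bandwidth_gbps(transfer_rate_gts: float, lanes: int = 16) -> tuple[float, float]:
    """Return (raw, usable) per-direction bandwidth in Gbps for a PCIe link."""
    raw = transfer_rate_gts * lanes      # 1 GT/s per lane carries 1 Gbit/s on the wire
    usable = raw * ENCODING_EFFICIENCY   # payload rate after 128b/130b encoding
    return raw, usable

for gen, rate in (("Gen4", 16.0), ("Gen5", 32.0)):
    raw, usable = pcie_bandwidth_gbps(rate)
    print(f"PCIe {gen} x16: {raw:.0f} Gbps raw, ~{usable:.0f} Gbps usable")

# PCIe Gen4 x16: 256 Gbps raw, ~252 Gbps usable
# PCIe Gen5 x16: 512 Gbps raw, ~504 Gbps usable
```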
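
The SuperPOD wiring rule can be illustrated with a toy plan. All server, cable, and switch names below are hypothetical; the sketch only shows the shape of the rule: each server's four NDR200 ports are grouped into two 1×2 breakout cables, and each cable lands on a different leaf switch.

```python
# Toy SuperPOD-style wiring plan (all names hypothetical): group each server's
# four NDR200 ports into two 1x2 breakout cables terminating on two leaves.

from collections import defaultdict

LEAF_SWITCHES = ["leaf-1", "leaf-2"]                 # two different QM97xx leaf switches

def plan_server_wiring(server: str) -> list[dict]:
    """Return one wiring record per NDR200 port for a single server."""
    plan = []
    for port in range(4):                            # e.g. mlx5_0 .. mlx5_3
        cable = port // 2                            # ports 0-1 -> cable 0, ports 2-3 -> cable 1
        plan.append({
            "server": server,
            "hca_port": f"mlx5_{port}",
            "breakout_cable": f"{server}-1x2-{cable}",
            "leaf_switch": LEAF_SWITCHES[cable],     # each cable goes to a different leaf
        })
    return plan

def check_rule(plan: list[dict]) -> None:
    """Verify no single leaf switch receives all four ports of one server."""
    per_switch = defaultdict(int)
    for record in plan:
        per_switch[record["leaf_switch"]] += 1
    assert max(per_switch.values()) <= 2, "a single 1x4 cable to one switch breaks the rule"

plan = plan_server_wiring("dgx-01")
check_rule(plan)
for record in plan:
    print(record)
```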
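
As a quick check of whether a ConnectX port is currently running InfiniBand or Ethernet (the mode in which RoCE applies), the sketch below reads the link_layer attribute that the Linux RDMA stack exposes under /sys/class/infiniband. The device names in the comments are examples only.

```python
# Report the link layer (InfiniBand vs. Ethernet) of each local RDMA device,
# using the standard sysfs attributes exposed by the Linux ib_core driver.

from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")

def port_link_layers() -> dict[str, str]:
    """Map '<device>/<port>' (e.g. 'mlx5_0/1') to 'InfiniBand' or 'Ethernet'."""
    layers = {}
    if not SYSFS_IB.exists():
        return layers                                # no RDMA devices on this host
    for device in sorted(SYSFS_IB.iterdir()):
        for port in sorted((device / "ports").iterdir()):
            link_layer = (port / "link_layer").read_text().strip()
            layers[f"{device.name}/{port.name}"] = link_layer
    return layers

if __name__ == "__main__":
    for name, layer in port_link_layers().items():
        print(f"{name}: {layer}")   # e.g. 'mlx5_0/1: InfiniBand' or 'mlx5_2/1: Ethernet'
```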
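
Finally, a minimal sketch of talking to UFM programmatically, assuming a UFM server reachable at a hypothetical host name and the commonly documented /ufmRest prefix; the exact endpoint, response fields, and authentication should be confirmed against the UFM REST API documentation for the deployed version.

```python
# Minimal sketch of pulling fabric inventory from a UFM server over REST.
# Host name, credentials, endpoint path, and response fields are assumptions
# for illustration; consult the UFM REST API documentation for specifics.

import requests

UFM_HOST = "https://ufm.example.local"      # hypothetical UFM management node
ENDPOINT = "/ufmRest/resources/systems"     # assumed inventory endpoint

def list_fabric_systems(user: str, password: str) -> list[dict]:
    """Return the systems (switches, HCAs) that UFM currently knows about."""
    response = requests.get(
        UFM_HOST + ENDPOINT,
        auth=(user, password),
        verify=False,        # many lab UFM installs use self-signed certificates
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for system in list_fabric_systems("admin", "change-me"):
        print(system.get("system_name"), system.get("type"))
```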

Additional sections discuss the differences between OSFP and QSFP112, the rationale behind limited OSFP ports on switches, and the suitability of various cable types for both IB and Ethernet 400G links.

The article concludes with promotional information about a comprehensive technical documentation package for architects, offering PDFs and PPTs covering a wide range of infrastructure topics.

Tags: network architecture, AI, high-performance computing, GPU, NVIDIA, InfiniBand
Written by Architects' Tech Alliance

Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.
