Tagged articles
91 articles
Architects' Tech Alliance
May 14, 2026 · Artificial Intelligence

Jensen Huang’s China Visit: Could It Revive GPU Prospects? Inside Nvidia’s DGX H200 Cluster Design

The article reviews the US‑approved export of Nvidia's DGX H200, the lack of deliveries, Jensen Huang’s surprise China trip that may speed approvals, and then provides a detailed technical breakdown of the DGX H200 cluster’s compute and storage networking, topology, optical link choices, and cable count estimates.

AI Infrastructure · DGX H200 · Data Center Networking
0 likes · 8 min read
Architects' Tech Alliance
Oct 12, 2025 · Artificial Intelligence

How InfiniBand Powers AI Training: Deep Dive into RDMA, RoCEv2, and High‑Speed Interconnects

This article explains how InfiniBand’s architecture, native RDMA, GPUDirect, and evolving bandwidth enable ultra‑low‑latency, high‑throughput communication for AI model training, compares it with Ethernet, and details the role of RoCEv2 and other high‑performance interconnect technologies.

AI training · GPU interconnect · High‑Performance Networking
0 likes · 9 min read
Architects' Tech Alliance
Jul 29, 2025 · Artificial Intelligence

Why NVIDIA Spectrum‑X and Quantum InfiniBand Are Redefining AI Data Center Networks

The article explains how AI‑driven data center networks must handle massive distributed workloads, why traditional Ethernet falls short, and how NVIDIA’s Spectrum‑X Ethernet and Quantum InfiniBand use lossless RDMA, dynamic routing, advanced congestion control, and hardware‑accelerated collective communication to deliver the bandwidth, latency, and scalability required for generative AI and large‑scale model training.

AI · InfiniBand · Nvidia
0 likes · 8 min read
Architects' Tech Alliance
Jul 19, 2025 · Artificial Intelligence

Best GPU Cluster Network for Large‑Scale AI: NVLink, InfiniBand, RoCE & DDC

This article compares the main networking technologies used in large‑scale AI GPU clusters—NVLink, InfiniBand, RoCE Ethernet, and the emerging DDC full‑schedule fabric—examining latency, lossless transmission, congestion control, cost, power and scalability to help engineers choose the optimal solution for training massive language models.

AI training · DDC · Data center
0 likes · 15 min read
Architects' Tech Alliance
Jul 7, 2025 · Operations

Choosing the Right AI Data Center Network: InfiniBand vs RoCE

This article outlines the high‑performance networking requirements for AI data center training, compares InfiniBand and RoCE solutions, discusses their advantages in bandwidth, latency, scalability and cost, and provides design guidelines for building scalable, low‑latency, non‑blocking AI‑centric network architectures.

AI · Data center · High‑performance computing
0 likes · 10 min read
Architects' Tech Alliance
May 31, 2025 · Artificial Intelligence

GPU Cluster Scaling: Understanding Scale‑Up and Scale‑Out for AI Pods

This article explains the concepts of AI Pods and GPU clusters, compares vertical (scale‑up) and horizontal (scale‑out) expansion, describes XPU types, discusses internal and inter‑pod communication, and evaluates the benefits and drawbacks of each scaling approach along with relevant networking technologies.

AI Pods · GPU · InfiniBand
0 likes · 10 min read
Architects' Tech Alliance
May 26, 2025 · Fundamentals

Understanding RDMA, InfiniBand, and RoCEv2 for High‑Performance Distributed Training

The article explains how distributed AI training performance depends on reducing inter‑card communication latency, introduces RDMA technology and its implementations (InfiniBand, RoCEv2, iWARP), compares their latency and scalability against traditional TCP/IP, and outlines the hardware components and trade‑offs of InfiniBand and RoCEv2 networks.

Distributed Training · InfiniBand · RDMA
0 likes · 12 min read
Architects' Tech Alliance
May 23, 2025 · Artificial Intelligence

Why High‑Performance Networks Are Critical for Large‑Scale AI Model Training

The whitepaper explains that AI model training and inference rely on massive data computation, with model sizes reaching billions of parameters, demanding low‑latency, high‑bandwidth, stable, scalable, and manageable networks; it compares RDMA‑based InfiniBand and RoCE solutions and offers design recommendations for future AI compute clusters.

AI · High‑Performance Networking · InfiniBand
0 likes · 10 min read
Architects' Tech Alliance
May 15, 2025 · Industry Insights

Why InfiniBand Still Beats Ethernet: Deep Dive into RDMA, Omni‑Path, and Protocol Layers

This article provides a comprehensive technical analysis of InfiniBand architecture, its protocol stack, comparison with Ethernet‑based RDMA solutions like RoCE and iWARP, and an overview of Omni‑Path, highlighting performance advantages, design trade‑offs, and practical limitations.

High‑performance computing · InfiniBand · Omni‑Path
0 likes · 19 min read
Linux Kernel Journey
May 8, 2025 · Artificial Intelligence

How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network

DeepSeek engineers highlighted Tencent’s open‑source TRMT and DeepEP contributions that boost GPU‑to‑GPU communication by up to 300%, double RoCE performance and add a further 30% gain on InfiniBand, while addressing lane‑utilization and CPU‑control bottlenecks through three targeted optimizations.

DeepEP · DeepSeek · GPU communication
0 likes · 6 min read
Tencent Tech
May 7, 2025 · Artificial Intelligence

How Tencent’s DeepEP Doubles GPU Communication Speed on RoCE Networks

Tencent engineers highlighted a massive speedup in DeepSeek’s open‑source DeepEP communication framework, revealing how their TRMT‑based optimizations—dynamic multi‑QP topology awareness, IBGDA‑driven CPU‑bypass, and atomic signaling—boost RoCE network throughput up to 300% and add another 30% gain when applied to InfiniBand, effectively doubling GPU communication performance for large AI models.

AI model training · DeepEP · GPU communication
0 likes · 8 min read
Architects' Tech Alliance
Mar 29, 2025 · Industry Insights

Why Network Becomes the New Bottleneck for AI Training and How InfiniBand vs RoCE Compare

AI large‑model training relies on GPU clusters, generating massive inter‑node traffic that turns network performance into the primary bottleneck, prompting a detailed comparison of InfiniBand and RoCE protocols, their histories, strengths, limitations, and the need for next‑generation network chip architectures.

AI · Data center · HPC
0 likes · 5 min read
AI Cyberspace
Feb 13, 2025 · Fundamentals

Understanding InfiniBand RDMA: Architecture, Advantages, and NVIDIA Quantum-2

InfiniBand RDMA, designed as a networked extension of the server bus, offers high bandwidth and ultra‑low latency through zero‑copy, kernel‑bypass communication. The article covers its layered architecture (L1‑L5), hardware components such as the Quantum‑2 switch and ConnectX‑7 RNIC, SHARP acceleration, and the supporting Verbs API and OFED stack.

InfiniBand · Quantum-2 · RDMA
0 likes · 25 min read
Architects' Tech Alliance
Dec 8, 2024 · Industry Insights

Why InfiniBand Still Beats Ethernet: Deep Dive into RDMA, Omni‑Path, and iWARP

This article provides a comprehensive technical analysis of InfiniBand’s protocol layers, topology, and performance advantages, compares Omni‑Path’s architecture, explains RDMA fundamentals, and details Ethernet‑based RDMA protocols such as RoCE and iWARP, highlighting their trade‑offs and use cases.

High-Performance Computing · InfiniBand · Omni‑Path
0 likes · 18 min read
BirdNest Tech Talk
Dec 1, 2024 · Fundamentals

How to Exchange RDMA Connection Parameters: Methods, Pros, and Pitfalls

Establishing an RDMA connection requires exchanging key parameters such as LID, QP number, and memory keys, and this article systematically outlines the essential information, compares six exchange methods—from static configuration to distributed services—and evaluates their advantages, drawbacks, and suitable scenarios.

Distributed Systems · InfiniBand · Networking
0 likes · 7 min read
Architects' Tech Alliance
Nov 7, 2024 · Industry Insights

Why RDMA, InfiniBand, and RoCE Are Redefining High‑Performance Data Center Networks

This article examines the evolution from the OSI and TCP/IP models to RDMA‑based technologies, compares traditional three‑tier and leaf‑spine architectures, analyzes NVIDIA SuperPOD designs, and evaluates Ethernet, InfiniBand, and RoCE switches to guide high‑throughput, low‑latency data‑center networking decisions.

Data Center Networking · High‑performance computing · InfiniBand
0 likes · 13 min read
Architects' Tech Alliance
Oct 11, 2024 · Industry Insights

Why Common Network Misconceptions Hurt AI Performance and How to Fix Them

The article explains how prevalent misunderstandings in data‑center network design—such as altering end‑to‑end link speeds, overlooking switch radix, and choosing inappropriate buffering architectures—can increase latency and reduce AI workload efficiency, and it outlines the benefits of InfiniBand, cut‑through switching, scalable radix, and resilient AI‑cloud management solutions.

AI · Buffer Architecture · Cut-through Switching
0 likes · 9 min read
Architects' Tech Alliance
Sep 25, 2024 · Fundamentals

NVIDIA Quantum‑2 InfiniBand Platform: Technical Overview, Q&A, and Deployment Guidance

This article explains the growing demand for high‑performance computing, introduces NVIDIA's Quantum‑2 InfiniBand platform with its high‑speed, low‑latency capabilities, provides a curated list of related technical articles, and offers an extensive Q&A covering compatibility, cabling, UFM, PCIe limits, and best‑practice deployment for AI and HPC workloads.

AI · GPU · InfiniBand
0 likes · 11 min read
Architects' Tech Alliance
Sep 8, 2024 · Artificial Intelligence

Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training

The article surveys the network architectures and congestion‑control techniques used in massive GPU clusters—such as ByteDance’s MegaScale, Baidu HPN, Alibaba HPN7, and Tencent Xingmai 2.0—highlighting how high‑bandwidth, low‑latency designs and advanced RDMA technologies enable training of trillion‑parameter multimodal AI models.

Data center · GPU clusters · HPN
0 likes · 11 min read
Architects' Tech Alliance
Aug 18, 2024 · Artificial Intelligence

RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies—including InfiniBand, RoCE, and iWARP—provide high‑throughput, low‑latency, CPU‑free data transfer for massive generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies to optimize AI‑focused data‑center networks.

AI training · High‑Performance Computing · InfiniBand
0 likes · 11 min read
Architects' Tech Alliance
Aug 12, 2024 · Industry Insights

How Shanghai Jiao Tong University Built China’s First Campus‑Scale ARM HPC Cluster with Huawei Kunpeng

This article details Shanghai Jiao Tong University's design and deployment of the nation’s first campus‑level high‑performance computing cluster based on Huawei Kunpeng 920 ARM processors, covering background, user challenges, unified storage, network topology, containerized software delivery, and performance validation with LAMMPS and GATK.

ARM · HPC · InfiniBand
0 likes · 12 min read
Architects' Tech Alliance
Jul 15, 2024 · Industry Insights

Why Ethernet Is Overtaking InfiniBand in AI and Data Center Networks

The article analyzes the 2022 global and Chinese switch markets, explains how distributed computing and generative AI workloads rely on high‑performance switches, compares Ethernet and InfiniBand technologies—including bandwidth, latency, and cost factors—and outlines major vendor strategies and future trends in the networking industry.

AI · Data center · InfiniBand
0 likes · 14 min read
Architects' Tech Alliance
Jul 7, 2024 · Operations

Designing High‑Performance Cluster Networks for AI Large Models: InfiniBand vs RoCE

The article analyzes the networking challenges of AI super‑large models, comparing InfiniBand and RoCE technologies, and presents design guidelines for ultra‑scale, high‑bandwidth, low‑latency, and highly stable cluster interconnects to maximize GPU utilization and overall training efficiency.

AI · GPU interconnect · High‑Performance Computing
0 likes · 14 min read
Architects' Tech Alliance
May 23, 2024 · Cloud Computing

Design and Comparison of High‑Performance Cloud Data Center Networks for AI Computing

This article analyzes traditional cloud data center network limitations for AI workloads and compares various high‑bandwidth, low‑latency architectures—including two‑layer and three‑layer fat‑tree designs, InfiniBand, and RoCE—providing best‑practice recommendations for building scalable, non‑blocking AI‑Pool networks.

AI computing · Fat-Tree · GPU clusters
0 likes · 12 min read
Architects' Tech Alliance
May 19, 2024 · Industry Insights

InfiniBand vs RoCEv2: Which High‑Performance Network Wins AI Compute?

With AI models growing to billions of parameters, the choice of high‑performance interconnect—InfiniBand or RoCEv2—directly impacts training speed, scalability, latency, and operational complexity, and this article analyzes their architectures, performance metrics, vendor ecosystems, and suitability for large‑scale AI clusters.

AI · Distributed Training · High‑performance computing
0 likes · 13 min read
Architects' Tech Alliance
May 11, 2024 · Industry Insights

Why Network Interconnects Are the New Bottleneck for Large‑Model AI Training

The rapid growth of AI large‑model training and inference is driving unprecedented demand for compute and high‑speed networking, prompting a shift from traditional GPU clusters to super‑pooled intelligent computing centers that must balance multiple intra‑ and inter‑node interconnect solutions such as NVLink, OAM/UBB, InfiniBand and RoCEv2.

AI · Data center · InfiniBand
0 likes · 6 min read
Architects' Tech Alliance
May 9, 2024 · Industry Insights

Why RoCE Is Reshaping High‑Performance Computing Networks

The article provides a detailed technical analysis of RoCE (RDMA over Converged Ethernet), its two protocol versions, packet overhead, congestion‑control mechanisms, Soft‑RoCE implementation, and the challenges and performance implications of deploying RoCE in modern HPC environments compared to InfiniBand and traditional Ethernet solutions.

HPC · InfiniBand · RDMA
0 likes · 17 min read
Architects' Tech Alliance
May 5, 2024 · Artificial Intelligence

Why InfiniBand Is the Secret Weapon for AIGC Training Performance

The article examines how InfiniBand’s specialized features—collective communication, in‑network computing, adaptive routing, congestion control, cut‑through forwarding, shallow buffering, and self‑healing—are optimized for large‑scale AI‑generated content (AIGC) training, delivering higher bandwidth, lower latency, and greater fault tolerance than Ethernet alternatives.

AI training · AIGC · Adaptive routing
0 likes · 10 min read
Architects' Tech Alliance
May 3, 2024 · Fundamentals

From OSI Model to RDMA: High‑Performance Networking, Leaf‑Spine Architecture, and Switch Selection

This article examines the evolution of network protocols from the OSI seven‑layer model and TCP/IP to RDMA technologies such as InfiniBand and RoCE, compares traditional three‑tier and leaf‑spine data‑center designs, and evaluates Ethernet, InfiniBand, and RoCE switches for high‑throughput, low‑latency HPC environments.

Data center architecture · InfiniBand · Leaf-Spine
0 likes · 13 min read
Architects' Tech Alliance
May 1, 2024 · Industry Insights

How NVIDIA’s Blackwell Platform Redefines AI Supercomputing Networks

The article examines NVIDIA’s Blackwell platform network architecture, detailing the fifth‑generation NVLink, sixth‑generation PCIe, 800 Gb/s InfiniBand and Ethernet adapters, the DGX B200 and GB200 configurations, new IB and Ethernet switches, and the implications of increased optical module demands for large‑scale AI clusters.

AI supercomputing · Blackwell · DGX
0 likes · 10 min read
Architects' Tech Alliance
Apr 28, 2024 · Industry Insights

Why RoCE v2 Is Outpacing InfiniBand for Modern Data Centers

This article provides an in‑depth technical analysis of RoCE v2, covering its architecture, NIC requirements, and detailed comparisons with InfiniBand across physical layers, protocol stacks, switching, congestion handling, routing, and topology, while also highlighting the UEC alliance’s new transport protocol initiative.

High‑performance computing · InfiniBand · RDMA
0 likes · 12 min read
360 Smart Cloud
Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training · Cloud Native · High‑Performance Networking
0 likes · 12 min read
Architects' Tech Alliance
Apr 21, 2024 · Fundamentals

Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training

This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.

AI · GPU · High‑Performance Computing
0 likes · 11 min read
Architects' Tech Alliance
Apr 18, 2024 · Industry Insights

Why InfiniBand Dominates Modern HPC: Speed, Latency, and Scalability Explained

This article provides a comprehensive technical overview of InfiniBand, covering its rapid adoption in top supercomputers, detailed performance advantages such as ultra‑high bandwidth, CPU offload, sub‑microsecond latency, flexible scalability, QoS, SHARP acceleration, and a comparison with Ethernet, Fibre Channel, and Omni‑Path, while also outlining HDR switch and NIC product families.

Data center · HDR · HPC
0 likes · 20 min read
Architects' Tech Alliance
Apr 3, 2024 · Industry Insights

InfiniBand vs. RoCE v2: Choosing the Best Network for AI Data Centers

This article provides a detailed technical comparison between InfiniBand and RoCE v2, covering architecture, lossless transmission, adaptive routing, major vendors, performance, scalability, operational complexity, and cost considerations to help AI data center architects select the most suitable high‑performance network solution.

AI data center · High‑Performance Networking · InfiniBand
0 likes · 13 min read
Architects' Tech Alliance
Feb 14, 2024 · Industry Insights

Why InfiniBand Is Outpacing Ethernet in High‑Performance Computing

This article provides a comprehensive overview of InfiniBand technology, covering its history, architecture, packet format, layer functions, switching mechanisms, and performance advantages over Ethernet, while highlighting its rapid growth and future prospects in HPC environments.

Comparison · High‑performance computing · InfiniBand
0 likes · 15 min read
Architects' Tech Alliance
Dec 24, 2023 · Artificial Intelligence

Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

This article examines the main GPU/TPU cluster networking options—including NVLink, InfiniBand, RoCE Ethernet Fabric, and DDC full‑schedule networks—explaining their latency, lossless transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

GPU networking · InfiniBand · LLM training
0 likes · 18 min read
Architects' Tech Alliance
Dec 6, 2023 · Artificial Intelligence

The Relationship Between Switches, Network Protocols, and AI in Modern Data Centers

This article explains how network protocols and switch architectures—including OSI layers, TCP/IP, RDMA, InfiniBand, RoCE, and leaf‑spine designs—support high‑throughput, low‑latency AI and HPC workloads, compares Ethernet and InfiniBand markets, and examines NVIDIA’s Spectrum‑X and SuperPOD solutions.

AI · Data Center Networking · InfiniBand
0 likes · 11 min read
Architects' Tech Alliance
Aug 10, 2023 · Industry Insights

InfiniBand vs RoCEv2: Which Network Powers AI Model Training?

This article examines the architecture of AI compute clusters, explaining offline training and inference pipelines, the role of RDMA, and the technical differences between InfiniBand and RoCEv2—including latency, bandwidth, scalability, cost, and vendor considerations—to help engineers choose the optimal high‑performance network for large‑model training.

AI compute · Distributed Training · High‑Performance Networking
0 likes · 13 min read
Architects' Tech Alliance
Jul 24, 2023 · Operations

NVIDIA Quantum‑2 InfiniBand Platform Overview and Technical Q&A

This article introduces NVIDIA's Quantum‑2 InfiniBand solution for high‑performance computing, explains its HDR 200 Gb/s architecture, and provides a comprehensive Q&A covering cable compatibility, SuperPod networking, UFM management, PCIe bandwidth, and RDMA support for both IB and Ethernet environments.

InfiniBand · PCIe · RDMA
0 likes · 9 min read
Alibaba Cloud Infrastructure
Jun 16, 2023 · Cloud Computing

Predictable Network and High‑Performance Network Architecture for Large‑Scale AI Training

The article examines how Alibaba Cloud’s Predictable Network, InfiniBand versus Ethernet trade‑offs, and the HPN high‑performance network design together address the extreme bandwidth, latency, scalability and reliability requirements of modern large‑model AI training workloads in cloud data centers.

AI training · High‑performance computing · InfiniBand
0 likes · 24 min read
Baidu Geek Talk
May 10, 2023 · Artificial Intelligence

Baidu's AI Infrastructure for Large-Scale LLM Training: Architecture, Challenges, and Optimization

Baidu’s AI infrastructure combines a massive InfiniBand‑linked GPU cluster, Kunlun chips, the PaddlePaddle framework, and the Wenxin model suite with 4D hybrid parallelism, elastic fault tolerance, and a two‑stage training pipeline to overcome computation, memory, and communication walls, delivering world‑leading MLPerf performance for large‑scale LLMs.

GPU cluster · InfiniBand · Model Training Optimization
0 likes · 15 min read
Architects' Tech Alliance
Apr 19, 2023 · Fundamentals

Implementation and Performance Evaluation of a Domestic ARM‑Based High‑Performance Computing Cluster at Shanghai Jiao Tong University

The article describes how Shanghai Jiao Tong University built a campus‑level HPC platform using Huawei Kunpeng 920 ARM processors, detailing system architecture, unified storage and scheduling, containerized software deployment, network topology, Lustre file system integration, and performance results of LAMMPS and GATK compared with traditional x86 clusters.

ARM · HPC · InfiniBand
0 likes · 11 min read
Open Source Linux
Apr 14, 2023 · Fundamentals

Why InfiniBand Is the Fastest Growing High‑Speed Interconnect for HPC

This article provides a comprehensive overview of InfiniBand technology, covering its history, architecture, packet structure, layer hierarchy, switching mechanisms, and performance advantages over Ethernet, highlighting its role as a low‑latency, high‑bandwidth solution for high‑performance computing.

High‑performance computing · InfiniBand · RDMA
0 likes · 14 min read
Architects' Tech Alliance
Mar 26, 2023 · Fundamentals

Comprehensive Overview of InfiniBand Technology and Architecture

This article provides an in‑depth examination of InfiniBand, covering its rapid development as a high‑bandwidth, low‑latency interconnect technology, the InfiniBand Trade Association, detailed packet structures, layered architecture, switching mechanisms, and a comparative analysis with Ethernet, highlighting its advantages for high‑performance computing.

Data Transfer · HPC · High‑performance computing
0 likes · 14 min read
Refining Core Development Skills
Oct 24, 2022 · Fundamentals

Low‑Latency Network Architecture for High‑Frequency Trading

This article explains how high‑frequency trading firms achieve ultra‑low network latency by combining proximity deployment, dedicated links, microwave transmission, InfiniBand, low‑latency switches, kernel bypass, RDMA, TCP offload engines and FPGA acceleration, and summarizes the impact of each technique on overall request latency.

FPGA · InfiniBand · Kernel Bypass
0 likes · 16 min read
Architects' Tech Alliance
May 14, 2022 · Fundamentals

High‑Performance Computing Network Solutions: RoCE v2, RDMA, and InfiniBand Overview

The article explains how high‑performance computing (HPC) networks overcome TCP/IP limitations by using RDMA‑based technologies such as RoCE v1/v2 and InfiniBand, detailing their architectures, advantages, vendor implementations, and cost‑effective migration to Ethernet‑based solutions for GPU‑driven workloads.

HPC · HighPerformanceComputing · InfiniBand
0 likes · 7 min read
Architects' Tech Alliance
Mar 4, 2022 · Operations

What Is InfiniBand RDMA and How to Configure It on RHEL 8?

This guide explains the fundamentals of InfiniBand and RDMA, details the InfiniBand Verbs API, outlines the steps required for kernel data handling, and provides practical configuration instructions for RoCE, IPoIB, and the subnet manager on Red Hat Enterprise Linux 8.

IPoIB · InfiniBand · Network Configuration
0 likes · 11 min read
Architects' Tech Alliance
Apr 28, 2021 · Industry Insights

Why InfiniBand Is Outpacing Ethernet in High‑Performance Computing

The article provides a comprehensive technical overview of InfiniBand, covering its history, standards, architecture layers, packet format, performance advantages, and a detailed comparison with Ethernet, highlighting why it has become the preferred high‑speed interconnect for HPC workloads.

Data Transfer · High‑performance computing · InfiniBand
0 likes · 15 min read
Architects' Tech Alliance
Mar 7, 2021 · Fundamentals

Understanding RDMA: InfiniBand, iWARP, and RoCE Technologies and Their Differences

This article explains Remote Direct Memory Access (RDMA), its origins in InfiniBand, the Ethernet‑based variants iWARP and RoCE (including RoCEv1 and RoCEv2), compares their architectures, performance characteristics, and deployment requirements for high‑performance computing and data‑center networks.

High‑Performance Networking · InfiniBand · RDMA
0 likes · 11 min read
Architects' Tech Alliance
Apr 3, 2020 · Industry Insights

Why InfiniBand Beats TCP/IP: Deep Dive into Architecture and Socket Direct

This article explains how InfiniBand’s RDMA‑based architecture, layered protocol stack, and Mellanox Socket Direct technology deliver far higher bandwidth, lower latency, and better CPU efficiency than traditional TCP/IP networks, and it presents performance test results that show up to an 80% latency reduction.

Fabric · High‑performance computing · InfiniBand
0 likes · 11 min read
Architects' Tech Alliance
Jul 18, 2019 · Fundamentals

Overview of OpenFabrics Enterprise Distribution (OFED) and InfiniBand Software Architecture

This article provides a comprehensive overview of the OpenFabrics Enterprise Distribution (OFED) and the InfiniBand software architecture, covering its history, components, middleware, protocol stack, and how it enables high‑performance, low‑latency networking for IP, storage, and compute applications.

High-Performance Computing · InfiniBand · Linux
0 likes · 11 min read
Architects' Tech Alliance
Jun 13, 2019 · Fundamentals

Understanding OpenFabrics Enterprise Distribution (OFED) and the InfiniBand Software Architecture

This article explains the OpenFabrics Enterprise Distribution (OFED) ecosystem, its history, the InfiniBand hardware and software stack, key protocols such as IPoIB, SDP and iSER, and how these technologies enable high‑performance, low‑latency networking across Linux, Windows and virtualized environments.

High-Performance Computing · InfiniBand · Linux
0 likes · 12 min read
Architects' Tech Alliance
Jun 9, 2019 · Fundamentals

Detailed Overview of NVMe Architecture and NVMe over Fabrics

This article provides a comprehensive technical overview of NVMe architecture and the NVMe‑over‑Fabric extensions (InfiniBand, RoCE, iWARP, Fibre Channel, and TCP), explaining their RDMA‑based advantages, protocol differences, and practical considerations for data‑center storage deployments.

Fibre Channel · InfiniBand · NVMe
0 likes · 12 min read
Architects' Tech Alliance
Mar 11, 2019 · Fundamentals

Understanding Mellanox InfiniBand Technology and Its Role in High‑Performance Computing

The article explains Nvidia's $6.9 billion acquisition of Mellanox, outlines Mellanox's history and product portfolio, and provides a detailed overview of InfiniBand architecture, network topologies, protocols, and related software stacks such as OFED, highlighting their importance for data‑center, HPC, and cloud environments.

Data center · High‑Performance Computing · InfiniBand
0 likes · 14 min read
Architects' Tech Alliance
Feb 3, 2019 · Fundamentals

Understanding GPUDirect RDMA: Principles, Implementation, and Performance

This article explains the background of GPU communication, introduces DMA and RDMA fundamentals, describes how GPUDirect RDMA enables direct GPU-to-GPU memory access across machines, and presents performance results showing reduced latency and increased bandwidth for distributed deep‑learning training.

Deep Learning · GPU communication · GPUDirect
0 likes · 7 min read
Architects' Tech Alliance
Jan 13, 2019 · Fundamentals

Overview of InfiniBand Technology and Its Protocol Stack

This article provides a comprehensive overview of InfiniBand technology, covering its open‑standard architecture, history, OFED software stack, protocol layers, performance advantages over traditional storage networks, and its primary use cases in high‑performance computing and data‑center environments.

High-Performance Computing · InfiniBand · Networking
0 likes · 11 min read
Architects' Tech Alliance
Jan 10, 2019 · Fundamentals

Understanding RDMA: Principles, Advantages, and Implementation Details

This article explains the challenges of high‑performance computing and big‑data workloads on traditional TCP/IP stacks, introduces RDMA technology, its variants (InfiniBand, RoCE, iWARP), key protocols, hardware components, and how it achieves ultra‑low latency and high throughput with minimal CPU involvement.

InfiniBand · Network Protocols · RDMA
0 likes · 13 min read
Architects' Tech Alliance
Nov 25, 2018 · Industry Insights

Why RDMA Makes NVMe‑over‑Fabric Faster: A Deep Dive into Fabrics, FC, InfiniBand, RoCE and TCP

The article examines how NVMe‑over‑Fabric extends NVMe beyond PCIe using various fabrics—FC, InfiniBand, RoCE v2, iWARP and TCP—highlighting RDMA’s zero‑copy, kernel‑bypass and CPU‑free advantages, and comparing protocol differences, performance trade‑offs, and the evolution toward NVMe/TCP.

Fibre Channel · InfiniBand · NVMe
0 likes · 13 min read
Architects' Tech Alliance
Nov 7, 2018 · Fundamentals

Survey of Network Types and Vendors in High‑Performance Computing (HPC) Environments

The Intersect360 2016 survey of 474 HPC sites covering 723 compute systems, 633 storage systems and 638 LANs reveals that Ethernet and InfiniBand dominate system interconnect, storage and LAN networks, with Mellanox and Cisco accounting for over half of installations, while newer technologies such as 10 GE, 40 G, 56 G InfiniBand and Omni‑Path show evolving market shares driven by bandwidth and latency demands.

Cisco · HPC · InfiniBand
0 likes · 10 min read
Architects' Tech Alliance
Oct 31, 2018 · Fundamentals

Understanding InfiniBand: Architecture, Protocols, and Performance

InfiniBand is a high‑performance network protocol that uses credit‑based flow control and switched fabric architecture to provide low latency, high bandwidth, and reliable data transfer, offering advantages over TCP/IP such as reduced packet loss, efficient RDMA, and support for various upper‑layer protocols.

High‑performance computing · InfiniBand · RDMA
0 likes · 10 min read
Architects' Tech Alliance
Oct 28, 2018 · Fundamentals

Understanding OpenFabrics Enterprise Distribution (OFED) and InfiniBand Software Architecture

This article provides a comprehensive overview of OpenFabrics Enterprise Distribution (OFED), its history, component stack, and the layered InfiniBand software architecture, explaining how various protocols such as IPoIB, SDP, and iSER enable high‑performance, low‑latency networking for Linux and Windows applications.

High-Performance Computing · InfiniBand · Linux
0 likes · 8 min read
Architects' Tech Alliance
Apr 22, 2018 · Fundamentals

An Overview of Remote Direct Memory Access (RDMA): Principles, Comparisons, and Implementations

This article provides a comprehensive overview of Remote Direct Memory Access (RDMA), detailing its underlying principles, performance advantages over traditional TCP/IP, various protocol families such as InfiniBand, RoCE, and iWARP, and their respective hardware and software requirements.

High‑performance computing · InfiniBand · Low latency
0 likes · 9 min read
Architects' Tech Alliance
Apr 8, 2018 · Fundamentals

Understanding High‑Performance Computing (HPC): Market Size, Technologies, Metrics, and Core Components

This article provides a comprehensive overview of high‑performance computing, covering its rapid market growth, definition, classification into high‑throughput and distributed computing, key hardware components such as CPUs, GPUs, memory types, networking technologies like InfiniBand, performance metrics, benchmarking tools, and parallel file systems.

GPU · HPC · High‑performance computing
0 likes · 11 min read
Architects' Tech Alliance
Jun 23, 2017 · Fundamentals

Analysis of Intel Omni-Path vs. InfiniBand: Architecture, Products, and Performance

The article provides a detailed analysis of Intel’s Omni-Path and InfiniBand technologies, covering their histories, architectural differences, product lineups, performance benchmarks, and market positioning within high‑performance computing; it also examines the role of the InfiniBand Trade Association, the impact of acquisitions by Intel and Mellanox, and the future prospects of both interconnect solutions.

High-Performance Computing · InfiniBand · Intel
0 likes · 9 min read
Architects' Tech Alliance
Jun 8, 2017 · Cloud Computing

Mellanox InfiniBand Technology Overview: Architecture, Protocol Stack, and Product Portfolio

This article provides a comprehensive overview of Mellanox's InfiniBand solutions, covering the company's background, network architecture, routing algorithms, Fat‑Tree topology, the OFED software stack, management tools, MPI support, adapters, switches, routers, cables, and related products for high‑performance computing and cloud data centers.

Data Center Networking · Fat-Tree · High-Performance Computing
0 likes · 21 min read
Architects' Tech Alliance
Jun 5, 2017 · Fundamentals

Overview of InfiniBand Technology: Development, Advantages, Architecture, Protocol Layers, and Applications

This article provides a comprehensive overview of InfiniBand technology, covering its history, performance advantages over traditional interconnects, architectural concepts, layered protocol specifications, and typical use cases in high‑performance computing and data‑center environments.

Data center · High-Performance Computing · InfiniBand
0 likes · 14 min read