Tagged articles
55 articles
SuanNi
May 8, 2026 · Artificial Intelligence

How OpenAI’s MRC Protocol Redesigns Communication for 100,000‑GPU Clusters

OpenAI, together with AMD, Broadcom, Intel, Microsoft and Nvidia, introduced the Multipath Reliable Connection (MRC) protocol, which splits a single 800 Gb/s link into eight 100 Gb/s planes. The design enables full‑mesh connectivity for more than 100,000 GPUs with fewer switches, lower cost and higher resilience, and its dynamic load balancing eliminates the impact of congestion and hardware failures during large‑scale AI training.

AI networking · GPU clusters · MRC
12 min read

AI Info Trend
Aug 21, 2025 · Industry Insights

How BCG's C‑Curve Can Turn Low‑Return Companies into Growth Engines

BCG’s new “C‑Curve” framework reveals that roughly one‑seventh of listed companies suffer persistently low ROCE, and outlines a three‑step path—shrink to core, improve margins, then grow—that enables leaders to revive performance and create lasting shareholder value.

BCG · C Curve · Corporate Strategy
7 min read

Architects' Tech Alliance
Jul 19, 2025 · Artificial Intelligence

Best GPU Cluster Network for Large‑Scale AI: NVLink, InfiniBand, RoCE & DDC

This article compares the main networking technologies used in large‑scale AI GPU clusters (NVLink, InfiniBand, RoCE Ethernet, and the emerging fully scheduled DDC fabric), examining latency, lossless transmission, congestion control, cost, power and scalability to help engineers choose the optimal solution for training massive language models.

AI training · DDC · Data center
15 min read

Architects' Tech Alliance
Jul 7, 2025 · Operations

Choosing the Right AI Data Center Network: InfiniBand vs RoCE

This article outlines the high‑performance networking requirements for AI data center training, compares InfiniBand and RoCE solutions, discusses their advantages in bandwidth, latency, scalability and cost, and provides design guidelines for building scalable, low‑latency, non‑blocking AI‑centric network architectures.

AI · Data center · High‑performance computing
10 min read

Architects' Tech Alliance
May 23, 2025 · Artificial Intelligence

Why High‑Performance Networks Are Critical for Large‑Scale AI Model Training

The whitepaper explains that AI model training and inference rely on massive data computation, with model sizes reaching billions of parameters, demanding low‑latency, high‑bandwidth, stable, scalable, and manageable networks; it compares RDMA‑based InfiniBand and RoCE solutions and offers design recommendations for future AI compute clusters.

AI · High‑Performance Networking · InfiniBand
10 min read

Architects' Tech Alliance
May 15, 2025 · Industry Insights

Why InfiniBand Still Beats Ethernet: Deep Dive into RDMA, Omni‑Path, and Protocol Layers

This article provides a comprehensive technical analysis of InfiniBand architecture, its protocol stack, comparison with Ethernet‑based RDMA solutions like RoCE and iWARP, and an overview of Omni‑Path, highlighting performance advantages, design trade‑offs, and practical limitations.

High‑performance computing · InfiniBand · Omni‑Path
19 min read

Linux Kernel Journey
May 8, 2025 · Artificial Intelligence

How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network

DeepSeek engineers highlighted Tencent’s open‑source TRMT and DeepEP contributions that boost GPU‑to‑GPU communication by up to 300%, double RoCE performance and add a further 30% gain on InfiniBand, while addressing lane‑utilization and CPU‑control bottlenecks through three targeted optimizations.

DeepEP · DeepSeek · GPU communication
6 min read

Tencent Tech
May 7, 2025 · Artificial Intelligence

How Tencent’s DeepEP Doubles GPU Communication Speed on RoCE Networks

Tencent engineers highlighted a massive speedup in DeepSeek’s open‑source DeepEP communication framework, revealing how their TRMT‑based optimizations—dynamic multi‑QP topology awareness, IBGDA‑driven CPU‑bypass, and atomic signaling—boost RoCE network throughput up to 300% and add another 30% gain when applied to InfiniBand, effectively doubling GPU communication performance for large AI models.

AI model training · DeepEP · GPU communication
8 min read

Architects' Tech Alliance
Mar 29, 2025 · Industry Insights

Why the Network Becomes the New Bottleneck for AI Training and How InfiniBand and RoCE Compare

AI large‑model training relies on GPU clusters, generating massive inter‑node traffic that turns network performance into the primary bottleneck, prompting a detailed comparison of InfiniBand and RoCE protocols, their histories, strengths, limitations, and the need for next‑generation network chip architectures.

AI · Data center · HPC
5 min read

Architects' Tech Alliance
Nov 7, 2024 · Industry Insights

Why RDMA, InfiniBand, and RoCE Are Redefining High‑Performance Data Center Networks

This article examines the evolution from the OSI and TCP/IP models to RDMA‑based technologies, compares traditional three‑tier and leaf‑spine architectures, analyzes NVIDIA SuperPOD designs, and evaluates Ethernet, InfiniBand, and RoCE switches to guide high‑throughput, low‑latency data‑center networking decisions.

Data Center Networking · High‑performance computing · InfiniBand
13 min read

Architects' Tech Alliance
Sep 8, 2024 · Artificial Intelligence

Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training

The article surveys the network architectures and congestion‑control techniques used in massive GPU clusters, such as ByteDance's MegaScale, Baidu HPN, Alibaba HPN 7.0, and Tencent Xingmai 2.0, highlighting how high‑bandwidth, low‑latency designs and advanced RDMA technologies enable training of trillion‑parameter multimodal AI models.

Data center · GPU clusters · HPN
11 min read

Architects' Tech Alliance
Aug 18, 2024 · Artificial Intelligence

RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies—including InfiniBand, RoCE, and iWARP—provide high‑throughput, low‑latency, CPU‑free data transfer for massive generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies to optimize AI‑focused data‑center networks.

AI training · High‑Performance Computing · InfiniBand
11 min read

Architects' Tech Alliance
Aug 1, 2024 · Industry Insights

Why RDMA and RoCE Are Becoming Critical Enablers for AI/ML Deployments

The article analyzes how the rapid shift of data‑center spending toward AI/ML has accelerated RDMA and RoCE adoption, outlines market forecasts through 2028, explains the technical advantages of direct memory access, and examines the evolving server, NIC, and backend‑network landscapes that will shape future AI workloads.

AI/ML · Data center · RDMA
12 min read

Open Source Linux
Jul 24, 2024 · Artificial Intelligence

Why RDMA Is the Secret Engine Powering AI/ML Data Center Growth

The article explains how RDMA and RoCE technologies, originally built for high‑performance computing, are rapidly expanding in AI/ML data centers, driving massive market growth, faster GPU communication, and lower job completion times as server designs evolve toward higher GPU counts and faster NICs.

AI/ML · Market Trends · RDMA
10 min read

Architects' Tech Alliance
Jul 7, 2024 · Operations

Designing High‑Performance Cluster Networks for AI Large Models: InfiniBand vs RoCE

The article analyzes the networking challenges of AI super‑large models, comparing InfiniBand and RoCE technologies, and presents design guidelines for ultra‑scale, high‑bandwidth, low‑latency, and highly stable cluster interconnects to maximize GPU utilization and overall training efficiency.

AI · GPU interconnect · High‑Performance Computing
14 min read

Architects' Tech Alliance
Jul 6, 2024 · Industry Insights

Why Ethernet Struggles with AI Workloads and How Adaptive Routing Solves It

The article analyzes how AI‑driven elephant flows overload traditional Ethernet networks, causing long‑tail latency and victim‑flow congestion, and explains how adaptive routing, RDMA/RoCE features, advanced congestion‑control algorithms, and high‑capacity switch chips can mitigate these challenges.

AI computing · Adaptive routing · Elephant flow
7 min read

Architects' Tech Alliance
May 23, 2024 · Cloud Computing

Design and Comparison of High‑Performance Cloud Data Center Networks for AI Computing

This article analyzes traditional cloud data center network limitations for AI workloads and compares various high‑bandwidth, low‑latency architectures—including two‑layer and three‑layer fat‑tree designs, InfiniBand, and RoCE—providing best‑practice recommendations for building scalable, non‑blocking AI‑Pool networks.

AI computing · Fat-Tree · GPU clusters
12 min read

Architects' Tech Alliance
May 9, 2024 · Industry Insights

Why RoCE Is Reshaping High‑Performance Computing Networks

The article provides a detailed technical analysis of RoCE (RDMA over Converged Ethernet), its two protocol versions, packet overhead, congestion‑control mechanisms, Soft‑RoCE implementation, and the challenges and performance implications of deploying RoCE in modern HPC environments compared to InfiniBand and traditional Ethernet solutions.

HPC · InfiniBand · RDMA
17 min read

360 Smart Cloud
Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training · Cloud Native · High‑Performance Networking
12 min read

Architects' Tech Alliance
Apr 21, 2024 · Fundamentals

Understanding RDMA: InfiniBand, RoCE, and Their Role in High‑Performance AI Model Training

This article explains how Remote Direct Memory Access (RDMA) technologies such as InfiniBand and RoCE bypass OS kernels to achieve ultra‑low latency and high bandwidth, discusses their hardware implementations, cost considerations, and their critical impact on large‑scale AI model training and HPC network design.

AI · GPU · High‑Performance Computing
11 min read

Linux Code Review Hub
Apr 7, 2024 · Industry Insights

A Decade of RDMA: Lessons Learned from Protocol Evolution

The article reviews ten years of RDMA development, tracing its origins, the rise and pitfalls of RoCEv1/v2, alternative approaches like iWARP and Cisco usNIC, and recent modernizations such as AWS SRD, Google Falcon and UltraEthernet, highlighting why protocol design choices have repeatedly stalled industry progress.

AI Accelerators · Data Center Networking · RDMA
27 min read

Architects' Tech Alliance
Dec 24, 2023 · Artificial Intelligence

Overview of Popular GPU/TPU Cluster Networking Technologies for LLM Training

This article examines the main GPU/TPU cluster networking options, including NVLink, InfiniBand, RoCE Ethernet fabric, and fully scheduled DDC networks, explaining their latency, lossless transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

GPU networking · InfiniBand · LLM training
18 min read

Architects' Tech Alliance
Apr 12, 2023 · Fundamentals

Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Case Studies

This article examines the RoCE protocol—an RDMA‑enabled Ethernet technology—its evolution, technical details, congestion‑control mechanisms, performance comparisons with InfiniBand, practical deployment issues in HPC clusters, and real‑world case studies such as Slingshot and application benchmarks.

HPC · RDMA · RoCE
19 min read

Architects' Tech Alliance
Dec 18, 2022 · Cloud Computing

Hyper‑Converged Data Center Network Architecture and Its Impact on Computational Efficiency

The article explains how hyper‑converged, lossless Ethernet networks integrate storage, high‑performance and general‑purpose compute zones, improve computational efficiency (CE) by reducing latency and power consumption, and outlines emerging technologies such as RoCE, NVMe‑over‑Fabric, PCIe‑free CPU/GPU designs, IPv6 deployment, and AI‑driven traffic management for modern data centers.

Computational Efficiency · Hyper-Converged · NVMe over Fabrics
11 min read

Architects' Tech Alliance
Sep 4, 2022 · Fundamentals

Applying RoCE (RDMA over Converged Ethernet) to High‑Performance Computing: Benefits, Challenges, and Case Studies

This article examines the RoCE protocol and its use in high‑performance computing, describing its low‑latency advantages, congestion‑control mechanisms, performance comparisons with InfiniBand, practical deployment issues, and real‑world case studies such as Slingshot and CESM/GROMACS benchmarks.

HPC · RDMA · RoCE
18 min read

Architects' Tech Alliance
May 19, 2022 · Fundamentals

An Introduction to RDMA: Concepts, Advantages, Protocols, and Programming Basics

This article explains the fundamentals of Remote Direct Memory Access (RDMA), comparing it with traditional networking, outlining its core advantages, suitable use cases, the three main RDMA protocols (InfiniBand, RoCE, iWARP), deployment requirements, communication flow, and essential programming concepts.

High‑Performance Networking · Low latency · RDMA
9 min read

Architects' Tech Alliance
May 14, 2022 · Fundamentals

High‑Performance Computing Network Solutions: RoCE v2, RDMA, and InfiniBand Overview

The article explains how high‑performance computing (HPC) networks overcome TCP/IP limitations by using RDMA‑based technologies such as RoCE v1/v2 and InfiniBand, detailing their architectures, advantages, vendor implementations, and cost‑effective migration to Ethernet‑based solutions for GPU‑driven workloads.

HPC · HighPerformanceComputing · InfiniBand
7 min read

Architects' Tech Alliance
Mar 4, 2022 · Operations

What Is InfiniBand RDMA and How to Configure It on RHEL 8?

This guide explains the fundamentals of InfiniBand and RDMA, details the InfiniBand Verbs API, outlines the steps required for kernel data handling, and provides practical configuration instructions for RoCE, IPoIB, and the subnet manager on Red Hat Enterprise Linux 8.

IPoIB · InfiniBand · Network Configuration
11 min read

IT Architects Alliance
Jan 26, 2022 · Industry Insights

Why NVMe‑over‑RoCE Is the Future of All‑Flash Data Center Networks

The article explains how the rise of all‑flash data centers has driven the adoption of NVMe storage protocols, compares NVMe‑over‑FC, TCP, and RoCE, highlights RoCE’s performance and reliability advantages, and details Huawei’s NoF+ solution that enhances network performance, reliability, and ease of use for modern storage networks.

Data Center Storage · Huawei · NVMe over Fabrics
11 min read

Architects' Tech Alliance
Sep 9, 2021 · Fundamentals

Understanding DMA and RDMA: Principles, Advantages, and Protocols

This article explains the concepts of Direct Memory Access (DMA) and Remote Direct Memory Access (RDMA), compares traditional data transfer with DMA-enabled paths, outlines RDMA's advantages such as zero-copy and kernel bypass, and reviews the main RDMA protocols, standards bodies, and hardware ecosystem.

DMA · High-Performance Computing · Kernel Bypass
14 min read

Architects' Tech Alliance
Mar 7, 2021 · Fundamentals

Understanding RDMA: InfiniBand, iWARP, and RoCE Technologies and Their Differences

This article explains Remote Direct Memory Access (RDMA), its origins in InfiniBand, the Ethernet‑based variants iWARP and RoCE (including RoCEv1 and RoCEv2), compares their architectures, performance characteristics, and deployment requirements for high‑performance computing and data‑center networks.

High‑Performance Networking · InfiniBand · RDMA
11 min read

Architects' Tech Alliance
Jan 10, 2021 · Industry Insights

Why RoCE Is Revolutionizing Data Center Networking: A Deep Dive into RDMA over Ethernet

This article explains the fundamentals of RDMA and RoCE, compares RoCE v1 and v2, outlines deployment steps, highlights performance benefits such as low CPU usage and zero‑copy, and answers common questions about its differences from iWARP and InfiniBand, helping data‑center engineers evaluate the technology.

Data Center Networking · High Bandwidth · Low latency
8 min read

UCloud Tech
Jan 16, 2020 · Operations

How to Build a Low‑Latency, Lossless RoCE Network for High‑Performance Data Centers

This article explains how to design a low‑overhead, high‑performance lossless RoCE network for data centers, covering RDMA basics, mainstream network options, QoS, lossless and congestion‑control designs, buffer management, deadlock analysis, and practical tuning to achieve sub‑100 µs latency and near‑full bandwidth utilization.

Data Center Networking · Lossless Ethernet · QoS
21 min read

Architects' Tech Alliance
Apr 8, 2019 · Fundamentals

Understanding RDMA: Principles, Advantages, and Implementation Details

This article explains how RDMA (Remote Direct Memory Access) technology, originating from InfiniBand and extended to Ethernet (RoCE) and TCP/IP (iWARP), provides ultra‑low latency, high throughput, and minimal CPU usage for high‑performance computing and big‑data applications by bypassing traditional OS and protocol stack processing.

High‑Performance Networking · Low latency · RDMA
8 min read

Architects' Tech Alliance
Dec 4, 2018 · Fundamentals

Understanding RDMA High‑Performance Networks: Principles, Benefits, and Applications in Machine Learning

The article explains the background, architecture, and performance advantages of RDMA high‑performance networking, compares it with traditional TCP/IP, describes its deployment at Baidu for machine‑learning workloads, and outlines future use cases such as storage acceleration, GPU communication, and core services.

High‑Performance Networking · RDMA · RoCE
12 min read

Architects' Tech Alliance
Nov 25, 2018 · Industry Insights

Why RDMA Makes NVMe‑over‑Fabric Faster: A Deep Dive into Fabrics, FC, InfiniBand, RoCE and TCP

The article examines how NVMe‑over‑Fabric extends NVMe beyond PCIe using various fabrics—FC, InfiniBand, RoCE v2, iWARP and TCP—highlighting RDMA’s zero‑copy, kernel‑bypass and CPU‑free advantages, and comparing protocol differences, performance trade‑offs, and the evolution toward NVMe/TCP.

Fibre Channel · InfiniBand · NVMe
13 min read

Architects' Tech Alliance
Apr 22, 2018 · Fundamentals

An Overview of Remote Direct Memory Access (RDMA): Principles, Comparisons, and Implementations

This article provides a comprehensive overview of Remote Direct Memory Access (RDMA), detailing its underlying principles, performance advantages over traditional TCP/IP, various protocol families such as InfiniBand, RoCE, and iWARP, and their respective hardware and software requirements.

High‑performance computing · InfiniBand · Low latency
9 min read