Tagged articles
13 articles
SuanNi
May 8, 2026 · Artificial Intelligence

How OpenAI’s MRC Protocol Redesigns Communication for 100,000‑GPU Clusters

OpenAI, together with AMD, Broadcom, Intel, Microsoft and Nvidia, introduced the Multipath Reliable Connection (MRC) protocol. MRC splits a single 800 Gb/s link into eight 100 Gb/s planes, enabling full‑mesh connectivity for more than 100,000 GPUs with fewer switches, lower cost, and higher resilience, plus dynamic load balancing that eliminates congestion and the impact of hardware failures during large‑scale AI training.

AI networking · GPU clusters · MRC
0 likes · 12 min read
Architects' Tech Alliance
Apr 22, 2026 · Industry Insights

Why AI Supernodes and 10,000‑GPU Clusters Will Dominate 2025

The article analyzes how AI supernodes, massive GPU clusters, knowledge‑base activation, embodied intelligence, optical interconnect and open‑source agents like OpenClaw together form a complete AI industry ecosystem in 2025, highlighting performance breakthroughs, domestic competition, market share shifts, and emerging security concerns.

AI supernodes · Embodied Intelligence · GPU clusters
0 likes · 16 min read
Baidu Intelligent Cloud Tech Hub
Sep 9, 2025 · Artificial Intelligence

How Baidu Built a 32,000‑Card AI Super‑Compute Cluster and Boosted Efficiency by 50%

This article details Baidu Intelligent Cloud's journey in designing, constructing, and operating a 32,000‑card hybrid AI compute cluster, covering challenges in power, cooling, networking, multi‑cluster scheduling, and security, and explains how innovative hardware, software, and operational strategies achieved over 50% MFU improvement and industry‑first performance records.

AI Infrastructure · GPU clusters · hybrid cloud
0 likes · 15 min read
Architects' Tech Alliance
Jul 23, 2025 · Artificial Intelligence

Why Do AI Large‑Model Training Clusters Need Specialized Network Topologies?

The article explains how AI large‑model training demands massive GPU resources and how carefully designed network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, multi‑rail versus single‑rail connections, Dragonfly, and Torus—impact performance, scalability, cost, and reliability, guiding the selection of optimal data‑center networks.

AI · Data center · GPU clusters
0 likes · 9 min read
Kuaishou Tech
Nov 21, 2024 · Artificial Intelligence

Best Practices for Training Large Language Models on Ultra‑Large Scale Clusters

This article summarizes the challenges of distributed training for massive language models and presents a suite of solutions—including DP/TP/PP overlap, context parallelism, efficient recomputation, and a performance‑aware cost model—that together boost training throughput by over 30% on large GPU clusters.

Distributed Training · GPU clusters · Performance Modeling
0 likes · 27 min read
Architects' Tech Alliance
Sep 15, 2024 · Industry Insights

How to Build a Super‑Scale AI Cluster: From GPU Power to DPU‑Driven Architecture

This article analyzes the technical roadmap for upgrading AI super‑large GPU clusters to support trillion‑parameter multimodal models, covering single‑chip performance, super‑node scaling, DPU‑based compute fusion, energy‑efficient designs, converged storage, high‑throughput networking, and fault‑tolerant checkpoint strategies.

AI compute · DPU · GPU clusters
0 likes · 18 min read
Architects' Tech Alliance
Sep 8, 2024 · Artificial Intelligence

Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training

The article surveys the network architectures and congestion‑control techniques used in massive GPU clusters, such as ByteDance's MegaScale, Baidu HPN, Alibaba HPN7, and Tencent Xingmai 2.0, highlighting how high‑bandwidth, low‑latency designs and advanced RDMA technologies enable training of trillion‑parameter multimodal AI models.

Data center · GPU clusters · HPN
0 likes · 11 min read
Architects' Tech Alliance
Jul 1, 2024 · Industry Insights

Why Fat-Tree, Dragonfly, and Torus Topologies Matter for HPC Networks

The article analyzes three major high‑performance‑computing network topologies—Fat‑Tree, Dragonfly, and Torus—detailing their design principles, scalability formulas, routing strategies, advantages, and limitations to help architects choose the most suitable architecture for large‑scale GPU clusters.

Dragonfly · Fat-Tree · GPU clusters
0 likes · 13 min read
Architects' Tech Alliance
May 23, 2024 · Cloud Computing

Design and Comparison of High‑Performance Cloud Data Center Networks for AI Computing

This article analyzes traditional cloud data center network limitations for AI workloads and compares various high‑bandwidth, low‑latency architectures—including two‑layer and three‑layer fat‑tree designs, InfiniBand, and RoCE—providing best‑practice recommendations for building scalable, non‑blocking AI‑Pool networks.

AI computing · Fat-Tree · GPU clusters
0 likes · 12 min read
Architects' Tech Alliance
Apr 6, 2024 · Artificial Intelligence

How ByteDance Scaled LLM Training to Over 10,000 GPUs: Inside the MegaScale System

The article analyzes ByteDance and Peking University's MegaScale system that enables efficient, stable training of large language models on clusters exceeding ten thousand GPUs, detailing algorithmic tweaks, 3D parallel communication overlap, operator optimizations, data‑pipeline improvements, network tuning, and fault‑tolerance mechanisms that together achieve a 55.2% MFU on a 175B model.

Distributed Systems · GPU clusters · LLM training
0 likes · 15 min read
Baidu Intelligent Cloud Tech Hub
May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI Infrastructure · Distributed Training · GPU clusters
0 likes · 19 min read
Tencent Cloud Developer
Mar 22, 2023 · Artificial Intelligence

Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network delivers a 1.6 Tbps Ethernet‑RDMA fabric built on a fat‑tree topology that supports up to 4K GPUs, with multi‑track traffic aggregation, adaptive heterogeneous links, and a custom TCCL library. Together these cut AllReduce overhead from 35% to 3.7% and speed up AI training iterations by 32%, while automating deployment and providing sub‑second self‑healing.

AI training · GPU clusters · RDMA
0 likes · 19 min read
Baidu Geek Talk
Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI Infrastructure · Cost Model · Distributed Training
0 likes · 27 min read