Tagged articles
29 articles
Baidu Intelligent Cloud Tech Hub
Nov 10, 2025 · Cloud Computing

How Polar‑TCP Breaks Kernel Network Bottlenecks for Million‑IOPS Cloud Services

This article explains how traditional kernel network stacks struggle with modern cloud data‑center workloads and introduces Baidu Intelligent Cloud's Polar solution—Polar‑TCP and Polar‑RDMA—which combine user‑space DPDK drivers, a lightweight TCP stack, and an industrial‑grade RPC framework to achieve near‑RDMA performance while preserving ecosystem compatibility.

DPDK, High‑Performance Networking, Network Stack
24 min read
Architects' Tech Alliance
Oct 12, 2025 · Artificial Intelligence

How InfiniBand Powers AI Training: Deep Dive into RDMA, RoCEv2, and High‑Speed Interconnects

This article explains how InfiniBand’s architecture, native RDMA, GPUDirect, and evolving bandwidth enable ultra‑low‑latency, high‑throughput communication for AI model training, compares it with Ethernet, and details the role of RoCEv2 and other high‑performance interconnect technologies.

AI training, GPU interconnect, High‑Performance Networking
9 min read
Architects' Tech Alliance
Oct 8, 2025 · Artificial Intelligence

What Is UALink? The Open High‑Performance Interconnect Shaping AI Accelerator Clusters

UALink is an open, high‑performance interconnect standard designed to link thousands of AI accelerators, offering NVLink‑level bandwidth, low latency, scalability, cost efficiency, and flexible topologies to meet the demanding communication needs of modern AI workloads.

AI interconnect, High‑Performance Networking, Protocol Stack
8 min read
Architects' Tech Alliance
May 23, 2025 · Artificial Intelligence

Why High‑Performance Networks Are Critical for Large‑Scale AI Model Training

The whitepaper explains that AI model training and inference rely on massive data computation, with model sizes reaching billions of parameters, demanding low‑latency, high‑bandwidth, stable, scalable, and manageable networks; it compares RDMA‑based InfiniBand and RoCE solutions and offers design recommendations for future AI compute clusters.

AI, High‑Performance Networking, InfiniBand
10 min read
BirdNest Tech Talk
Nov 20, 2024 · Industry Insights

Inside xAI’s 100k‑GPU Colossus: Supermicro Liquid‑Cooled Racks Explained

The article provides a detailed, step‑by‑step tour of xAI’s Colossus supercomputer, a multibillion‑dollar AI cluster built in 122 days with 100,000 NVIDIA H100 GPUs, covering Supermicro liquid‑cooled 4U racks, cooling distribution units, power and water infrastructure, storage nodes, CPU servers, 400 GbE networking, and the operational challenges of scaling such a massive system.

AI supercomputing, Colossus, Data center architecture
16 min read
Baidu Intelligent Cloud Tech Hub
Sep 29, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Redefines AI Infrastructure for Large‑Model Training

The article details Baidu Baige 4.0’s four‑layer AI infrastructure—hardware, cluster components, training‑inference acceleration, and platform tools—highlighting its heterogeneous computing, high‑performance networking, fault‑tolerant communication library, and optimizations that boost large‑model training and inference efficiency.

AI Infrastructure, High‑Performance Networking, heterogeneous computing
17 min read
Architects' Tech Alliance
Sep 15, 2024 · Industry Insights

How to Build a Super‑Scale AI Cluster: From GPU Power to DPU‑Driven Architecture

This article analyzes the technical roadmap for upgrading AI super‑large GPU clusters to support trillion‑parameter multimodal models, covering single‑chip performance, super‑node scaling, DPU‑based compute fusion, energy‑efficient designs, converged storage, high‑throughput networking, and fault‑tolerant checkpoint strategies.

AI compute, DPU, GPU clusters
18 min read
Architects' Tech Alliance
Sep 8, 2024 · Artificial Intelligence

Design and Architecture of Multi‑Million GPU Clusters for Large‑Scale AI Model Training

The article surveys the network architectures and congestion‑control techniques used in massive GPU clusters, such as ByteDance’s MegaScale, Baidu HPN, Alibaba HPN7, and Tencent Xingmai 2.0, highlighting how high‑bandwidth, low‑latency designs and advanced RDMA technologies enable training of trillion‑parameter multimodal AI models.

Data center, GPU clusters, HPN
11 min read
Architects' Tech Alliance
May 23, 2024 · Cloud Computing

Design and Comparison of High‑Performance Cloud Data Center Networks for AI Computing

This article analyzes traditional cloud data center network limitations for AI workloads and compares various high‑bandwidth, low‑latency architectures—including two‑layer and three‑layer fat‑tree designs, InfiniBand, and RoCE—providing best‑practice recommendations for building scalable, non‑blocking AI‑Pool networks.

AI computing, Fat-Tree, GPU clusters
12 min read
ByteDance SYS Tech
Apr 26, 2024 · Backend Development

How io_uring Integration Boosts Netpoll Throughput and Slashes Latency

This article examines the integration of Linux io_uring into ByteDance's high‑performance Netpoll NIO library, detailing architectural changes, receive/send workflows, benchmarking methodology, and results that show over 10% higher throughput and 20‑40% lower latency while eliminating system calls.

Benchmark, Go, High‑Performance Networking
18 min read
360 Smart Cloud
Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training, Cloud Native, High‑Performance Networking
12 min read
Architects' Tech Alliance
Apr 3, 2024 · Industry Insights

InfiniBand vs. RoCE v2: Choosing the Best Network for AI Data Centers

This article provides a detailed technical comparison between InfiniBand and RoCE v2, covering architecture, lossless transmission, adaptive routing, major vendors, performance, scalability, operational complexity, and cost considerations to help AI data center architects select the most suitable high‑performance network solution.

AI data center, High‑Performance Networking, InfiniBand
13 min read
Linux Code Review Hub
Feb 20, 2024 · Fundamentals

Why TCP Needs a Rethink: RDMA Insights and 800 Gbps Experiments

The talk examines the challenges of using standard Linux TCP for high‑performance data‑center workloads, explores how RDMA can provide zero‑copy and asynchronous kernel bypass, and presents experimental results from an FPGA‑based prototype that approaches 800 Gbps packet rates while highlighting congestion‑control and CPU‑utilization trade‑offs.

FPGA, High‑Performance Networking, Kernel Bypass
23 min read
NetEase LeiHuo UX Big Data Technology
Jan 17, 2024 · Backend Development

Understanding DPDK: Background, Architecture, High‑Performance Techniques, and Real‑World Applications

This article explains the origins of DPDK, describes its modular architecture and performance‑enhancing mechanisms such as UIO, hugepages, and CPU affinity, and reviews popular user‑space networking frameworks like F‑Stack and Seastar that leverage DPDK for high‑throughput cloud services.

DPDK, F-Stack, High‑Performance Networking
9 min read
Alibaba Cloud Infrastructure
Nov 11, 2023 · Cloud Computing

Alibaba Cloud Executive Discusses IPv6 Deployment, Global Collaboration, and AI‑Driven Network Evolution at the 2023 Wuzhen Internet Forum

In a detailed interview at the 2023 Wuzhen Internet Forum, Alibaba Cloud’s infrastructure lead Cai Dezhong outlines the three‑phase IPv6 rollout, highlights organizational and technical innovations, stresses the need for global cooperation, and explains how IPv6 underpins next‑generation AI infrastructure and predictable high‑performance networking.

AI Infrastructure, Global Collaboration, High‑Performance Networking
9 min read
Alibaba Cloud Infrastructure
Sep 19, 2023 · Cloud Computing

AI‑Era Cloud Infrastructure: High Compute Density, Linear Scalability and Intelligent Operations – Highlights from the 2023 Open Data Center Conference

The 2023 Open Data Center Conference in Beijing showcased Alibaba Cloud's AI‑era infrastructure innovations—including high‑density compute clusters, predictable high‑performance networking, intelligent power‑simulation systems, battery diagnostics, liquid‑cooling solutions, and modular server standards—demonstrating how cloud platforms are being rebuilt to meet the demands of large AI models and sustainable operation.

AI, High‑Performance Networking, Intelligent Operations
10 min read
Architects' Tech Alliance
Aug 10, 2023 · Industry Insights

InfiniBand vs RoCEv2: Which Network Powers AI Model Training?

This article examines the architecture of AI compute clusters, explaining offline training and inference pipelines, the role of RDMA, and the technical differences between InfiniBand and RoCEv2—including latency, bandwidth, scalability, cost, and vendor considerations—to help engineers choose the optimal high‑performance network for large‑model training.

AI compute, Distributed Training, High‑Performance Networking
13 min read
Baidu Intelligent Cloud Tech Hub
Jun 21, 2023 · Artificial Intelligence

How Baidu’s AIPod Network Powers Massive AI Model Training

This article explains the design and engineering of Baidu's AIPod high‑performance network, detailing the massive bandwidth, scalability, stability, and low‑latency requirements of large‑scale AI model training and the practical tools used to monitor and troubleshoot such workloads.

AI, AIPod, Distributed Training
22 min read
Alibaba Cloud Big Data AI Platform
Jun 19, 2023 · Cloud Computing

Predictable Network: Alibaba Cloud’s Ethernet Edge for Faster AI Training

This article examines the challenges of scaling AI model training beyond single-chip limits, introduces Alibaba Cloud’s Predictable Network architecture—including high‑performance Ethernet, dual‑uplink, and adaptive routing—and compares its performance, scalability, and reliability against InfiniBand, showing how Ethernet can meet AI workloads with minimal loss.

AI training, Ethernet vs InfiniBand, High‑Performance Networking
27 min read
Tencent Cloud Developer
Dec 20, 2022 · Cloud Computing

HARP – Tencent Cloud's High‑Performance, Highly Available Network Transmission Protocol

HARP is Tencent Cloud's high-performance, highly available network transmission protocol that quickly reroutes around switch failures within 100 µs, offering zero packet loss, low latency, high bandwidth, scalable connections, and custom congestion control for storage, HPC, AI, and big data workloads.

Data center, HARP, High‑Performance Networking
15 min read
Tencent Cloud Developer
Jun 6, 2022 · Cloud Computing

High‑Performance Network Solutions: RDMA, RoCE, iWARP and io_uring – Principles, Implementation and Benchmark Analysis

The article reviews high‑performance networking options—RDMA (including RoCE v2 and iWARP) and Linux’s io_uring—explaining their principles, hardware requirements, and benchmark results, and concludes that while RDMA delivers ultra‑low latency for specialized workloads, io_uring offers modest network benefits, leaving TCP as the default for most services.

Benchmark, High‑Performance Networking, RDMA
10 min read
Architects' Tech Alliance
May 19, 2022 · Fundamentals

An Introduction to RDMA: Concepts, Advantages, Protocols, and Programming Basics

This article explains the fundamentals of Remote Direct Memory Access (RDMA), comparing it with traditional networking, outlining its core advantages, suitable use cases, the three main RDMA protocols (InfiniBand, RoCE, iWARP), deployment requirements, communication flow, and essential programming concepts.

High‑Performance Networking, Low latency, RDMA
9 min read
Architects' Tech Alliance
Mar 7, 2021 · Fundamentals

Understanding RDMA: InfiniBand, iWARP, and RoCE Technologies and Their Differences

This article explains Remote Direct Memory Access (RDMA), its origins in InfiniBand, and the Ethernet‑based variants iWARP and RoCE (including RoCEv1 and RoCEv2), and compares their architectures, performance characteristics, and deployment requirements for high‑performance computing and data‑center networks.

High‑Performance Networking, InfiniBand, RDMA
11 min read
Architects' Tech Alliance
Nov 11, 2020 · Fundamentals

Understanding DPDK Memory Management: Large Pages, NUMA, DMA, and IOMMU

This article explains the core principles of DPDK memory management, covering standard huge pages, NUMA node binding, direct memory access, IOMMU and IOVA addressing, custom allocators, and memory pools, and how these mechanisms together enable high‑performance packet processing on Linux systems.

DMA, DPDK, High‑Performance Networking
14 min read
Alibaba Cloud Infrastructure
Oct 9, 2019 · Cloud Computing

The Next Decade of Cloud Networking: Highlights from Alibaba Cloud Network Forum at Yunqi Conference 2019

The 2019 Yunqi Conference Cloud Network Forum gathered over two hundred network enthusiasts to review a decade of Alibaba data‑center networking evolution, explore emerging technologies such as AI, big data, and programmable chips, and outline the next ten years of high‑performance, data‑centric cloud networking.

Big Data, High‑Performance Networking, network architecture
9 min read
Architects' Tech Alliance
Jul 5, 2019 · Backend Development

A Comprehensive Overview of DPDK and SPDK Technologies

This article provides an in‑depth technical overview of DPDK and SPDK, covering their background, the evolution of network I/O, Linux bottlenecks, user‑space I/O via UIO, poll‑mode drivers, performance‑optimizing techniques such as huge pages, SIMD, cache management, and the surrounding ecosystem and adoption.

DPDK, High‑Performance Networking, SPDK
15 min read
Architects' Tech Alliance
Apr 8, 2019 · Fundamentals

Understanding RDMA: Principles, Advantages, and Implementation Details

This article explains how RDMA (Remote Direct Memory Access) technology, originating from InfiniBand and extended to Ethernet (RoCE) and TCP/IP (iWARP), provides ultra‑low latency, high throughput, and minimal CPU usage for high‑performance computing and big‑data applications by bypassing traditional OS and protocol stack processing.

High‑Performance Networking, Low latency, RDMA
8 min read
Architects' Tech Alliance
Dec 4, 2018 · Fundamentals

Understanding RDMA High‑Performance Networks: Principles, Benefits, and Applications in Machine Learning

The article explains the background, architecture, and performance advantages of RDMA high‑performance networking, compares it with traditional TCP/IP, describes its deployment at Baidu for machine‑learning workloads, and outlines future use cases such as storage acceleration, GPU communication, and core services.

High‑Performance Networking, RDMA, RoCE
12 min read