Tagged articles
9 articles
AI Cyberspace
Nov 19, 2025 · Artificial Intelligence

Why MPI and NCCL Are Critical for Scaling AI Models Across Thousands of GPUs

This article explains how AI model training has evolved from single‑GPU workloads to massive distributed training using MPI for CPU‑centric communication and NCCL for GPU‑centric communication, covering their histories, core concepts, programming interfaces, topology discovery, protocol choices, and performance testing on multi‑GPU clusters.

AI distributed training · GPU communication · High‑performance computing
71 min read
AI Cyberspace
Mar 14, 2025 · Artificial Intelligence

How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of NVIDIA’s Collective Communication Library (NCCL), detailing its role in multi‑GPU and multi‑node AI distributed training, topology discovery, path selection, channel search, and various collective communication operations.

CUDA · GPU communication · MPI
33 min read
Architects' Tech Alliance
Apr 17, 2023 · Fundamentals

Overview of High‑Performance Computing (HPC): Architecture, Metrics, Cluster Management, Job Scheduling, and Parallel Programming Models

This article provides a comprehensive overview of high‑performance computing, covering system architectures, hardware components, performance metrics, network topologies, common parallel file systems, cluster management functions, mainstream job‑scheduling systems, and MPI‑based parallel programming models.

Cluster · HPC · High‑performance computing
14 min read
Architects' Tech Alliance
May 3, 2022 · Fundamentals

High‑Performance Computing Overview and Resource Guide

This article provides a comprehensive overview of high‑performance computing (HPC), covering its definition, hardware architectures, performance metrics, cluster components, parallel file systems, management and scheduling tools, as well as common MPI implementations and links to further technical resources.

Cluster · FLOPS · File Systems
11 min read
DataFunSummit
Nov 29, 2021 · Artificial Intelligence

Horovod Distributed Training Plugin: Design, Usage, and Deadlock Prevention

This article reviews Horovod, a popular third‑party distributed deep‑learning training plugin, explaining its simple three‑line integration, the challenges of deadlocks in all‑reduce operations, and the architectural components—including background threads, coordinators, and MPI/Gloo controllers—that enable scalable and efficient data‑parallel training.

Data Parallel · Deep Learning · Distributed Training
8 min read
Tencent Cloud Developer
May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s synchronous, data‑parallel Ring All‑Reduce architecture, built on MPI and NCCL. This cuts training time from several days to under one day while maintaining accuracy; future work will target I/O and further scaling.

AI · Distributed Training · Horovod
12 min read
21CTO
Sep 19, 2015 · Artificial Intelligence

Why Distributed Machine Learning Needs More Data Than Speed

The article explains how distributed machine learning evolved from parallel computing to handle massive, long‑tail data sets, discusses the importance of scalability, fault recovery, and data‑parallel algorithms, and reviews frameworks such as MPI, MapReduce, and Pregel for building large‑scale AI systems.

Big Data · Data Parallelism · LDA
24 min read
Efficient Ops
Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Hadoop
21 min read