Why MPI and NCCL Are Critical for Scaling AI Models Across Thousands of GPUs

This article explains how AI model training has evolved from single‑GPU workloads to massive distributed training using MPI for CPU‑centric communication and NCCL for GPU‑centric communication, covering their histories, core concepts, programming interfaces, topology discovery, protocol choices, and performance testing on multi‑GPU clusters.


AI Distributed Training Overview

Early AI model training could fit on a single GPU, but modern large language models (LLMs) require thousands of GPUs across many servers, making distributed training essential. MPI (Message Passing Interface) and NCCL (NVIDIA Collective Communication Library) provide the communication backbone for such workloads.

MPI Origins and OpenMPI

In the 1990s, CPU‑centric supercomputers needed a standard for multi‑process communication, leading to the MPI standard in 1994. Open MPI is a widely used open‑source implementation, offering point‑to‑point and collective communication primitives such as MPI_Send, MPI_Recv, MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allreduce, and others, each covered in the sections below.

Basic MPI Concepts

Process: An independent instance of the program; in the SPMD model, every process runs the same executable.

Process Group: An ordered set of processes that can communicate together.

Communicator: Defines the communication domain (e.g., MPI_COMM_WORLD or MPI_COMM_SELF).

Message: Data transferred between ranks, consisting of payload, source, destination, tag, and datatype.

Rank: The unique identifier of a process within a communicator. The sketch after this list ties these concepts together.
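
A minimal sketch illustrating process, rank, and communicator (the file name mpi_hello.c is illustrative, not from the original article):

#include <mpi.h>
#include <stdio.h>

// mpi_hello.c -- every process runs this same program (SPMD).
int main(int argc, char* argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);               // start the MPI runtime
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // my rank in the world communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}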

Point‑to‑Point Communication Functions

int MPI_Send(const void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status* status);
int MPI_Isend(const void* buf, int count, MPI_Datatype datatype, int dest, int tag,
              MPI_Comm comm, MPI_Request* request);   // non‑blocking send
int MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Request* request);   // non‑blocking receive
int MPI_Ssend(const void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm); // synchronous send
int MPI_Bsend(const void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm); // buffered send
int MPI_Rsend(const void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm); // ready send
int MPI_Sendrecv(const void* sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag,
                 void* recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status* status);  // send and receive in one call
int MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype, int dest, int sendtag,
                         int source, int recvtag, MPI_Comm comm, MPI_Status* status); // in‑place version
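
A brief sketch of blocking point‑to‑point use, assuming the program is launched with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); // to rank 1, tag 0
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}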

Collective Communication Functions

int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
int MPI_Scatter(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                void* recvbuf, int recvcount, MPI_Datatype recvtype,
                int root, MPI_Comm comm);
int MPI_Gather(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
               void* recvbuf, int recvcount, MPI_Datatype recvtype,
               int root, MPI_Comm comm);
int MPI_Allgather(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                  void* recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm);
int MPI_Alltoall(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                 void* recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm);
int MPI_Reduce(const void* sendbuf, void* recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
int MPI_Allreduce(const void* sendbuf, void* recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
int MPI_Reduce_scatter(const void* sendbuf, void* recvbuf, const int recvcounts[],
                       MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
int MPI_Scan(const void* sendbuf, void* recvbuf, int count,
             MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
int MPI_Exscan(const void* sendbuf, void* recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
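
MPI_Allreduce is the collective most relevant to data‑parallel training (summing gradients across all ranks), so a small usage sketch may help:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, size, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // Every rank contributes its rank number; every rank receives the total.
    MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: sum of all ranks = %d (expected %d)\n", rank, sum, size*(size-1)/2);
    MPI_Finalize();
    return 0;
}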

Synchronization

The MPI_Barrier call blocks each process in a communicator until every process has reached the barrier. Note that it synchronizes process execution only; it does not by itself guarantee that previously posted non‑blocking communication has completed.
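
A small sketch showing the effect (ranks arrive at different times, but none proceeds until all have arrived):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sleep(rank);                  // stagger arrival times
    MPI_Barrier(MPI_COMM_WORLD);  // nobody passes until all ranks arrive
    printf("rank %d passed the barrier\n", rank);
    MPI_Finalize();
    return 0;
}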

Running MPI Programs

MPI programs are launched with mpirun (or mpiexec), which creates the required number of processes across the specified hosts. Common options include --allow-run-as-root, --hostfile, --map-by, and -np to control process count and placement.

# Compile
mpicc -o mpi_concepts_example mpi_concepts_example.c
# Run with 4 processes
mpirun -np 4 ./mpi_concepts_example
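
For multi‑node runs, a hostfile lists the machines and the number of slots on each; the hostnames below are placeholders:

# hosts.txt (hypothetical two-node cluster, 4 slots each)
node1 slots=4
node2 slots=4

# Launch 8 ranks, 4 per node
mpirun --hostfile hosts.txt -np 8 --map-by ppr:4:node ./mpi_concepts_example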

Sample MPI Program Output

[Process 0]: I am one process in an MPI parallel program; there are 4 processes in total
[Process 1]: I am one process in an MPI parallel program; there are 4 processes in total
[Process 2]: I am one process in an MPI parallel program; there are 4 processes in total
[Process 3]: I am one process in an MPI parallel program; there are 4 processes in total
[Process 0]: I was assigned to the even group (color=0); my rank within the group is 0
[Process 2]: I was assigned to the even group (color=0); my rank within the group is 1
[Process 1]: I was assigned to the odd group (color=1); my rank within the group is 0
[Process 3]: I was assigned to the odd group (color=1); my rank within the group is 1
[Process 0]: My rank in the global communicator is 0; my rank in the group communicator is 0
[Process 1]: My rank in the global communicator is 1; my rank in the group communicator is 0
[Process 2]: My rank in the global communicator is 2; my rank in the group communicator is 1
[Process 3]: My rank in the global communicator is 3; my rank in the group communicator is 1
[Process 0]: My even group finished its broadcast; the received data is 0
[Process 2]: My even group finished its broadcast; the received data is 0
[Process 1]: My odd group finished its broadcast; the received data is 101
[Process 3]: My odd group finished its broadcast; the received data is 101
--- MPI message passing begins ---
[Process 0 (even group)]: Preparing to send a message to the odd group
[Process 0 (even group)]: Message sent
[Process 1 (odd group)]: Received message: "Hello, odd group! This message comes from process 0 of the even group"
[Process 1 (odd group)]: Message length is 54 characters
[Process 0]: Communication finished, preparing to exit
[Process 1]: Communication finished, preparing to exit
[Process 2]: Communication finished, preparing to exit
[Process 3]: Communication finished, preparing to exit
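
The source for mpi_concepts_example.c is not reproduced in the article, but the output implies its structure: split MPI_COMM_WORLD into even and odd groups with MPI_Comm_split, broadcast within each group, then send one cross‑group message. A hypothetical reconstruction (not the author's original code) that produces output of this shape:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    printf("[Process %d]: one of %d processes in this MPI program\n", world_rank, world_size);

    // Split the world into an even group (color=0) and an odd group (color=1).
    int color = world_rank % 2;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);
    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);
    printf("[Process %d]: group color=%d, rank within group %d\n", world_rank, color, group_rank);

    // Each group's rank 0 broadcasts a group-specific value (0 or 101).
    int data = (group_rank == 0) ? (color == 0 ? 0 : 101) : -1;
    MPI_Bcast(&data, 1, MPI_INT, 0, group_comm);
    printf("[Process %d]: group broadcast done, received %d\n", world_rank, data);

    // Even-group process 0 sends a greeting to odd-group process 0 (world rank 1).
    if (world_rank == 0) {
        const char msg[] = "Hello, odd group! This message comes from process 0 of the even group";
        MPI_Send(msg, (int)sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        char buf[128]; MPI_Status st; int len;
        MPI_Recv(buf, (int)sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_CHAR, &len);
        printf("[Process %d]: received \"%s\" (%d chars)\n", world_rank, buf, len);
    }
    printf("[Process %d]: communication finished, ready to exit\n", world_rank);

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}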

NCCL Origin and Architecture

MPI was designed for CPUs and does not consider GPU‑specific latency, bandwidth, or direct memory access. NVIDIA created NCCL in 2015 to provide a GPU‑centric collective communication library that works over NVLink, PCIe, and RDMA networks. NCCL and MPI are complementary: MPI handles CPU‑side coordination, while NCCL handles GPU‑side data movement.

The NVIDIA Collective Communication Library (NCCL) implements multi‑GPU and multi‑node communication primitives optimized for NVIDIA GPUs and networking.

NCCL Software Stack

Northbound APIs: Exposed to AI frameworks such as PyTorch, TensorFlow, and Paddle.

Southbound APIs: Built on top of CUDA libraries to issue low‑level GPU commands.

Installation

NCCL is primarily distributed for Linux. It can be installed as a single package or via the NVIDIA HPC SDK. Example commands:

# Ubuntu example
sudo apt install libnccl2=2.25.1-1+cuda12.8 libnccl-dev=2.25.1-1+cuda12.8
# RHEL/CentOS example
sudo yum install libnccl-2.25.1-1+cuda12.8 libnccl-devel-2.25.1-1+cuda12.8 libnccl-static-2.25.1-1+cuda12.8

NCCL Core Concepts

process: Same meaning as an MPI process.

rank: Unique identifier within a communicator; often maps 1:1 to a GPU, but this is not required.

group: Used for grouping NCCL operations via ncclGroupStart() / ncclGroupEnd(); see the sketch after this list.

communicator (ncclComm_t): Holds nranks, rank, and a unique ncclUniqueId.

node, nproc_per_node, nnodes, nranks, node_rank, local_rank: Standard distributed‑training terminology.
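
A common use of groups is driving several GPUs from a single process: wrapping the per‑device calls in one group lets NCCL launch them together instead of stalling on the first. A minimal sketch of that pattern (error checking omitted for brevity; this is the single‑process variant, distinct from the MPI‑based flow below):

#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev < 1) return 1;
    if (nDev > 8) nDev = 8;

    ncclComm_t comms[8]; int devs[8];
    float* buf[8]; cudaStream_t streams[8];
    for (int i = 0; i < nDev; ++i) devs[i] = i;

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], 1024 * sizeof(float));
        cudaMemset(buf[i], 0, 1024 * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, nDev, devs);   // one communicator per local GPU

    ncclGroupStart();                     // batch the per-GPU collectives
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], 1024, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
    for (int i = 0; i < nDev; ++i) { ncclCommDestroy(comms[i]); cudaFree(buf[i]); }
    printf("grouped all-reduce across %d GPUs done\n", nDev);
    return 0;
}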

Typical NCCL Program Flow

Generate a unique NCCL ID on rank 0 with ncclGetUniqueId and broadcast it via MPI.

Each rank calls ncclCommInitRank to create its communicator.

Allocate GPU buffers and CUDA streams.

Optionally wrap multiple NCCL calls in ncclGroupStart/End to batch them.

Execute collective primitives such as ncclAllReduce, ncclBroadcast, etc.

Synchronize streams, free resources, and finalize MPI.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <nccl.h>
#include <mpi.h>

// Error-checking wrappers in the style of the NCCL documentation examples.
#define MPICHECK(cmd) do { int e = (cmd); if (e != MPI_SUCCESS) { \
    printf("MPI error %d at %s:%d\n", e, __FILE__, __LINE__); exit(EXIT_FAILURE); } } while(0)
#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    printf("CUDA error %s at %s:%d\n", cudaGetErrorString(e), __FILE__, __LINE__); exit(EXIT_FAILURE); } } while(0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    printf("NCCL error %s at %s:%d\n", ncclGetErrorString(r), __FILE__, __LINE__); exit(EXIT_FAILURE); } } while(0)

// FNV-1a hash of the hostname, used to group ranks running on the same node.
static uint64_t getHostHash(const char* s) {
    uint64_t h = 0xcbf29ce484222325ull;
    for (; *s; ++s) { h ^= (uint64_t)(unsigned char)*s; h *= 0x100000001b3ull; }
    return h;
}

static void getHostName(char* buf, int len) {
    gethostname(buf, len);
    buf[len - 1] = '\0';
}

int main(int argc, char* argv[]) {
    const int size = 32*1024*1024; // 32M floats (128 MiB) per GPU
    int myRank, nRanks, localRank = 0;
    MPICHECK(MPI_Init(&argc, &argv));
    MPICHECK(MPI_Comm_rank(MPI_COMM_WORLD, &myRank));
    MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &nRanks));
    // 1) Determine localRank (the GPU index on this node): gather every
    //    rank's hostname hash, then count lower ranks on the same host.
    char hostname[1024]; getHostName(hostname, 1024);
    uint64_t hostHash = getHostHash(hostname);
    uint64_t allHashes[nRanks];
    allHashes[myRank] = hostHash; // fill own slot before the in-place allgather
    MPICHECK(MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                           allHashes, sizeof(uint64_t), MPI_BYTE,
                           MPI_COMM_WORLD));
    for (int p=0; p<myRank; ++p) if (allHashes[p]==hostHash) ++localRank;
    // 2) Allocate buffers and a stream on this rank's GPU
    float *sendbuf, *recvbuf; cudaStream_t s;
    CUDACHECK(cudaSetDevice(localRank));
    CUDACHECK(cudaMalloc(&sendbuf, size*sizeof(float)));
    CUDACHECK(cudaMalloc(&recvbuf, size*sizeof(float)));
    CUDACHECK(cudaMemset(sendbuf, 1, size*sizeof(float))); // byte pattern 0x01, not float 1.0f
    CUDACHECK(cudaMemset(recvbuf, 0, size*sizeof(float)));
    CUDACHECK(cudaStreamCreate(&s));
    // 3) NCCL init: rank 0 creates the unique ID, MPI broadcasts it to all ranks
    ncclUniqueId id; if (myRank==0) NCCLCHECK(ncclGetUniqueId(&id));
    MPICHECK(MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));
    ncclComm_t comm; NCCLCHECK(ncclCommInitRank(&comm, nRanks, id, myRank));
    // 4) All-reduce, issued asynchronously on stream s
    NCCLCHECK(ncclAllReduce(sendbuf, recvbuf, size, ncclFloat, ncclSum, comm, s));
    CUDACHECK(cudaStreamSynchronize(s));
    printf("[MPI Rank %d] Success\n", myRank);
    // 5) Cleanup
    NCCLCHECK(ncclCommDestroy(comm));
    CUDACHECK(cudaFree(sendbuf)); CUDACHECK(cudaFree(recvbuf));
    MPICHECK(MPI_Finalize());
    return 0;
}

NCCL Core Advantages

Hardware‑aware topology detection (NVLink, PCIe, InfiniBand, RoCEv2) and automatic path selection.

Computation‑communication overlap via CUDA streams.

Dynamic algorithm selection (Ring, Tree, CollNet) based on message size and topology.

Communication Algorithms

Ring

Each GPU forwards data to its neighbor, forming a logical ring. Advantages: high bandwidth utilization on intra‑node NVLink; simple implementation. Drawbacks: latency grows linearly with the number of GPUs, making it less suitable for large multi‑node clusters.

Tree (Double‑Binary Tree)

Introduced in NCCL 2.4, the tree algorithm reduces latency to O(log N) by building a hierarchical communication pattern. It achieves better scaling across many nodes but may have slightly lower bandwidth than Ring on small clusters.

CollNet

Built on NVIDIA’s SHARP in‑network computing, CollNet provides ultra‑low latency and near‑NVLink bandwidth for very large clusters, but requires specialized hardware.

Channels and Path Selection

NCCL creates multiple logical Channels that map to independent CUDA blocks, allowing parallel data movement across several hardware links. The Path module selects the highest‑bandwidth, lowest‑latency physical route (NV# > PIX > PXB > PHB > SYS) before Channels are allocated.
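
These path labels are the same ones used in the GPU topology matrix, which can be inspected before a run (output is machine‑dependent and omitted here):

nvidia-smi topo -m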

Protocol Layers

Simple: Large‑message protocol that uses memory fences to guarantee ordering; maximizes throughput, but the fences make it costly for small messages.

LL (Low‑Latency): Sends 8‑byte lines consisting of 4 bytes of data plus a 4‑byte flag; polling the flag avoids fences, which is ideal for tiny messages, but usable bandwidth is limited.

LL128: Sends 128‑byte lines (120 bytes of data plus an 8‑byte flag); combines low latency with high bandwidth, but requires hardware, such as NVLink, that guarantees 128‑byte atomic writes.
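
Both the protocol and the algorithm can be pinned per run with environment variables, which makes A/B comparisons straightforward; for example, forcing LL128 with the Tree algorithm in nccl-tests:

NCCL_PROTO=LL128 NCCL_ALGO=Tree ./all_reduce_perf -b 8 -e 128M -f 2 -g 8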

Physical Transport – GPU Direct RDMA

GPU Direct RDMA (GDR) allows NICs to read and write GPU memory directly, eliminating the staging copy through host memory. The data still crosses PCIe, but skipping the host‑memory bounce dramatically reduces latency and increases bandwidth for inter‑node GPU communication.
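
Whether GDR actually engages depends on how close the NIC and GPU sit in the PCIe topology; NCCL's NCCL_NET_GDR_LEVEL variable controls the maximum allowed distance, and the debug log reports which transport was chosen (exact log text varies by version):

# Permit GDR even across the CPU root complex (levels: LOC, PIX, PXB, PHB, SYS)
NCCL_NET_GDR_LEVEL=SYS NCCL_DEBUG=INFO ./all_reduce_perf -b 8 -e 128M -f 2 -g 8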

Performance Testing with nccl‑tests

Example command to benchmark an 8‑GPU node using the Ring algorithm and the Simple protocol:

NCCL_DEBUG_FILE=./nccl.log NCCL_TOPO_DUMP_FILE=./topo.xml NCCL_PROTO=Simple NCCL_ALGO=Ring ./all_reduce_perf -b 1M -e 2048M -f 2 -g 8

The output reports algorithm bandwidth (algbw) and bus bandwidth (busbw). For a 2048 MiB payload the best bus bandwidth reached ~288 GB/s. Taking 900 GB/s as the theoretical per‑GPU bandwidth, a ring all‑reduce over 8 GPUs can use at most 900 × (8−1)/8 = 787.5 GB/s of it, so the observed utilization is about 36 % (288 ÷ 787.5).
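
nccl-tests derives busbw from algbw using an operation‑specific factor; for all_reduce the documented relation is busbw = algbw × 2(n−1)/n. A quick sanity check of the numbers above:

#include <stdio.h>

// Sanity-check the bandwidth arithmetic from the run above.
// nccl-tests convention for all_reduce: busbw = algbw * 2*(n-1)/n.
int main(void) {
    const int n = 8;                   // GPUs in the job
    const double busbw = 288.0;        // GB/s, best measured bus bandwidth
    const double per_gpu_peak = 900.0; // GB/s, theoretical per-GPU bandwidth
    double algbw = busbw * n / (2.0 * (n - 1));     // invert the busbw relation
    double ring_bound = per_gpu_peak * (n - 1) / n; // ring-limited share of peak
    printf("algbw ~ %.1f GB/s, utilization ~ %.1f%%\n",
           algbw, 100.0 * busbw / ring_bound);
    return 0;
}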

Analysis

The gap between measured and theoretical bandwidth indicates room for optimization, such as improving topology‑aware channel allocation, using CollNet where available, or tuning message sizes to trigger the LL128 protocol for medium‑sized messages.
