How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of NVIDIA’s Collective Communication Library (NCCL), detailing its role in multi‑GPU and multi‑node AI distributed training, topology discovery, path selection, channel search, and various collective communication operations.

Previous Articles

AWS's cutting-edge AI infrastructure at re:Invent 2024

GPU basic operating principles

GPU chip architecture history

GPU virtualization technology principles

8‑GPU server and NVLink/NVSwitch interconnect technology

NVIDIA InfiniBand AI high‑performance network

RDMA high‑performance communication technology principles

RoCEv2 high‑performance transport protocol and lossless network

Large‑scale RDMA AI networking innovation: algorithm and programmable hardware deep integration

GPU/CUDA development chronology: from 3D rendering to AI large‑model era (Part 1)

AI Distributed Training

In early AI model training, a single GPU could hold the model, parameters, and data, so one GPU was enough. Modern large‑scale AI models require splitting the model across thousands of GPUs, making multi‑GPU and multi‑node distributed training inevitable.

The Birth of NCCL

In the 1990s, the explosion of supercomputing and distributed computing needs made traditional single‑node computing insufficient for large scientific simulations and data processing. Developers needed a standardized communication interface for cross‑node process communication and coordination.

Initially, supercomputer vendors (e.g., IBM, Cray) offered proprietary communication libraries, which suffered from poor code compatibility and high porting costs. In 1994, the Message Passing Interface (MPI) was introduced to unify message‑passing standards and address fragmentation. MPI was designed for CPU‑centric architectures.

OpenMPI, an open‑source implementation of the MPI standard, later emerged and became widely adopted, supporting both point‑to‑point and collective communication.

Point‑to‑point communication: supports blocking and non‑blocking modes.

Collective communication: supports all‑reduce, all‑to‑all, all‑gather, etc.

By 2009, OpenMPI was mature, but MPI was not designed for GPU architectures and lacked latency and bandwidth considerations for GPU parallel computing.

Therefore, in 2015 NVIDIA released NCCL (NVIDIA Collective Communication Library), which adopts MPI‑style collective‑communication semantics but is implemented and optimized specifically for NVIDIA GPUs, enabling efficient multi‑GPU and multi‑node communication in distributed AI training.

Note that MPI and NCCL are not alternatives but complementary: CPU communication is handled by MPI, while GPU communication is handled by NCCL.

The NVIDIA Collective Communication Library (NCCL) implements multi‑GPU and multi‑node communication primitives optimized for NVIDIA GPUs and Networking.

Core Functions of NCCL

Within the deep‑learning software stack, NCCL's northbound side interfaces with AI frameworks such as PyTorch, PaddlePaddle, and TensorFlow, which call the NCCL library to control GPU‑to‑GPU data transfer. The southbound side interfaces with the CUDA libraries, which ultimately drive communication on the GPU devices.

Collective Communication Verbs API : NCCL provides various collective communication operations for upper‑level AI frameworks.

Point‑to‑point Communication Verbs API : NCCL also supports send/receive point‑to‑point communication.

GPU/CUDA Communication Control : NCCL calls CUDA APIs to control GPU communication; CUDA programs can seamlessly use NCCL for high‑efficiency data transfer and synchronization.

RNIC/RDMA Data Transfer : NCCL invokes the RDMA API to perform cross‑node RNIC data transfer (GPU Direct RDMA).

GPU Communication Adaptive Topology : NCCL detects and adapts to hardware topologies, including intra‑node GPU, NVLink, NVSwitch, PCIe Switch, CPU, as well as inter‑node RNIC topologies, automatically selecting the best path.

GPU Communication Path Auto‑Optimization : Based on topology, NCCL automatically chooses the highest‑performance communication path, preferring NVLink within a node over slower RNIC links.

Installation and Deployment

NCCL is primarily developed and optimized for Linux and is integrated with AI frameworks on Linux; it does not support Windows. NCCL can be installed either as a standalone package or as part of the NVIDIA HPC SDK bundle. Download address: https://developer.nvidia.com/nccl

Note: In some environments NCCL may already be installed together with CUDA or PyTorch, so verify its presence before manual installation.

The NVIDIA HPC SDK bundle includes many development and analysis tools.

The standalone package is installed manually; this article focuses on that method.

Manual installation steps:

Download NCCL matching your CUDA version, CPU architecture, and Linux distribution; both offline and online packages are available.

$ uname -a                      # kernel version and CPU architecture
$ cat /etc/redhat-release       # Linux distribution and release
$ lspci -nn | grep -i nvidia    # installed NVIDIA GPUs

Install the offline package using rpm or dpkg.

If using an online repository, install via yum/apt.

# For Ubuntu:
sudo apt install libnccl2=2.25.1-1+cuda12.8 libnccl-dev=2.25.1-1+cuda12.8

# For RHEL/CentOS:
sudo yum install libnccl-2.25.1-1+cuda12.8 libnccl-devel-2.25.1-1+cuda12.8 libnccl-static-2.25.1-1+cuda12.8

libnccl : NCCL shared library.

libnccl-static : NCCL static library.

libnccl-devel : NCCL development package (headers and tools).

Install MPI, which handles the CPU‑side control communication (for example, broadcasting the NCCL unique ID) in the examples below.

sudo apt-get update
sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev
mpirun --version

Verify installation: if the following Python command prints a version number, NCCL is successfully installed and integrated with PyTorch.

python -c "import torch; print(torch.cuda.nccl.version())"

Basic Concepts of NCCL

rank : Globally unique identifier for each process in a multi‑node training job; usually one rank per GPU.

node : A physical server, typically containing 8 GPUs.

nnodes : Number of nodes in the job.

node_rank : Unique identifier of a node within the job.

local_rank : Identifier of a process within a node (e.g., 0‑7 on an 8‑GPU server).

nproc_per_node : Number of rank processes per node, usually 8.

nranks / WORLD_SIZE : Total number of ranks across all nodes (how these identifiers relate is sketched after this list).

group : Process group, used when multiple groups are needed.

communicator : A set of ranks that can communicate with each other; a rank may belong to multiple communicators.
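
To make the relationship between these identifiers concrete, the following minimal sketch derives the global rank and world size from node_rank, nproc_per_node, and local_rank. It is illustrative only; in practice launchers such as torchrun or mpirun assign the actual values.

#include <stdio.h>

int main(void) {
  int nnodes = 2;            /* number of nodes in the job           */
  int nproc_per_node = 8;    /* ranks (GPUs) per node                */
  int node_rank = 1;         /* this node's index within the job     */
  int local_rank = 3;        /* this process's index within the node */

  int world_size = nnodes * nproc_per_node;            /* nranks / WORLD_SIZE */
  int rank = node_rank * nproc_per_node + local_rank;  /* global rank         */

  printf("WORLD_SIZE=%d rank=%d (node %d, GPU %d)\n",
         world_size, rank, node_rank, local_rank);     /* -> WORLD_SIZE=16 rank=11 */
  return 0;
}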

NCCL Programming Example

An official NCCL demo program includes ten key steps; the main function is heavily commented.

Source: https://github.com/NVIDIA/nccl

#include <stdio.h>
#include "cuda_runtime.h"
#include "nccl.h"
#include "mpi.h"
#include <unistd.h>
#include <stdint.h>

#define MPICHECK(cmd) do { \
  int e = cmd; \
  if (e != MPI_SUCCESS) { \
    printf("Failed: MPI error %s:%d '%d'
", __FILE__, __LINE__, e); \
    exit(EXIT_FAILURE); \
  } \
} while(0)

#define CUDACHECK(cmd) do { \
  cudaError_t e = cmd; \
  if (e != cudaSuccess) { \
    printf("Failed: Cuda error %s:%d '%s'
", __FILE__, __LINE__, cudaGetErrorString(e)); \
    exit(EXIT_FAILURE); \
  } \
} while(0)

#define NCCLCHECK(cmd) do { \
  ncclResult_t r = cmd; \
  if (r != ncclSuccess) { \
    printf("Failed, NCCL error %s:%d '%s'
", __FILE__, __LINE__, ncclGetErrorString(r)); \
    exit(EXIT_FAILURE); \
  } \
} while(0)

static uint64_t getHostHash(const char* string) {
  uint64_t result = 5381;
  for (int c = 0; string[c] != '\0'; c++) {
    result = ((result << 5) + result) ^ string[c];
  }
  return result;
}

static void getHostName(char* hostname, int maxlen) {
  gethostname(hostname, maxlen);
  for (int i = 0; i < maxlen; i++) {
    if (hostname[i] == '.') { hostname[i] = '\0'; return; }
  }
}

int main(int argc, char* argv[]) {
  int size = 32*1024*1024;
  int myRank, nRanks, localRank = 0;
  // 1. Initialize MPI on each node.
  MPICHECK(MPI_Init(&argc, &argv));
  MPICHECK(MPI_Comm_rank(MPI_COMM_WORLD, &myRank));
  MPICHECK(MPI_Comm_size(MPI_COMM_WORLD, &nRanks));
  // 2. Compute hostname hash.
  uint64_t hostHashs[nRanks];
  char hostname[1024];
  getHostName(hostname, 1024);
  hostHashs[myRank] = getHostHash(hostname);
  // 3. Gather all hostname hashes.
  MPICHECK(MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, hostHashs, sizeof(uint64_t), MPI_BYTE, MPI_COMM_WORLD));
  // Determine localRank based on identical hashes.
  for (int p = 0; p < nRanks; p++) {
    if (p == myRank) break;
    if (hostHashs[p] == hostHashs[myRank]) localRank++;
  }
  int nDev = 1;
  // 4. Allocate buffers and create CUDA streams.
  float** sendbuff = (float**)malloc(nDev * sizeof(float*));
  float** recvbuff = (float**)malloc(nDev * sizeof(float*));
  cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev);
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(localRank*nDev + i));
    CUDACHECK(cudaMalloc(sendbuff + i, size * sizeof(float)));
    CUDACHECK(cudaMalloc(recvbuff + i, size * sizeof(float)));
    CUDACHECK(cudaMemset(sendbuff[i], 1, size * sizeof(float)));
    CUDACHECK(cudaMemset(recvbuff[i], 0, size * sizeof(float)));
    CUDACHECK(cudaStreamCreate(s+i));
  }
  // 5. Declare the NCCL unique ID and per-device communicators.
  ncclUniqueId id;
  ncclComm_t comms[nDev];
  // 6. Rank0 generates NCCL unique ID and broadcasts it.
  if (myRank == 0) ncclGetUniqueId(&id);
  MPICHECK(MPI_Bcast((void*)&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));
  // 7. Initialize NCCL communicators.
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; i++) {
    NCCLCHECK(ncclCommInitRank(comms+i, nRanks*nDev, id, myRank*nDev + i));
  }
  NCCLCHECK(ncclGroupEnd());
  // 8. Perform NCCL all‑reduce.
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; i++)
    NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum, comms[i], s[i]));
  NCCLCHECK(ncclGroupEnd());
  // 9. Synchronize streams.
  for (int i = 0; i < nDev; i++) CUDACHECK(cudaStreamSynchronize(s[i]));
  // 10. Cleanup.
  for (int i = 0; i < nDev; i++) {
    CUDACHECK(cudaFree(sendbuff[i]));
    CUDACHECK(cudaFree(recvbuff[i]));
    ncclCommDestroy(comms[i]);
  }
  MPICHECK(MPI_Finalize());
  printf("[MPI Rank %d] Success 
", myRank);
  return 0;
}
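
The example can be compiled and launched with commands along the following lines; this is a sketch, the source file name is a placeholder, and the exact include/library paths depend on where CUDA, NCCL, and OpenMPI are installed.

mpicc nccl_allreduce_example.c -o nccl_allreduce_example \
    -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -lnccl
# one process per GPU, e.g. two 8-GPU nodes:
mpirun -np 16 -H node1:8,node2:8 ./nccl_allreduce_example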

Basic Workflow of NCCL

Topology Discovery : Build a topology consisting of GPUs, NICs, CPUs, PCIe switches, NVLink, NVSwitch, etc.

Path Selection : Search for optimal rings or trees.

GPU Connection : Connect GPUs via intra‑node PCIe/NVLink and inter‑node GPU Direct RDMA.

Communication Operations : Execute all‑reduce and other collective operations.

NCCL Network Initialization

The AI program loads NCCL and first performs environment initialization, which includes:

Initialize NCCL unique ID (rank0 calls ncclGetUniqueId()).

Broadcast the unique ID to all ranks via MPI.

Each rank calls ncclCommInitRank() with the same ID.

Initialize device resources by calling ncclCudaLibraryInit() to load the CUDA driver and obtain GPU device IDs.

Bootstrap Network Initialization

NCCL uses two networks:

Bootstrap network : Carries control‑plane information between ranks during initialization over ordinary TCP sockets (the unique ID itself is typically distributed out of band, e.g. via MPI); the data volume is tiny.

Training data network : Transfers training data; prefers NVLink within a node and GPU Direct RDMA across nodes.

Bootstrap initialization steps:

Rank0 generates a unique ID, starts a bootstrapRoot thread, and selects a network interface (or uses NCCL_SOCKET_IFNAME).

Rank0 broadcasts the ID via MPI_Bcast; other ranks receive it and call ncclCommInitRank.

During ncclCommInitRank, each rank collects its own socket address and exchanges it with rank0, forming a ring of socket addresses.

ncclResult_t ncclGetUniqueId(ncclUniqueId* out) {
  NCCLCHECK(ncclInit());
  NCCLCHECK(PtrCheck(out, "GetUniqueId", "out"));
  return bootstrapGetUniqueId(out);
}

Data Network Initialization

NCCL supports InfiniBand and RoCEv2. Initialization loads libibverbs.so to determine RDMA capability and enumerates RNIC devices, recording device‑to‑port mappings, IB/RoCE enable flags, PCIe paths, etc. The NCCL_IB_HCA environment variable can restrict NCCL to specific RNICs.
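
For example, the RNICs and the bootstrap interface can be pinned explicitly through environment variables; the device and interface names below are placeholders for your own system.

export NCCL_IB_HCA=mlx5_0,mlx5_1      # restrict NCCL to specific RDMA devices (names are system-specific)
export NCCL_SOCKET_IFNAME=eth0        # network interface used by the bootstrap network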

Topology Discovery and Graph Construction

NCCL discovers devices (GPUs, NICs, CPUs, PCIe switches, NVLink, NVSwitch) and records them in an XML description. The XML module creates ncclXmlNode structures with parent/child relationships and attributes such as bus ID and sysfs path. The Topo module converts the XML into a topology graph ( ncclTopoSystem).

Call ncclTopoAddCpu() to recursively discover GPUs and NICs for each NUMA node.

Call ncclTopoAddNvLinks(), ncclTopoAddC2c(), and ncclTopoConnectCpus() to add link information.

After bootstrap network setup, bootstrapAllGather synchronizes topology information across ranks.

The topology NCCL discovers can be cross‑checked on the host with:

$ nvidia-smi topo -m

Path Selection and Channel Search

NCCL supports Ring, Tree, and CollNet algorithms. Path selection chooses the physical route with highest bandwidth and lowest latency (NVLink > PCIe > Host Bridge > System). Channel search builds multiple virtual channels on top of the selected path to increase parallelism.
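
The decisions NCCL makes here can be observed, and to some extent overridden, through environment variables; a typical sketch (the values shown are illustrative):

export NCCL_DEBUG=INFO                 # log initialization, topology, and algorithm choices
export NCCL_DEBUG_SUBSYS=GRAPH,TUNING  # focus the log on path/channel search and tuning
export NCCL_ALGO=Ring                  # force a specific algorithm (Ring/Tree/CollNet) instead of auto-selection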

Communication Algorithms

Ring : GPUs form a closed loop; high bandwidth utilization but latency grows linearly with GPU count.

Tree (Double binary tree) : Introduced in NCCL 2.4; logarithmic latency, suitable for large multi‑node clusters, but bandwidth utilization is slightly lower than Ring.

CollNet : Built on SHARP in‑network computing; provides near‑NVLink bandwidth across nodes but requires NVIDIA‑specific hardware.

Path Types

NV# : Direct NVLink connection.

PIX : Single PCIe bridge.

PXB : Multiple PCIe bridges without crossing the host bridge.

PHB : Traverses a PCIe host bridge.

NODE : Within a NUMA node, but traversing the interconnect between PCIe host bridges.

SYS : Traverses inter‑NUMA links (QPI/UPI).
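
These path types also serve as thresholds for NCCL's tuning variables, for example (values illustrative):

export NCCL_P2P_LEVEL=NVL       # allow GPU peer-to-peer only over NVLink
export NCCL_NET_GDR_LEVEL=PXB   # use GPU Direct RDMA only when GPU and NIC are within a few PCIe bridges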

Channel Search

Channels are virtual communication lanes. NCCL searches for the best set of channels based on the chosen path, creating multiple Ring channels that may share the same optimal NVLink routes.

Ring Channel 1: G0 → G4 → G7 → G6 → G5 → G1 → G2 → G3 → G0
Ring Channel 2: G0 → G4 → G7 → G6 → G5 → G1 → G2 → G3 → G0
Ring Channel 3: G0 → G4 → G7 → G6 → G5 → G1 → G2 → G3 → G0
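
If the automatic search picks too few or too many channels, the channel count can be bounded with environment variables (values illustrative):

export NCCL_MIN_NCHANNELS=4     # lower bound on the number of channels
export NCCL_MAX_NCHANNELS=16    # upper bound on the number of channels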

GPU Connection in NCCL

Allocate send and receive buffers for each channel.

Establish bidirectional links forming a ring.

Use CUDA streams and events for asynchronous communication, overlapping computation and data transfer.

Collective Communication Operations

Broadcast – distribute data from one GPU to all others.

Scatter – split data from one GPU to many.

Gather – collect data from many GPUs to one.

All‑gather – each GPU receives data from all others.

Reduce – aggregate data onto a single GPU.

All‑reduce – aggregate and broadcast the result to all GPUs (most common for gradient synchronization).

Reduce‑scatter – reduce then distribute partial results.

Send/Receive – point‑to‑point communication (see the sketch after this list).
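
Point‑to‑point send/receive was added in NCCL 2.7 and is grouped the same way as collectives. The fragment below is a minimal sketch: it reuses the NCCLCHECK/CUDACHECK macros from the earlier example and assumes that comm, stream, the device buffers sendbuf/recvbuf, and the rank numbers peer/count have already been set up.

// Exchange 'count' floats with a peer rank. Wrapping the send and receive in
// ncclGroupStart/ncclGroupEnd lets NCCL pair the operations without deadlock.
NCCLCHECK(ncclGroupStart());
NCCLCHECK(ncclSend(sendbuf, count, ncclFloat, peer, comm, stream));
NCCLCHECK(ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream));
NCCLCHECK(ncclGroupEnd());
CUDACHECK(cudaStreamSynchronize(stream));   // wait until the transfer has completed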
