Tagged articles

NCCL

10 articles · Page 1 of 1

May 30, 2026 · Artificial Intelligence

vLLM Introduces Native RL API for Seamless Weight Synchronization

vLLM’s new native RL API introduces a four‑stage weight‑transfer protocol, pluggable backends, and a keep‑mode pause/resume mechanism that eliminates deadlocks in DPEP deployments, with large‑scale validations on SkyRL and Prime‑RL demonstrating reliability and performance gains.

CUDA IPC · Distributed Inference · NCCL

0 likes · 14 min read

AntTech

Nov 27, 2025 · Artificial Intelligence

How AMem NCCL‑Plugin Cuts GPU Memory Overhead for Trillion‑Parameter RL Models

The article explains the design, implementation, and performance of the AMem NCCL‑Plugin, a lightweight extension to NVIDIA's NCCL that enables transparent offloading and rapid recovery of GPU memory during reinforcement‑learning training of trillion‑parameter models, detailing its architecture, APIs, benchmarks, installation steps, and integration guidelines.

ASystem · GPU · NCCL

0 likes · 18 min read

AI Cyberspace

Nov 19, 2025 · Artificial Intelligence

Why MPI and NCCL Are Critical for Scaling AI Models Across Thousands of GPUs

This article explains how AI model training has evolved from single‑GPU workloads to massive distributed training using MPI for CPU‑centric communication and NCCL for GPU‑centric communication, covering their histories, core concepts, programming interfaces, topology discovery, protocol choices, and performance testing on multi‑GPU clusters.

AI distributed training · GPU communication · High-performance computing

0 likes · 71 min read

Network Intelligence Research Center (NIRC)

Nov 1, 2025 · Artificial Intelligence

AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance

AutoCCL analyzes NCCL’s six key performance parameters, uses coordinate‑descent and an online leader‑worker architecture to automatically adjust them during training, overcoming state‑space explosion and compute‑communication interference, and achieves 1.07‑1.32× faster iteration times on models such as Phi‑2, Llama‑3.1‑8B and VGG‑19.

AutoCCL · Coordinate Descent · Distributed Deep Learning

0 likes · 5 min read

AI Cyberspace

Mar 14, 2025 · Artificial Intelligence

How NCCL Accelerates Distributed AI Training on GPUs

This article explains the origins, core functions, installation steps, and programming examples of NVIDIA’s Collective Communication Library (NCCL), detailing its role in multi‑GPU and multi‑node AI distributed training, topology discovery, path selection, channel search, and various collective communication operations.

CUDA · GPU communication · MPI

0 likes · 33 min read

Bilibili Tech

May 24, 2024 · Cloud Computing

Understanding and Optimizing NCCL Collective Communication Libraries for Large‑Scale Model Training

The article explains how NCCL enables efficient large‑scale model training by parsing GPU‑to‑NIC topology and building flat‑ring and tree communication structures. It also covers improved logging and bandwidth metrics, the Ring AllReduce primitives, and proposed solutions to missing topology, metric, and mapping information for future optimization.

GPU · NCCL · Performance Optimization

0 likes · 23 min read

Architects' Tech Alliance

May 5, 2024 · Artificial Intelligence

Why InfiniBand Is the Secret Weapon for AIGC Training Performance

The article examines how InfiniBand’s specialized features—collective communication, in‑network computing, adaptive routing, congestion control, cut‑through forwarding, shallow buffering, and self‑healing—are optimized for large‑scale AI‑generated content (AIGC) training, delivering higher bandwidth, lower latency, and greater fault tolerance than Ethernet alternatives.

AI training · AIGC · Adaptive routing

0 likes · 10 min read

DataFunSummit

Apr 7, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Cloud Vertex AI: Fast Socket and Reduction Server

This article explains how Google Cloud Vertex AI improves large‑scale distributed machine learning training performance by addressing the memory‑wall challenge with Fast Socket network stack enhancements for NCCL and a Reduction Server that accelerates gradient aggregation, delivering higher throughput and lower TCO for AI workloads.

Fast Socket · GPU · NCCL

0 likes · 19 min read

DataFunTalk

Mar 17, 2022 · Artificial Intelligence

Optimizing Distributed Machine Learning Training on Google Vertex AI: Fast Socket and Reduction Server

This article explains how Google Vertex AI tackles the memory‑wall challenge of large‑scale distributed training by introducing Fast Socket, a high‑performance NCCL network stack, and a Reduction Server that halves gradient‑aggregation traffic, delivering significant speed‑up and cost‑reduction for AI workloads.

AI performance · Fast Socket · NCCL

0 likes · 19 min read

Tencent Cloud Developer

May 22, 2020 · Artificial Intelligence

Distributed Training for WeChat Scan-to-Identify Using Horovod, MPI, and NCCL

WeChat’s Scan‑to‑Identify system now trains its CNN models across multiple GPUs using Horovod’s data‑parallel, synchronous Ring All‑Reduce architecture built on MPI and NCCL, cutting training time from several days to under one day while maintaining accuracy, and future work will target I/O and further scaling.

AI · Horovod · MPI

0 likes · 12 min read
