AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance

AutoCCL analyzes six key NCCL performance parameters and uses coordinate descent together with an online leader‑worker architecture to adjust them automatically during training, overcoming state‑space explosion and compute‑communication interference. It achieves 1.07‑1.32× faster iteration times on models such as Phi‑2, Llama‑3.1‑8B, and VGG‑19.

Network Intelligence Research Center (NIRC)

Challenges of Tuning NCCL

NCCL contains many performance‑sensitive parameters; six of them are identified as critical: algorithm (A), protocol (P), transport (T), number of channels (NC), number of threads (NT), and chunk size (C). The combinatorial explosion of possible settings makes exhaustive search infeasible.
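The scale of that search space can be made concrete with a back‑of‑envelope count. The candidate counts below are illustrative assumptions for a single communicator, not figures from the paper; the real ranges depend on GPU count, topology, and NCCL version:

```python
# Illustrative candidate counts for the six knobs AutoCCL tunes.
# The specific numbers here are assumptions for illustration only.
knob_options = {
    "algorithm": 2,     # e.g. Ring, Tree
    "protocol": 3,      # e.g. LL, LL128, Simple
    "transport": 3,
    "num_channels": 6,  # e.g. powers of two up to 32
    "num_threads": 4,
    "chunk_size": 8,
}

search_space = 1
for count in knob_options.values():
    search_space *= count

print(search_space)  # 2 * 3 * 3 * 6 * 4 * 8 = 3456 joint settings
```

Even under these modest assumptions there are thousands of joint settings, and the effective space grows further because the best setting varies with message size and collective type, making exhaustive measurement impractical during training.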

In real DNN training, communication and computation run concurrently and compete for GPU resources, causing compute‑communication interference. Parameters tuned offline in a pure‑communication environment therefore often become sub‑optimal once the full training workload is running. Moreover, the interference pattern is difficult to predict or model analytically.

AutoCCL Automatic Tuning Approach

AutoCCL first performs an empirical analysis of NCCL’s key parameters and discovers that their impact on performance follows a unimodal (single‑peak) pattern. This observation enables the use of a coordinate‑descent method, which iteratively searches along each parameter dimension for a direction that improves performance, efficiently approaching the optimum.
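The unimodality observation is what makes a local search safe: moving along one parameter at a time and keeping any step that improves performance cannot get trapped far from the peak. The loop below is a minimal sketch of greedy coordinate descent over discrete candidate lists; it is an illustration, not AutoCCL's actual optimizer, and the `measure` callback and parameter names are hypothetical:

```python
def coordinate_descent(measure, space, start):
    """Greedy coordinate descent over a discrete parameter space.

    measure(config) -> cost (e.g. iteration latency; lower is better).
    space maps each parameter name to its ordered candidate values,
    assumed to induce a unimodal cost along each coordinate.
    """
    best = dict(start)
    best_cost = measure(best)
    improved = True
    while improved:
        improved = False
        for param, values in space.items():
            idx = values.index(best[param])
            # Probe both neighbors along this coordinate; under
            # unimodality, repeated improving steps reach the peak.
            for j in (idx - 1, idx + 1):
                if 0 <= j < len(values):
                    trial = dict(best, **{param: values[j]})
                    cost = measure(trial)
                    if cost < best_cost:
                        best, best_cost = trial, cost
                        improved = True
    return best, best_cost
```

In a synthetic test where the cost is minimized at 8 channels and 256 threads, the loop walks each coordinate to that point and stops once no neighboring setting improves the measured cost, touching far fewer configurations than an exhaustive sweep.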

To address compute‑communication interference, AutoCCL conducts online tuning during the early iterations of the DNN training task. It adopts a leader‑worker architecture: a designated GPU (Leader) runs an optimizer that executes the coordinate‑descent search, while a Coordinator atomically broadcasts any newly discovered configuration to all Workers. All nodes then switch to the new configuration for subsequent communication.
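The key requirement on the Coordinator is atomicity: every rank must observe the same configuration version for a given collective, or ranks would communicate with mismatched settings. The class below is a hypothetical single‑process sketch of that versioned‑publish semantic using a lock; the real AutoCCL coordinator operates across nodes, which this illustration does not attempt:

```python
import threading

class Coordinator:
    """Atomically publishes the leader's latest configuration so that
    every worker switches to the same version for its next collective.
    A single-process sketch; parameter names are illustrative."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._version = 0
        self._config = dict(initial)

    def publish(self, config):
        """Called by the leader's optimizer with a newly found config."""
        with self._lock:
            self._version += 1
            self._config = dict(config)

    def snapshot(self):
        """Called by each worker before a collective; returns a
        consistent (version, config) pair."""
        with self._lock:
            return self._version, dict(self._config)
```

Workers that read the same version number are guaranteed to hold identical configurations, which is the property the broadcast step needs before all nodes switch over.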

Optimization Results

Across several large language models (Phi‑2, Llama‑3.1‑8B, Yi‑1.5‑34B) and the VGG‑19 vision model, AutoCCL delivers iteration‑time improvements ranging from 1.07× to 1.32×.

Conclusion

Advantages: AutoCCL systematically analyzes NCCL’s underlying parameters and provides an efficient automatic tuning method that boosts communication performance without requiring changes to the upper‑level training framework.

Limitations: The current implementation does not adjust certain InfiniBand (IB) related parameters, which can cause training failures in some environments; and it does not re‑trigger tuning when network conditions change dramatically after the initial optimization, potentially leaving the system running a sub‑optimal configuration.

Paper Information

Title: AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training (NSDI 2025)

Authors’ affiliations: University of Science and Technology of China, Microsoft Research, Anhui Key Laboratory of Biomedical Imaging and Intelligent Processing, Hefei Comprehensive National Science Center AI Institute

Open‑source repository: https://github.com/gbxu/autoccl

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Parameter Tuning · NCCL · Distributed Deep Learning · GPU Communication · AutoCCL · Coordinate Descent
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
