AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance
AutoCCL identifies six performance‑critical NCCL parameters and adjusts them automatically during training, using a coordinate‑descent search driven by an online leader‑worker architecture. This design overcomes the state‑space explosion and compute‑communication interference problems, yielding 1.07‑1.32× faster iteration times on models such as Phi‑2, Llama‑3.1‑8B, and VGG‑19.
Challenges of Tuning NCCL
NCCL contains many performance‑sensitive parameters; six of them are identified as critical: algorithm (A), protocol (P), transport (T), number of channels (NC), number of threads (NT), and chunk size (C). The combinatorial explosion of possible settings makes exhaustive search infeasible.
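As a rough illustration (not the paper's actual interface), the six dimensions can be related to standard NCCL environment variables, sketched below in Python. Note that NCCL normally reads these variables only when a communicator is initialized, whereas AutoCCL adjusts the corresponding settings at run time through its own mechanism; the transport and chunk-size mappings here are assumptions, and the example values are placeholders.

```python
import os

# Illustrative mapping of the six tuning dimensions onto common NCCL
# environment variables; the exact knobs AutoCCL manipulates may differ.
def apply_nccl_config(env: dict, config: dict) -> None:
    env["NCCL_ALGO"] = config["algorithm"]            # A: e.g. "Ring" or "Tree"
    env["NCCL_PROTO"] = config["protocol"]            # P: e.g. "Simple", "LL", "LL128"
    env["NCCL_P2P_LEVEL"] = config["transport"]       # T: intra-node transport (assumed knob)
    env["NCCL_MIN_NCHANNELS"] = str(config["num_channels"])   # NC
    env["NCCL_MAX_NCHANNELS"] = str(config["num_channels"])
    env["NCCL_NTHREADS"] = str(config["num_threads"])          # NT
    env["NCCL_BUFFSIZE"] = str(config["chunk_size_bytes"])     # C (assumed mapping)

candidate = {
    "algorithm": "Ring",
    "protocol": "Simple",
    "transport": "NVL",          # placeholder; valid values depend on the cluster topology
    "num_channels": 16,
    "num_threads": 512,
    "chunk_size_bytes": 4 << 20,
}
apply_nccl_config(os.environ, candidate)
```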
In real DNN training, communication and computation run concurrently and fiercely compete for GPU resources, causing compute‑communication interference. Parameters tuned offline in a pure‑communication environment often become sub‑optimal when the full training workload is executed. Moreover, the interference pattern is difficult to predict and model.
AutoCCL Automatic Tuning Approach
AutoCCL first performs an empirical analysis of NCCL’s key parameters and discovers that their impact on performance follows a unimodal (single‑peak) pattern. This observation enables the use of a coordinate‑descent method, which iteratively searches along each parameter dimension for a direction that improves performance, efficiently approaching the optimum.
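A minimal sketch of this kind of unimodal coordinate-descent search is shown below. It assumes a hypothetical measure_iteration_time(config) callback that runs a few training iterations under a candidate configuration and returns the average iteration time; the parameter names and value grids are illustrative, not the paper's exact search space.

```python
# Sketch of coordinate descent over a discrete parameter grid, exploiting
# the unimodal (single-peak) behavior observed along each dimension.
SEARCH_SPACE = {
    "num_channels": [2, 4, 8, 16, 32],
    "num_threads": [64, 128, 256, 512],
    "chunk_size_bytes": [1 << 19, 1 << 20, 2 << 20, 4 << 20, 8 << 20],
}

def coordinate_descent(config, measure_iteration_time):
    best_time = measure_iteration_time(config)
    improved = True
    while improved:
        improved = False
        for param, values in SEARCH_SPACE.items():
            idx = values.index(config[param])
            # Walk along this axis in whichever direction improves iteration
            # time, and stop as soon as the measured time gets worse.
            for step in (+1, -1):
                moved = False
                i = idx + step
                while 0 <= i < len(values):
                    trial = dict(config, **{param: values[i]})
                    t = measure_iteration_time(trial)
                    if t < best_time:
                        best_time, config = t, trial
                        improved = moved = True
                        i += step
                    else:
                        break
                if moved:
                    break  # unimodal along this axis: no need to try the other direction
    return config, best_time
```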
To address compute‑communication interference, AutoCCL conducts online tuning during the early iterations of the DNN training task. It adopts a leader‑worker architecture: a designated GPU (Leader) runs an optimizer that executes the coordinate‑descent search, while a Coordinator atomically broadcasts any newly discovered configuration to all Workers. All nodes then switch to the new configuration for subsequent communication.
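The sketch below gives a rough idea of how a leader rank could push a newly found configuration to all workers between iterations, assuming PyTorch with torch.distributed already initialized. This is only an illustration of the leader‑worker pattern; AutoCCL's actual coordinator operates at the NCCL level rather than through the training framework, and LEADER_RANK and sync_config are hypothetical names.

```python
import torch.distributed as dist

LEADER_RANK = 0  # assumed: rank 0 hosts the leader/optimizer

def sync_config(local_config):
    """Broadcast the leader's latest configuration so every rank switches
    to it before the next collective call (illustrative only)."""
    payload = [local_config if dist.get_rank() == LEADER_RANK else None]
    # broadcast_object_list sends arbitrary Python objects from src to all ranks.
    dist.broadcast_object_list(payload, src=LEADER_RANK)
    return payload[0]

# Usage sketch inside the early (tuning) training iterations:
# new_config = sync_config(candidate_config)
# apply_nccl_config(os.environ, new_config)  # see the earlier sketch
```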
Optimization Results
Across several large language models (Phi‑2, Llama‑3.1‑8B, Yi‑1.5‑34B) and the VGG‑19 vision model, AutoCCL delivers iteration‑time improvements ranging from 1.07× to 1.32×.
Conclusion
Advantages: AutoCCL systematically analyzes NCCL’s underlying parameters and provides an efficient automatic tuning method that boosts communication performance without requiring changes to the upper‑level training framework.
Limitations: The current implementation does not tune certain InfiniBand‑related parameters, since incorrect settings for them can cause training failures; and it does not re‑trigger tuning when network conditions change significantly after the initial optimization, which can leave the system running with a sub‑optimal configuration.
Paper Information
Title: AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training (NSDI 2025)
Authors’ affiliations: University of Science and Technology of China, Microsoft Research, Anhui Key Laboratory of Biomedical Imaging and Intelligent Processing, Hefei Comprehensive National Science Center AI Institute
Open‑source repository: https://github.com/gbxu/autoccl