Tagged articles
5 articles
Network Intelligence Research Center (NIRC)
Nov 1, 2025 · Artificial Intelligence

AutoCCL: Automatic NCCL Tuning to Boost Distributed Deep Learning Performance

AutoCCL analyzes NCCL’s six key performance parameters and tunes them automatically during training, using coordinate descent and an online leader‑worker architecture to overcome state‑space explosion and compute‑communication interference. It achieves 1.07‑1.32× faster iteration times on models such as Phi‑2, Llama‑3.1‑8B and VGG‑19.

AutoCCL · Coordinate Descent · Distributed Deep Learning
5 min read
Alibaba Cloud Big Data AI Platform
Apr 24, 2023 · Artificial Intelligence

How Alibaba’s TePDist Automates Distributed Deep Learning for Large Models

Alibaba Cloud’s PAI platform unveils TePDist, an HLO‑based automatic distributed deep‑learning system that decouples strategy search from model code, uses a client/server architecture, and supports SPMD and pipeline parallelism. It delivers high performance on GPT, MoE and other models and is now open source.

AI Infrastructure · Distributed Deep Learning · HLO IR
4 min read
AntTech
Jul 13, 2020 · Artificial Intelligence

ElasticDL: An Open‑Source Distributed Deep Learning Framework with Elastic Scheduling

ElasticDL is an open‑source distributed deep learning framework built on TensorFlow 2.x and Kubernetes that simplifies programming by letting users define models with the Keras API, while providing elastic scheduling, fault tolerance, and significant performance gains demonstrated through extensive benchmarks.

Distributed Deep Learning · ElasticDL · Kubernetes
19 min read
Alibaba Cloud Native
May 12, 2020 · Artificial Intelligence

Boosting Cloud‑Native AI Training with Alluxio: Performance Tuning on Kubernetes

This article examines the challenges of large‑scale deep‑learning model training on Kubernetes, analyzes performance bottlenecks caused by Alluxio‑FUSE integration, and presents a series of configuration and system‑level optimizations that dramatically improve data‑access speed and overall training throughput.

AI training · Alluxio · Cloud Native
22 min read
AntTech
Sep 11, 2019 · Artificial Intelligence

ElasticDL: An Open‑Source Elastic Deep Learning System Built on TensorFlow 2.0 and Kubernetes

ElasticDL, the first industry‑level open‑source system for elastic deep learning on TensorFlow, leverages Kubernetes‑native scheduling, fault tolerance, and TensorFlow 2.0 Eager Execution to dramatically improve cluster utilization, simplify distributed training, and integrate seamlessly with tools like Kubeflow and SQLFlow.

Distributed Deep Learning · ElasticDL · Kubernetes
13 min read