Tagged articles
4 articles
Alibaba Cloud Big Data AI Platform
May 24, 2024 · Artificial Intelligence

How DeepRec Extension Boosts Distributed Sparse Model Training with Elasticity and Fault Tolerance

DeepRec Extension enhances large‑scale sparse model training by adding automatic elastic training, resource‑aware scheduling, real‑time monitoring, and efficient fault‑tolerance mechanisms, enabling lower cost, higher throughput, and more reliable distributed training for AI workloads.

AI Infrastructure · DeepRec · Sparse Models
13 min read
DataFunSummit
Jul 1, 2023 · Artificial Intelligence

Alibaba Cloud Native Deep Learning Platform PAI‑DLC: Architecture, Features, and Future Outlook

This article introduces Alibaba Cloud's PAI‑DLC, a cloud‑native deep learning platform that integrates machine‑learning capabilities, containerized services, AI‑aware scheduling, GPU virtualization, elastic training with EasyScale, data access, and observability, and discusses its architecture, key features, and future directions.

AI Platform · Cloud Native · Deep Learning
16 min read
DataFunSummit
Apr 26, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This talk describes Huya’s elastic distributed training system, covering the motivation behind elasticity, its design using Kubernetes and ETCD for dynamic node registration and scaling, implementation details of the EFDL framework, performance evaluations on ResNet‑50, and the resulting benefits and future directions.

AI Platform · Distributed Training · GPU scheduling
11 min read
DataFunTalk
Apr 23, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This article describes Huya's elastic distributed training system, explaining why elasticity is needed, the architectural design using Kubernetes and ETCD, the dynamic scaling process, performance evaluations on ResNet‑50, and future improvements for more efficient and reliable AI model training.

AI Platform · GPU scheduling · Kubernetes
10 min read