Tag

elastic-training


DataFunSummit
Jul 1, 2023 · Artificial Intelligence

Alibaba Cloud Native Deep Learning Platform PAI‑DLC: Architecture, Features, and Future Outlook

This article introduces Alibaba Cloud's PAI‑DLC, a cloud‑native deep learning platform that integrates containerized services, AI‑aware scheduling, GPU virtualization, elastic training with EasyScale, optimized data access, and observability, and discusses its architecture, key features, and future directions.

AI Platform · Deep Learning · GPU virtualization
16 min read
DataFunSummit
Apr 26, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This talk describes Huya’s elastic distributed training system, covering the motivation behind elasticity, its design using Kubernetes and ETCD for dynamic node registration and scaling, implementation details of the EFDL framework, performance evaluations on ResNet‑50, and the resulting benefits and future directions.

AI Platform · GPU scheduling · Huya
11 min read
DataFunTalk
Apr 23, 2022 · Artificial Intelligence

Elastic Distributed Training at Huya: Design, Implementation, and Results

This article describes Huya's elastic distributed training system, explaining why elasticity is needed, the architectural design using Kubernetes and ETCD, the dynamic scaling process, performance evaluations on ResNet‑50, and future improvements for more efficient and reliable AI model training.

AI Platform · GPU scheduling · distributed training
10 min read
DataFunTalk
Feb 17, 2022 · Cloud Native

ByteDance's Cloud‑Native Transformation of Its Machine Learning Platform

This article explains how ByteDance rebuilt its machine‑learning platform on cloud‑native principles, detailing the motivations, the migration from YARN to Kubernetes, the implementation of PS‑Worker and AllReduce training frameworks, unified operators, heterogeneous resource scheduling, elastic training, and future directions for large‑scale AI workloads.

Resource Scheduling · cloud-native · elastic-training
15 min read