Tag: communication optimization


DataFunTalk
Mar 20, 2024 · Artificial Intelligence

Challenges and Optimization Techniques for Large Language Model Training

The article outlines the resource and efficiency challenges of scaling large language models, explains data and model parallelism strategies, and details practical I/O, communication, and stability optimizations—including high‑availability storage, RDMA networking, and fault‑tolerance measures—to improve training throughput and reliability.
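As a rough illustration of the data-parallelism strategy the article covers (a generic sketch, not the article's own code; all names here are hypothetical), each worker computes a gradient on its own data shard, and averaging the per-shard gradients reproduces the full-batch gradient — which is what an all-reduce step does in a real training framework:

```python
import numpy as np

def local_gradient(w, x, y):
    # Gradient of mean squared error for a 1-D linear model y ≈ w * x.
    return np.mean(2 * (w * x - y) * x)

# Data parallelism: split one batch across two "workers" (shards),
# compute gradients locally, then average them (all-reduce in practice).
w = 0.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x
shards = [(x[:2], y[:2]), (x[2:], y[2:])]
grads = [local_gradient(w, xs, ys) for xs, ys in shards]
avg_grad = np.mean(grads)

# With equal-sized shards, the averaged gradient equals the
# gradient computed on the full batch.
full_grad = local_gradient(w, x, y)
```

The equality only holds exactly when shards are equal-sized; real frameworks weight by shard size or pad batches to keep workers balanced.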

AI Engineering · I/O optimization · Stability
13 min read
Kuaishou Tech
Jul 16, 2021 · Artificial Intelligence

Bagua: An Open‑Source Distributed Training Framework for Deep Learning

Bagua is a distributed training framework co‑developed by Kuaishou and ETH Zürich that combines algorithmic and system‑level optimizations—such as decentralized, asynchronous, and compressed communication—to achieve up to 60% higher performance than existing frameworks like PyTorch‑DDP, Horovod, and BytePS across various AI workloads.
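The compressed-communication idea mentioned in the abstract can be sketched with a minimal 1-bit gradient quantization example (a generic technique in this family, not Bagua's actual implementation; the function names are hypothetical). Each gradient element is reduced to its sign plus one shared scale factor, cutting the bytes sent per element from 32 (or 16) down to roughly 1:

```python
import numpy as np

def compress(grad):
    # 1-bit sign compression with a per-tensor scale:
    # transmit only the sign bits plus one float (mean magnitude).
    scale = np.abs(grad).mean()
    signs = np.signbit(grad)  # boolean array, 1 bit per element on the wire
    return signs, scale

def decompress(signs, scale):
    # Reconstruct each element as ±scale.
    return np.where(signs, -scale, scale)

grad = np.array([0.5, -1.5, 2.0, -1.0])
signs, scale = compress(grad)
approx = decompress(signs, scale)
```

The reconstruction is lossy, so practical systems typically pair quantization like this with error-feedback (accumulating the compression residual into the next step's gradient) to keep convergence intact.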

Bagua · GPU scaling · PyTorch
15 min read