Tagged articles
4 articles
NewBeeNLP
Mar 21, 2024 · Artificial Intelligence

Mastering Large Language Model Training: Key Challenges and Optimization Strategies

This article examines the resource and efficiency challenges of scaling large language model training, explains data, model, pipeline, and tensor parallelism, and provides practical I/O, communication, and stability optimization techniques—including high‑availability storage, RDMA networking, NCCL tuning, and fault‑tolerant recovery—to improve throughput and reliability.

AI Engineering · Distributed Training · I/O optimization
15 min read
DataFunTalk
Mar 20, 2024 · Artificial Intelligence

Challenges and Optimization Techniques for Large Language Model Training

The article outlines the resource and efficiency challenges of scaling large language models, explains data and model parallelism strategies, and details practical I/O, communication, and stability optimizations—including high‑availability storage, RDMA networking, and fault‑tolerance measures—to improve training throughput and reliability.

AI Engineering · I/O optimization · communication optimization
13 min read
Tencent Cloud Developer
Sep 1, 2021 · Artificial Intelligence

Why Distributed Machine Learning Accelerates AI Training at Scale

This article reviews how distributed machine learning tackles massive data and compute challenges by partitioning models and data across workers, optimizing communication with primitives, parameter servers, and Ring AllReduce, reducing IO overhead, and applying advanced optimizers such as LARS and LAMB to achieve faster, scalable training.

LAMB optimizer · LARS optimizer · Parameter Server
31 min read
Kuaishou Tech
Jul 16, 2021 · Artificial Intelligence

Bagua: An Open‑Source Distributed Training Framework for Deep Learning

Bagua is a distributed training framework co‑developed by Kuaishou and ETH Zürich that combines algorithmic and system‑level optimizations—such as decentralized, asynchronous, and compressed communication—to achieve up to 60% higher performance than existing frameworks like PyTorch‑DDP, Horovod, and BytePS across various AI workloads.

Bagua · Deep Learning · Distributed Training
15 min read