Tag

gradient accumulation

1 post collected around this technical thread.

Alimama Tech
Sep 12, 2023 · Artificial Intelligence

Megatron-LLaMA: High-Performance Large Language Model Training Framework

Megatron-LLaMA is an open-source, high-performance training framework for LLaMA models. It combines tensor, pipeline, and sequence parallelism with an overlapped optimizer, scales nearly linearly (up to a 176% speedup on 32 GPUs), and maintains robust performance even under limited network bandwidth.
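Since this thread is tagged gradient accumulation, here is a minimal PyTorch sketch of that idea for context. It is illustrative only (a toy model with made-up sizes), not Megatron-LLaMA's implementation, which additionally overlaps gradient communication with backpropagation:

```python
import torch
from torch import nn

# Toy stand-ins; in real LLaMA training these would be the sharded
# model and a distributed optimizer.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

accum_steps = 4  # micro-batches accumulated per optimizer step

for step in range(16):
    x = torch.randn(8, 512)       # toy micro-batch
    target = torch.randn(8, 512)
    # Scale the loss so the accumulated gradient averages over micro-batches.
    loss = loss_fn(model(x), target) / accum_steps
    loss.backward()               # gradients accumulate in param.grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()          # one update per effective (large) batch
        optimizer.zero_grad()
```

The effective batch size here is `accum_steps` times the micro-batch size, which lets memory-constrained GPUs simulate large-batch training at the cost of fewer optimizer steps per sample processed.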

DeepSpeed · GPU optimization · LLaMA
10 min read