Tagged articles

loss scaling bug

1 article

Oct 20, 2024 · Artificial Intelligence

Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs

A recently discovered bug in popular LLM training libraries shows that gradient accumulation can lose significant accuracy compared with true large‑batch training, especially when sequence lengths vary across the accumulated mini‑batches. The fix is to correct the denominator used to scale the loss.
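A minimal sketch of the effect described in this summary, using plain Python and made-up per-token losses (the numbers and variable names are illustrative assumptions, not taken from the article or from any library). It compares averaging each accumulated mini-batch separately against a single mean over all tokens, which is what a true large batch would compute.

```python
# Per-token losses for two accumulated mini-batches with different
# token counts (illustrative values only).
micro_batches = [
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # 6 tokens
    [5.0, 5.0],                      # 2 tokens
]


def mean(values):
    return sum(values) / len(values)


# Naive accumulation: average each mini-batch, then average the averages.
# Every mini-batch gets equal weight, regardless of how many tokens it has.
naive_loss = mean([mean(mb) for mb in micro_batches])

# Large-batch reference: one mean over every token in the full batch.
all_tokens = [t for mb in micro_batches for t in mb]
full_batch_loss = mean(all_tokens)

# Corrected accumulation: sum the per-token losses of each mini-batch and
# divide by the total token count across all accumulated mini-batches.
total_tokens = sum(len(mb) for mb in micro_batches)
fixed_loss = sum(sum(mb) for mb in micro_batches) / total_tokens

print(naive_loss, full_batch_loss, fixed_loss)
```

With equal-length mini-batches the naive and corrected values coincide, which is why the discrepancy only shows up when sequence lengths vary.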

HuggingFace · LLM training · deep learning

6 min read
