
Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs

A recently discovered bug in popular LLM libraries shows that gradient accumulation can introduce significant accuracy loss compared to true large‑batch training, especially when sequence lengths vary, and the issue can be fixed by correcting the loss denominator scaling.

Baobao Algorithm Notes

Oct 20, 2024

Background

When training large language models (LLMs) on GPUs with limited memory, gradient accumulation is used to simulate a larger batch size. Gradients from several mini‑batches are accumulated, with each step's loss scaled so that the total is an average, and the model weights are updated only once per accumulation cycle.
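The equivalence that gradient accumulation is meant to provide can be checked on a toy problem. The sketch below is an illustration only: it assumes a single‑weight model with squared‑error loss (a hypothetical stand‑in for an LLM), equal‑sized mini‑batches, and a manually derived gradient.

```python
# Toy model: one weight w, per-example loss (w * x - y) ** 2.
def mean_grad(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]

# True large-batch gradient over all four examples.
full_grad = mean_grad(w, data)

# Gradient accumulation: K = 2 mini-batches of equal size.
mini_batches = [data[:2], data[2:]]
K = len(mini_batches)
accumulated_grad = 0.0
for batch in mini_batches:
    # Scale each step by 1/K so the accumulated total is an average.
    accumulated_grad += mean_grad(w, batch) / K

# With equal-sized mini-batches the two gradients coincide.
assert abs(full_grad - accumulated_grad) < 1e-12
```

The match relies on every mini‑batch contributing the same number of items to the mean, which is the assumption that breaks once the number of valid tokens varies from step to step.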

Observed Issue

Many open‑source libraries (including HuggingFace’s) implement gradient accumulation by averaging the per‑token cross‑entropy loss within each mini‑batch and then averaging those per‑batch means over the accumulation steps. This procedure does **not** scale the denominator (the count of non‑padding tokens) correctly: every mini‑batch mean receives the same weight regardless of how many valid tokens it contains, so tokens from short output sequences are overweighted and tokens from long sequences are underweighted. The bug leads to a noticeable degradation in fine‑tuning accuracy.

Mathematical Explanation

For a full batch, the cross‑entropy loss is

L_full = - (1 / N) * Σ_{i=1}^{N} log p_i

where N is the total number of valid (non‑ignored) tokens in the batch and p_i is the probability the model assigns to the correct token at position i.

With gradient accumulation over K steps, the buggy implementation computes

L_acc = - (1 / K) * Σ_{k=1}^{K} (1 / n_k) * Σ_{i∈batch_k} log p_i

Here n_k is the number of valid tokens in mini‑batch k. Every token in mini‑batch k is therefore weighted by 1 / (K * n_k), whereas true large‑batch training weights every token by 1 / Σ n_k. The two agree only when all n_k are equal. When the counts differ (e.g., n₁ = 10 and n₂ = 1000 with K = 2), each token in the short mini‑batch gets weight 1/20 instead of 1/1010, about 50 times too much, while each token in the long mini‑batch gets weight 1/2000, about half of what it should. The loss contributed by long sequences is compressed while that of short sequences is inflated, breaking the equivalence with true large‑batch training.
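To make the distortion concrete, the following sketch computes how much weight a single token receives under the mean‑of‑means scheme compared with a true large batch, for two mini‑batches with 10 and 1000 valid tokens. The function name `token_weights` is made up for this illustration.

```python
def token_weights(valid_token_counts):
    """Return (buggy, correct) per-token loss weights for each mini-batch."""
    K = len(valid_token_counts)
    total = sum(valid_token_counts)
    # Mean-of-means: a token in mini-batch k is weighted 1 / (K * n_k).
    buggy = [1 / (K * n) for n in valid_token_counts]
    # True large batch: every token is weighted 1 / (sum of all n_k).
    correct = [1 / total for _ in valid_token_counts]
    return buggy, correct

counts = [10, 1000]
buggy, correct = token_weights(counts)
ratios = [b / c for b, c in zip(buggy, correct)]

for n, r in zip(counts, ratios):
    print(f"n_k = {n}: each token gets {r:.3f}x its large-batch weight")
```

A ratio above 1 means tokens in that mini‑batch are overweighted; a ratio below 1 means they are underweighted. With equal counts every ratio is exactly 1.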

Correct Fix

The denominator must be the total number of valid tokens across all accumulation steps:

L_correct = - (1 / Σ_{k=1}^{K} n_k) * Σ_{k=1}^{K} Σ_{i∈batch_k} log p_i
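One way to see the difference is to rewrite the correct loss as a weighted average of the per‑mini‑batch mean losses. The symbol \bar{L}_k is notation introduced here for the mean loss of mini‑batch k; it does not appear in the original derivation.

```latex
\begin{aligned}
\bar{L}_k &= -\frac{1}{n_k} \sum_{i \in \text{batch}_k} \log p_i, \\
L_{\text{correct}}
  &= -\frac{1}{\sum_{j=1}^{K} n_j} \sum_{k=1}^{K} \sum_{i \in \text{batch}_k} \log p_i
   = \sum_{k=1}^{K} \frac{n_k}{\sum_{j=1}^{K} n_j}\, \bar{L}_k, \\
L_{\text{acc}}
  &= \sum_{k=1}^{K} \frac{1}{K}\, \bar{L}_k .
\end{aligned}
```

The correct loss weights each mini‑batch mean by its share of the valid tokens, n_k / Σ n_j, while the accumulated loss uses the uniform weight 1 / K. The two coincide exactly when every n_k equals the average count.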

Implementation‑wise, each mini‑batch should return both the summed loss and its token count; the trainer aggregates the sums and divides **once** after the K accumulation steps.
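A minimal sketch of that aggregation, in plain Python rather than any particular trainer's API, might look like the following. The helper names and the use of `None` to mark padding positions are assumptions made for this example; the per‑token values stand in for negative log‑likelihoods.

```python
IGNORE = None  # hypothetical marker for padding / ignored positions

def batch_loss_sum_and_count(token_losses):
    """Return (summed loss, valid-token count) for one mini-batch."""
    valid = [loss for loss in token_losses if loss is not IGNORE]
    return sum(valid), len(valid)

def accumulated_loss(mini_batches):
    """Aggregate sums and counts over all steps, then divide once."""
    total_loss, total_tokens = 0.0, 0
    for token_losses in mini_batches:
        loss_sum, n_valid = batch_loss_sum_and_count(token_losses)
        total_loss += loss_sum
        total_tokens += n_valid
    return total_loss / total_tokens

def mean_of_means(mini_batches):
    """The miscalibrated variant: average the per-batch means over K steps."""
    means = []
    for token_losses in mini_batches:
        loss_sum, n_valid = batch_loss_sum_and_count(token_losses)
        means.append(loss_sum / n_valid)
    return sum(means) / len(means)

# Per-token negative log-likelihoods; None marks padding.
step_1 = [4.0, 2.0, None, None]            # 2 valid tokens
step_2 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    # 6 valid tokens

full = accumulated_loss([step_1 + step_2])   # one large batch
fixed = accumulated_loss([step_1, step_2])   # divide once at the end
buggy = mean_of_means([step_1, step_2])      # divide per step, then by K

assert fixed == full
assert buggy != full
```

In a real training loop the backward pass runs once per mini‑batch, so the total token count either has to be known before the first backward call (for example by fetching all K mini‑batches up front) or the accumulated gradient has to be rescaled before the optimizer step.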

Empirical Validation

Experiments show that after applying the corrected scaling, loss curves from gradient‑accumulated training match those from a true large‑batch run, confirming that the denominator bug was the root cause.

Practical Recommendations

Verify whether your training pipeline uses the buggy loss aggregation (see GitHub issue https://github.com/huggingface/trl/issues/2175).

If the version you are running predates the fix, either upgrade to a release that implements the corrected loss computation or avoid gradient accumulation until you can.

If you write a custom trainer, ensure the loss denominator is accumulated across steps as shown above.


