Why Gradient Accumulation Isn’t Always Equivalent to Large‑Batch Training for LLMs
A recently discovered bug in popular LLM libraries shows that gradient accumulation can introduce significant accuracy loss compared to true large‑batch training, especially when sequence lengths vary, and the issue can be fixed by correcting the loss denominator scaling.
Background
When training large language models (LLMs) on GPUs with limited memory, gradient accumulation is used to simulate a larger batch size. Gradients from several mini-batches are accumulated and the model weights are updated only once, after the last mini-batch, so that the update is meant to match the one a single large batch would produce.
Observed Issue
Many open-source libraries (including HuggingFace's) implement gradient accumulation by computing the mean per-token cross-entropy loss within each mini-batch and then averaging those means over the accumulation steps. This procedure does **not** scale the denominator (the count of non-padding tokens) correctly: every mini-batch receives the same weight regardless of how many valid tokens it contains, so short output sequences are overweighted and long sequences are underweighted. The bug leads to a noticeable degradation in fine-tuning accuracy.
Mathematical Explanation
For a full batch, the cross-entropy loss is

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log p_i$$

where $N$ is the total number of valid (non-ignored) tokens in the batch.
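With gradient accumulation over $K$ steps, the buggy implementation computes

$$L_{\text{acc}} = \frac{1}{K} \sum_{k=1}^{K} \left( -\frac{1}{n_k} \sum_{i \in \text{batch}_k} \log p_i \right)$$

Here $n_k$ is the number of valid tokens in mini-batch $k$. Because the outer division by $K$ does not account for the varying $n_k$, each token in mini-batch $k$ is weighted by $1/(K n_k)$ instead of the uniform $1/\sum_{j=1}^{K} n_j$ that the full-batch loss applies to every token. The two agree only when all $n_k$ are equal. When the token counts differ (e.g., $n_1 = 10$ and $n_2 = 1000$ with $K = 2$), each token in the short mini-batch is weighted by $1/20$ and each token in the long one by $1/2000$, whereas the full-batch loss weights every token by $1/1010$. The loss contributed by long sequences is compressed while that of short sequences is inflated, breaking the equivalence with true large-batch training.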
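To make the mismatch concrete, the two formulas can be evaluated on made-up numbers. The sketch below is plain Python and assumes hypothetical per-token losses (2.0 for every token in a 10-token mini-batch, 0.5 for every token in a 1000-token mini-batch); the values are illustrative only and do not come from the experiments discussed in this article.

```python
# Hypothetical per-token losses (-log p_i) for K = 2 mini-batches.
batch_1 = [2.0] * 10      # n_1 = 10 valid tokens, high loss
batch_2 = [0.5] * 1000    # n_2 = 1000 valid tokens, low loss
batches = [batch_1, batch_2]
K = len(batches)

# Buggy aggregation: mean per mini-batch, then mean over the K steps.
buggy_loss = sum(sum(b) / len(b) for b in batches) / K

# Full-batch loss: sum every token loss, divide once by the total token count.
total_tokens = sum(len(b) for b in batches)
correct_loss = sum(sum(b) for b in batches) / total_tokens

# Weight applied to a single token's loss under each scheme.
buggy_token_weights = [1 / (K * len(b)) for b in batches]
correct_token_weight = 1 / total_tokens

print(f"buggy loss:   {buggy_loss:.4f}")    # 1.2500
print(f"correct loss: {correct_loss:.4f}")  # 0.5149
print("buggy per-token weights:", buggy_token_weights)
print("correct per-token weight:", correct_token_weight)
```

In this toy case the 10 short-batch tokens account for half of the buggy loss even though they make up about 1% of the tokens, which is the overweighting described above.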
Correct Fix
The denominator must be the total number of valid tokens across all accumulation steps:
$$L_{\text{correct}} = -\frac{1}{\sum_{k=1}^{K} n_k} \sum_{k=1}^{K} \sum_{i \in \text{batch}_k} \log p_i$$

Implementation-wise, each mini-batch should return both the summed loss and its token count; the trainer aggregates the sums and divides **once** after the K accumulation steps.
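One way this aggregation might look in PyTorch is sketched below. It is a minimal illustration under stated assumptions, not the patch applied in any particular library: the helper name `accumulate_corrected`, the toy `torch.nn.Linear` model, and the random data are all hypothetical, and ignored positions are marked with `-100`, the default `ignore_index` of `torch.nn.functional.cross_entropy`. Each mini-batch contributes the gradient of its summed loss, and the accumulated gradient is divided once by the total valid-token count.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value excluded from the loss (e.g. padding)

def accumulate_corrected(model, micro_batches):
    """Hypothetical helper: accumulate gradients of the summed loss,
    then divide once by the total number of valid tokens."""
    model.zero_grad()
    total_tokens = 0
    for inputs, labels in micro_batches:
        logits = model(inputs)  # shape: (positions, vocab_size)
        loss_sum = F.cross_entropy(
            logits, labels, ignore_index=IGNORE_INDEX, reduction="sum"
        )
        loss_sum.backward()  # gradients add up across mini-batches
        total_tokens += int((labels != IGNORE_INDEX).sum())
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(total_tokens)  # the single division
    return total_tokens

def make_batch(n_positions, n_ignored):
    # Random toy data; the first n_ignored labels are masked out.
    x = torch.randn(n_positions, 8)
    y = torch.randint(0, 5, (n_positions,))
    y[:n_ignored] = IGNORE_INDEX
    return x, y

torch.manual_seed(0)
model = torch.nn.Linear(8, 5)  # stand-in for a language model head
micro_batches = [make_batch(4, 1), make_batch(40, 5)]  # 3 and 35 valid tokens

total = accumulate_corrected(model, micro_batches)
accumulated_grad = model.weight.grad.clone()

# Reference: one true large batch with the default mean over valid tokens.
model.zero_grad()
x_full = torch.cat([x for x, _ in micro_batches])
y_full = torch.cat([y for _, y in micro_batches])
F.cross_entropy(model(x_full), y_full, ignore_index=IGNORE_INDEX).backward()
full_batch_grad = model.weight.grad.clone()

grads_match = torch.allclose(accumulated_grad, full_batch_grad, atol=1e-5)
print("total valid tokens:", total, "| gradients match:", grads_match)
```

Dividing the gradients after the loop is only one option. If the total token count for the K mini-batches is known before the first backward pass, each summed loss can instead be divided by that total before calling `backward()`, which gives the same gradient.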
Empirical Validation
Experiments show that after applying the corrected scaling, loss curves from gradient‑accumulated training match those from a true large‑batch run, confirming that the denominator bug was the root cause.
Practical Recommendations
Verify whether your training pipeline uses the buggy loss aggregation (see GitHub issue https://github.com/huggingface/trl/issues/2175).
Upgrade to a library version that implements the corrected loss computation; until one is available for your stack, avoid gradient accumulation.
If you write a custom trainer, ensure the loss denominator is accumulated across steps as shown above.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.