Why DeepSeek V4 Insists on Batch Invariance—and What It Costs
DeepSeek V4 supports ultra-long context and a complex training pipeline through custom high-performance kernels that enforce batch invariance, a design that guarantees bit-wise identical outputs across varying batch shapes but costs GPU utilization, small-batch speed, and engineering simplicity.
Definition of batch invariance
Batch invariance means that, for a given token, the output is bit-wise identical regardless of its position in the batch, the batch size, or the other tokens processed together. The technical report states that the core purpose is to guarantee reproducibility across pre-training, fine-tuning, reinforcement learning (RL), and inference, keeping all stages aligned.
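As a concrete illustration, a minimal check of this property might look like the sketch below. It uses PyTorch purely as a stand-in (DeepSeek's kernels are custom); whether the final check passes depends on the hardware and on which kernels the library dispatches to.

```python
# Minimal sketch of the batch-invariance property, using PyTorch as a stand-in.
# Whether the final check passes depends on hardware and kernel selection.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

layer = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(1, 1024, device=device)        # the token we care about
others = torch.randn(63, 1024, device=device)  # unrelated batch-mates

y_alone = layer(x)                                    # token processed on its own
y_batched = layer(torch.cat([x, others], dim=0))[:1]  # same token inside a larger batch

# Batch invariance demands bit-wise equality, not merely closeness:
print(torch.equal(y_alone, y_batched))
```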
Why batch invariance matters
Online services use dynamic batching, so the same user request may be grouped with different requests at different times. Without batch invariance, the same prompt could produce different answers because the reduction order inside the kernels changes with batch composition (a sketch follows this list). Batch invariance therefore ensures:
Stable inference results despite dynamic batching.
Alignment between pre‑training, SFT, RL, on‑policy distillation and inference, making it easier to attribute behavior changes to data, RL, distillation, quantization, or batch organization.
Improved reproducibility and debuggability, because numerical discrepancies can be traced to batch arrangement.
A reliable foundation for complex long‑context systems.
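The root cause is basic floating-point arithmetic: addition is not associative, so a kernel that splits a reduction differently for different batch shapes produces different bits. A plain NumPy sketch (arbitrary values, nothing DeepSeek-specific) makes the point:

```python
# Floating-point addition is not associative, so how a reduction is split
# determines the bits of the result. Arbitrary data, NumPy only.
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(1_000_000).astype(np.float32)

one_pass = v.sum(dtype=np.float32)                                      # one reduction tree
chunked = sum(c.sum(dtype=np.float32) for c in np.array_split(v, 8))    # a different tree

print(one_pass == chunked)   # often False: same numbers, different addition order
```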
Engineering sacrifices
Maintaining batch invariance imposes noticeable costs:
Reduced GPU utilization from wave quantization (see the arithmetic sketch after this list).
Lower throughput for small batches or short sequences.
Reduced compatibility with native operators.
Limited freedom to apply certain sparse‑acceleration tricks.
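The first cost comes from wave quantization: a kernel launch executes in "waves" of thread blocks across the SMs, and a fixed, batch-invariant tiling cannot always fill the last wave. A back-of-the-envelope sketch, with illustrative numbers rather than DeepSeek's measurements:

```python
# Back-of-the-envelope view of wave quantization: thread blocks run in waves
# across the SMs, and a partially filled last wave wastes capacity.
# Illustrative numbers only.
import math

def utilization(num_tiles, num_sms=132):      # 132 SMs, roughly an H100-class GPU
    waves = math.ceil(num_tiles / num_sms)    # each wave occupies the whole GPU
    return num_tiles / (waves * num_sms)

print(f"{utilization(132):.2f}")  # 1.00: tiles fill exactly one wave
print(f"{utilization(133):.2f}")  # 0.50: one extra tile forces a nearly empty second wave
```

A shape-flexible kernel could pick a different tile size to refill the last wave; a batch-invariant one often cannot, which is where the utilization loss comes from.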
Optimizations that conflict with batch invariance
DeepSeek V4 deliberately avoids two common performance tricks:
split-KV: distributes one attention computation over the key/value sequence across multiple SMs, changing the parallel reduction path and breaking bit-wise consistency.
split-K: partitions the reduction (K) dimension of a GEMM, altering the order of floating-point additions and likewise breaking batch invariance (sketch below).
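To see why split-K in particular breaks bit-wise consistency, consider a GEMM whose K dimension is partitioned into partial products that are summed afterwards. The NumPy code below is a stand-in for a kernel, not DeepSeek's implementation:

```python
# Splitting the reduction (K) dimension changes the order in which partial
# products are accumulated, so the result can differ at the bit level.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 4096)).astype(np.float32)
B = rng.standard_normal((4096, 64)).astype(np.float32)

full = A @ B                                     # single reduction over K

parts = np.array_split(np.arange(4096), 4)       # "split-K" with 4 partitions
split_k = sum(A[:, idx] @ B[idx, :] for idx in parts)

print(np.array_equal(full, split_k))             # typically False bit-for-bit
```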
Engineering work‑arounds
To keep batch invariance while still handling different GPU load conditions, DeepSeek introduces a dual‑kernel strategy on the attention side: two separate programs handle the cases where the GPU is fully utilized and where it is not, guaranteeing identical results.
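A dispatch layer for that strategy could look roughly like the sketch below. The kernel names and the saturation flag are hypothetical; the report only says that two programs cover the saturated and unsaturated cases. The essential constraint is that both paths share the same per-token reduction order, so the choice affects speed but never the output bits.

```python
# Hypothetical dispatch for the dual-kernel idea: two code paths tuned for
# different occupancy regimes, both pinned to one fixed reduction order.
import numpy as np

def _attention_fixed_order(q, k, v):
    # The single reduction order both kernels must reproduce exactly.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def attention_low_occupancy(q, k, v):
    # A real kernel would spread work over idle SMs; numerics must not change.
    return _attention_fixed_order(q, k, v)

def attention_high_occupancy(q, k, v):
    # A real kernel would use fewer, larger work units; numerics must not change.
    return _attention_fixed_order(q, k, v)

def attention(q, k, v, gpu_saturated: bool):
    kernel = attention_high_occupancy if gpu_saturated else attention_low_occupancy
    return kernel(q, k, v)
```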
For matrix multiplication, the generic cuBLAS library is replaced by a custom batch-invariant kernel, DeepGEMM. These choices increase engineering complexity: many tasks that would normally rely on standard libraries now require bespoke kernels and stricter computation paths.
Trade‑off summary
By giving up split-KV, split-K, native-operator compatibility, and some sparse-acceleration freedom, DeepSeek V4 gains:
Bit‑wise reproducibility across training, inference and RL stages.
Stable long‑context, agent and RL training.
Exact alignment of results across multi‑node, multi‑GPU runs.
References
https://x.com/teortaxesTex/status/2048707398886404524?s=20
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf