Why Do GPUs and NPUs Produce Different FP16 Results? Uncovering AI Chip Precision Secrets

Engineers training large AI models often notice visible FP16/BF16 result differences between GPUs and NPUs, and even between generations of the same chip. The root causes are the limits of floating-point representation, hardware design choices, software-library implementations, compiler optimizations, and nondeterminism in parallel execution.


Floating‑point "mathematical traps"

Floating‑point numbers are stored as finite binary approximations. A decimal such as 0.1 becomes an infinitely repeating binary fraction, so it must be rounded, and every subsequent operation introduces its own rounding error. In large‑scale AI models the dominant computation is the matrix multiply‑accumulate (GEMM). The order in which its many additions are performed is not fixed, so the tiny rounding errors combine differently on different devices: floating‑point addition is not associative, even though exact arithmetic is.
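
A few lines of NumPy (a stand‑in here; any language shows the same behavior) make both traps concrete: 0.1 has no exact binary representation, and the result of a sum depends on how the additions are grouped.

```python
import numpy as np

# 0.1 cannot be represented exactly in binary floating point.
print(f"{np.float64(0.1):.20f}")        # 0.10000000000000000555...

# Associativity breaks in FP16: same three values, two groupings, two answers.
a, b, c = np.float16(10000.0), np.float16(-10000.0), np.float16(1.0)
print((a + b) + c)                      # 1.0
print(a + (b + c))                      # 0.0 -- the 1.0 is absorbed by the large term
```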

Hardware design divergences

GPU Tensor Cores (e.g., NVIDIA's) and Huawei Ascend Cube Cores implement distinct accumulator architectures. Tensor Cores use wide, multi‑stage accumulators optimized for dense 4×4×4 matrix tiles, while Cube Cores provide a dedicated AI data‑flow path with a different accumulator bit‑width and different handling of subnormal values. These structural differences change the exact value stored after each partial sum, producing small but systematic offsets between chips running at the same nominal precision (FP16/BF16).
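
The effect of accumulator width can be mimicked in software. The sketch below is purely illustrative and does not reproduce any particular Tensor Core or Cube Core pipeline: the same FP16 inputs are reduced once with an FP16 accumulator and once with an FP32 accumulator, and the narrower accumulator lands measurably farther from the exact sum of the rounded products.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((64, 64)).astype(np.float16)
b = rng.standard_normal((64, 64)).astype(np.float16)

# Element products of every (row, column) pair, rounded to FP16.
prod = a[:, None, :] * b.T[None, :, :]            # shape (64, 64, 64), dtype float16

acc16 = prod.sum(axis=-1, dtype=np.float16)       # narrow FP16 accumulator
acc32 = prod.sum(axis=-1, dtype=np.float32).astype(np.float16)  # wide FP32 accumulator
exact = prod.sum(axis=-1, dtype=np.float64)       # exact sum of the rounded products

print(np.abs(acc16.astype(np.float64) - exact).max())  # noticeably larger error
print(np.abs(acc32.astype(np.float64) - exact).max())  # usually close to one FP16 ulp
```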

Software‑stack influence

Math libraries such as cuBLAS (NVIDIA) and CANN (Huawei) choose different GEMM blocking sizes, loop‑unrolling factors, and memory‑access patterns. A larger block size means fewer addition steps but larger intermediate sums, which shifts the rounding direction. Compilers (NVCC vs. CANN’s compiler) may also insert or omit fused‑multiply‑add (FMA) instructions and reorder independent operations. Both library and compiler decisions alter the exact sequence of floating‑point additions, magnifying the rounding error over billions of operations.
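
The blocking effect is easy to simulate. The toy reduction below keeps every partial sum in FP16 and varies only the tile size; it is a rough stand‑in for the blocking choices a GEMM library makes, not cuBLAS's or CANN's actual algorithm.

```python
import numpy as np

def blocked_fp16_sum(values, block):
    """Sum FP16 values tile by tile, keeping every partial sum in FP16."""
    total = np.float16(0.0)
    for start in range(0, len(values), block):
        partial = np.float16(0.0)
        for v in values[start:start + block]:
            partial = np.float16(partial + v)
        total = np.float16(total + partial)
    return total

rng = np.random.default_rng(2)
x = rng.standard_normal(8192).astype(np.float16)

print(blocked_fp16_sum(x, 32))          # many small tiles
print(blocked_fp16_sum(x, 1024))        # few large tiles
print(x.astype(np.float64).sum())       # FP64 reference
```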

Parallel execution nondeterminism

GPU kernels are scheduled as warps; NPU kernels are scheduled as independent processes. The scheduler can interleave thread blocks differently from run to run, so the order in which partial results are reduced varies. Cache‑coherency behavior also differs (e.g., NVIDIA's L2‑centric design vs. Ascend's shared‑memory model), producing slight timing variations; whenever a reduction relies on atomics or on whichever block finishes first, those timing shifts change the combination order and therefore the rounding path.
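
Run‑to‑run variation can be imitated by merging per‑block partial sums in a random "completion order". The snippet below is only a simulation of that effect; on real hardware the equivalent behavior comes from atomic adds and thread‑block interleaving, not from an explicit shuffle.

```python
import numpy as np

# Fixed data, but an unseeded permutation stands in for the order in which
# "thread blocks" happen to finish and merge their partial sums.
data = np.random.default_rng(3).standard_normal(65536).astype(np.float16)
partials = data.reshape(256, 256).sum(axis=1, dtype=np.float32)   # per-block partial sums

order = np.random.default_rng().permutation(len(partials))        # differs every run
total = np.float16(0.0)
for i in order:                                                   # FP16 merge, order-dependent
    total = np.float16(total + np.float16(partials[i]))

print(total)   # typically varies in the last bits from one run to the next
```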

Error accumulation "butterfly effect"

All of these chips follow the IEEE 754 round‑to‑nearest‑even rule, but the implementation of the rounding unit, the width of the internal accumulator, and the handling of denormals can make individual results differ by a few ulps (units in the last place). A single GEMM may differ by only 10⁻⁸–10⁻⁶, but after 10¹²–10¹⁴ floating‑point operations the drift can grow to the order of 10⁻³–10⁻², which is observable in model outputs. Empirically, large language models tolerate such drift as long as the final error stays below a few thousandths; convergence and inference quality remain unchanged.
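
The compounding itself is easy to reproduce in miniature. The toy stack below (random FP16 matrices, not a real model) runs the same sequence of matrix multiplies twice, changing only the in‑layer accumulation order, and the divergence between the two runs grows with depth.

```python
import numpy as np

rng = np.random.default_rng(5)
dim, depth = 128, 40
weights = [(rng.standard_normal((dim, dim)) / np.sqrt(dim)).astype(np.float16)
           for _ in range(depth)]

def layer(x, w, reverse=False):
    """One FP16 matrix-vector multiply with an explicit accumulation order."""
    prod = (x[:, None] * w).astype(np.float16)          # rounded element products, shape (k, n)
    rows = reversed(range(dim)) if reverse else range(dim)
    acc = np.zeros(dim, dtype=np.float16)
    for k in rows:                                       # FP16 running accumulator
        acc = (acc + prod[k]).astype(np.float16)
    return acc

xa = xb = rng.standard_normal(dim).astype(np.float16)
for d, w in enumerate(weights, 1):
    xa, xb = layer(xa, w), layer(xb, w, reverse=True)    # same math, different add order
    if d % 10 == 0:
        gap = np.abs(xa.astype(np.float64) - xb.astype(np.float64)).max()
        print(f"layer {d:3d}: max divergence {gap:.2e}")
```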

Practical guidance

Expect a baseline precision gap between GPUs, NPUs, and even between generations of the same chip (e.g., A100 vs. H100, Ascend A1 vs. A2).

When reproducibility is required, fix the execution order: use deterministic kernels, disable tensor‑core auto‑fusion, and enforce a single‑thread reduction path (a sketch covering this and the error‑budget check follows this list).

Validate the numerical error budget: compare model logits with a high‑precision reference (e.g., FP64) and ensure the L2 norm of the difference is < 0.001.

Choose libraries and compiler flags that expose the same GEMM blocking size across devices, or explicitly control the accumulation order with custom kernels.
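
Putting the reproducibility and error‑budget points together, a minimal sketch in PyTorch might look like the following. It assumes CUDA, and `build_model()` plus the input shape are hypothetical placeholders for an actual network; the determinism switches shown are standard PyTorch settings, while the equivalent knobs on an NPU stack depend on the vendor toolkit.

```python
import copy
import os
import torch

# Pin the execution order before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # required for deterministic cuBLAS GEMMs
torch.use_deterministic_algorithms(True)            # error out on nondeterministic kernels
torch.backends.cudnn.benchmark = False              # stop cuDNN from picking kernels by timing
torch.backends.cuda.matmul.allow_tf32 = False       # keep matmuls in true FP32/FP16, not TF32

model = build_model().cuda().eval()                 # hypothetical constructor for your network
example_batch = torch.randn(8, 1024, device="cuda") # stand-in input; use a real batch

with torch.no_grad():
    reference = copy.deepcopy(model).double()       # FP64 reference built from the same weights
    logits_fp64 = reference(example_batch.double())
    logits_fp16 = model.half()(example_batch.half())

# Relative L2 norm of the drift; the article's budget is "a few thousandths".
err = torch.linalg.norm(logits_fp16.double() - logits_fp64) / torch.linalg.norm(logits_fp64)
assert float(err) < 1e-3, f"precision drift {float(err):.2e} exceeds the error budget"
```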

In summary, precision differences in AI inference are an inevitable consequence of floating‑point representation, hardware accumulator design, library/compiler choices, and nondeterministic parallel scheduling. Controlling these factors—rather than eliminating them—keeps the error within a tolerable range for large‑model training and inference.

Tags: AI · GPU · NPU · large models · hardware design · floating-point precision
Written by Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
