Investigation of Query‑Diff Precision Differences Caused by CPU Instruction‑Set Variations (AVX vs SSE)

A detailed case study shows how a 1% precision difference discovered by query‑diff testing was traced to CPU instruction‑set discrepancies (AVX vs SSE), highlighting the impact of hardware‑level floating‑point optimizations on algorithmic results and providing practical debugging and mitigation guidelines.

Baidu Intelligent Testing

Dec 9, 2015

In Baidu's Quality Assurance team, QA engineers routinely discover, locate, and drive the proper fixing of bugs. This "Golden BUG" article compiles several real cases of bug discovery, analysis, and resolution.

1. Query‑diff testing reveals a problem

Query‑diff is a common testing method for retrieval systems that sends identical queries to a baseline version and a test version of a module, then compares the returned results. In this case, module A (written in C++) outputs a single‑precision float Q. After an upgrade, query‑diff showed a precision difference of about 1% (max diff in the ten‑thousandths place), even though the upgrade was expected to be diff‑free.
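The comparison step of a query-diff run can be sketched as follows. This is a minimal illustration, not the team's actual tool: the names `DiffStats` and `compare_scores` and the `tolerance` parameter are assumptions introduced here, and the sketch assumes the Q values for the same queries have already been collected into two vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of one query-diff run (hypothetical structure for illustration).
struct DiffStats {
    std::size_t total = 0;       // number of compared results
    std::size_t differing = 0;   // results whose |baseline - candidate| exceeds the tolerance
    float max_abs_diff = 0.0f;   // largest absolute difference seen
};

// Compares the Q values returned by the baseline and the test version
// for the same list of queries, position by position.
DiffStats compare_scores(const std::vector<float>& baseline,
                         const std::vector<float>& candidate,
                         float tolerance) {
    assert(baseline.size() == candidate.size());  // same queries on both sides
    DiffStats stats;
    stats.total = baseline.size();
    for (std::size_t i = 0; i < baseline.size(); ++i) {
        const float diff = std::fabs(baseline[i] - candidate[i]);
        if (diff > tolerance) ++stats.differing;
        if (diff > stats.max_abs_diff) stats.max_abs_diff = diff;
    }
    return stats;
}
```

With a tolerance of zero, every bit-level difference is reported, which matches the "diff-free" expectation described above; a small non-zero tolerance would instead hide differences of the size seen in this case.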

2. Deep investigation

The team first considered two main directions: environment or program. Environment checks (configuration, vocabularies, compilation tools) were ruled out. Program checks eliminated random strategies, thread or process cache issues, and variable conversion problems. Running the new version in both the baseline and the test environment still reproduced the diff, which indicated that the cause was not in the program code itself.

Re‑examining the definition of "environment" led to checking the compilation environment and runtime environment. After recompiling both versions on the same local machine, the diff persisted, ruling out compilation factors. Copying both environments onto the same machine and applying heavy load made the diff disappear, pointing to a runtime‑environment factor.

Further runtime checks confirmed the OS (CentOS 4.3) was identical on both machines. Disk and memory differences were deemed unlikely. CPU comparison revealed that the new machine used a Xeon E5‑2620 while the old one used a Xeon E5645. Testing the new version on a machine with the old CPU eliminated the diff, identifying the CPU as the culprit.

3. Uncovering the truth

The team then compared the two CPUs in detail. After ruling out core count, thread count, and cache size, they focused on the instruction set: AVX (Advanced Vector Extensions) was present on the new CPU but not on the old one, while both supported SSE.

Supplementary Knowledge 1: CPU Instruction Sets

An instruction set is the collection of machine instructions a CPU can execute. Instructions fall into two broad classes: SISD (single instruction, single data) and SIMD (single instruction, multiple data). SIMD extensions such as SSE and AVX let a single instruction process multiple data elements simultaneously, which greatly benefits data‑intensive computations.

Further investigation showed that the static library libX (used indirectly by module A) was built with SSE optimizations and Intel's Math Kernel Library (MKL), and did not use AVX explicitly. However, Intel documentation confirmed that MKL itself includes AVX‑optimized code paths, and that the fused multiply‑add (FMA) instructions introduced alongside AVX2 can improve both performance and precision for floating‑point operations.

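Supplementary Knowledge 2: Floating‑Point Storage

Both float (IEEE‑754 R32.24) and double (IEEE‑754 R64.53) consist of a sign bit, an exponent, and a mantissa. A float has 1 sign bit, 8 exponent bits, and 23 stored mantissa bits (24 bits of precision with the implicit leading 1); a double has 1 sign bit, 11 exponent bits, and 52 stored mantissa bits (53 bits of precision).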
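To make the float layout concrete, the following sketch splits a 32-bit float into its three fields. The names `FloatBits`, `decompose`, and `check_examples` are hypothetical helpers introduced for illustration, and the code assumes the platform's `float` is the 32-bit IEEE-754 format.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The three fields of a 32-bit IEEE-754 float (hypothetical helper type).
struct FloatBits {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 stored bits (the leading 1 is implicit)
};

FloatBits decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t raw = 0;
    std::memcpy(&raw, &value, sizeof raw);  // well-defined way to read the bit pattern
    return FloatBits{raw >> 31, (raw >> 23) & 0xFFu, raw & 0x7FFFFFu};
}

// Worked examples: 1.0f is +1.0 x 2^0, and -0.15625f is -1.25 x 2^-3.
void check_examples() {
    const FloatBits one = decompose(1.0f);
    assert(one.sign == 0 && one.exponent == 127 && one.mantissa == 0);
    const FloatBits neg = decompose(-0.15625f);
    assert(neg.sign == 1 && neg.exponent == 124 && neg.mantissa == 0x200000);
}
```

Because only 24 significant bits are kept, any intermediate result that needs more bits must be rounded, and where that rounding happens depends on the instructions used.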

The x87 FPU computes with 80‑bit extended precision internally, whereas SSE/AVX instructions operate directly on 32‑bit floats. When two code paths keep different intermediate precision, rounding the final result to 32 bits can introduce a small but measurable diff.

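Thus, the diff came from the AVX code path rounding differently from the SSE path: a fused multiply‑add rounds once, where a separate multiply and add round twice, so a result can differ in its last bit. That tiny difference was amplified through the matrix‑heavy computation of Q. Intel's MKL selects its code path at run time and falls back to SSE on CPUs lacking AVX, which explains why the diff vanished on the older machine.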
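The rounding effect of a fused multiply-add can be reproduced in portable C++ with `std::fma` from `<cmath>`, without relying on any particular CPU. The sketch below is an illustration only, not the code path MKL uses; the function names are hypothetical, and the input values were chosen by hand so that the product falls exactly halfway between two floats.

```cpp
#include <cassert>
#include <cmath>

// Multiply, round the product to float, then add and round again (two roundings).
// The volatile store keeps the compiler from fusing the two steps on its own.
float mul_add_separate(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

// std::fma computes a * b + c as if exactly and rounds once at the end.
float mul_add_fused(float a, float b, float c) {
    return std::fma(a, b, c);
}

// a = 1 + 2^-12, so a * a = 1 + 2^-11 + 2^-24 exactly. The 2^-24 term does not
// fit in a float and is rounded away in the separate version, but survives
// in the fused version because the subtraction happens before rounding.
void check_fma_example() {
    const float a = 1.0f + std::ldexp(1.0f, -12);
    const float c = -(1.0f + std::ldexp(1.0f, -11));
    assert(mul_add_separate(a, a, c) == 0.0f);
    assert(mul_add_fused(a, a, c) == std::ldexp(1.0f, -24));
}
```

A single such difference is far below the ten-thousandths place, but a long chain of multiply-add steps, as in a matrix product, gives it room to grow.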

Supplementary Knowledge 3: Using SSE/AVX for Optimization

Basic version: simple loop accumulation.

SSE version: 128‑bit registers hold 4 floats; use SIMD add instructions.

AVX version: 256‑bit registers hold 8 floats; the same approach with wider registers.

Performance tests with random input arrays showed SSE achieving ~4× speedup over the basic version, while AVX reached ~8×.
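Besides speed, the three versions also differ in the order in which values are added, and floating-point addition is not associative. The sketch below emulates this in plain C++ rather than with real SSE/AVX intrinsics, so that it compiles on any platform: `lanes = 1` stands for the basic loop, `lanes = 4` for the SSE layout, and `lanes = 8` for the AVX layout. The function name `sum_in_lanes` and the left-to-right final reduction are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `data` the way a SIMD loop would: element i is added to partial sum
// number (i % lanes), and the partial sums are combined at the end.
float sum_in_lanes(const std::vector<float>& data, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<float> partial(lanes, 0.0f);
    for (std::size_t i = 0; i < data.size(); ++i) {
        partial[i % lanes] += data[i];
    }
    float total = 0.0f;
    for (float lane_sum : partial) {
        total += lane_sum;  // horizontal reduction, left to right
    }
    return total;
}
```

For the 16-element input {2^24, 1, 1, 1, 1, 1, 1, 1, -2^24, 1, 1, 1, 1, 1, 1, 1}, whose exact sum is 14, this function returns 7 with one lane, 13 with four lanes, and 14 with eight lanes: a float cannot represent 2^24 + 1, so every 1 added to a partial sum that currently holds 2^24 is lost. The input is deliberately extreme, but the same mechanism produces last-digit differences on ordinary data.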

4. Summary and Insights

Query‑diff testing uncovered a precision diff in module A’s Q value.

Root cause: CPU instruction‑set difference (AVX vs SSE) affecting floating‑point precision.

The diff is small now, but could accumulate in more complex algorithms and affect service correctness.

Other modules using similar instruction‑set optimizations should be checked.

Solutions

Ensure test environments use machines with identical CPUs.

Add hardware‑check steps before running query‑diff.

Deploy services on machines that support the required instruction set (AVX) for optimal performance and precision.

Audit other modules for hidden instruction‑set optimizations.
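A hardware-check step could compare the SIMD capabilities of the baseline and test machines before any queries are sent. The sketch below is one possible approach, assuming Linux hosts where the `flags` line of `/proc/cpuinfo` has been read from each machine into a string; the function names and the list of flags checked are assumptions for illustration, and the flag lines in the example are abbreviated, made-up samples.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Splits a "flags : fpu vme ... sse sse2 ... avx" line into individual flag names.
std::set<std::string> parse_cpu_flags(const std::string& flags_line) {
    const std::string::size_type colon = flags_line.find(':');
    std::istringstream tokens(
        colon == std::string::npos ? flags_line : flags_line.substr(colon + 1));
    std::set<std::string> flags;
    std::string flag;
    while (tokens >> flag) flags.insert(flag);
    return flags;
}

// Returns the SIMD-related flags that are present on exactly one of the two machines.
std::vector<std::string> simd_flag_mismatches(const std::string& machine_a,
                                              const std::string& machine_b) {
    static const char* const kSimdFlags[] = {
        "sse", "sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2", "fma"};
    const std::set<std::string> a = parse_cpu_flags(machine_a);
    const std::set<std::string> b = parse_cpu_flags(machine_b);
    std::vector<std::string> mismatches;
    for (const char* name : kSimdFlags) {
        if (a.count(name) != b.count(name)) mismatches.push_back(name);
    }
    return mismatches;
}

// Example with abbreviated, made-up flag lines for an SSE-only and an AVX-capable host.
void check_example() {
    const std::vector<std::string> diff = simd_flag_mismatches(
        "flags : fpu sse sse2 sse4_2", "flags : fpu sse sse2 sse4_2 avx");
    assert(diff.size() == 1 && diff[0] == "avx");
}
```

A non-empty result would be a reason to stop or to flag the run, since any diff found afterwards could come from the hardware rather than from the code under test.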

Recommendations

Consider SSE/AVX optimizations for floating‑point‑intensive code to boost efficiency.

Limit how many times SIMD‑optimized functions are applied iteratively, so that rounding differences do not accumulate into precision drift.

Extend query‑diff testing to other compatibility dimensions such as CPU, OS, and library versions.

Software engineering inevitably intertwines with hardware; differences in compilation and runtime environments can affect both performance and final computation results. Engineers therefore need to understand both the software and the hardware it runs on.

References:

https://software.intel.com/zh-cn/articles/whats-new-in-intel-mkl

https://software.intel.com/zh-cn/articles/intel-xeon-processor-e7-88004800-v3-product-family-technical-overview

https://software.intel.com/en-us/forums/topic/507004

http://www.cnblogs.com/zyl910/archive/2012/10/22/simdsumfloat.html
