Why Adding Unnecessary Sorting Can Triple Your x86 Code Speed – A Deep Dive into Performance Metrics
This article explores x86 performance optimization by comparing a simple sum‑of‑array loop with and without a pre‑sort step, demonstrating how branch prediction and cache behavior can make seemingly redundant code run up to three times faster, and outlines practical benchmarking principles and common pitfalls.
Introduction
High‑performance software development on x86/Linux combines programming techniques, hardware architecture, operating‑system behavior, and compiler knowledge. The author uses C examples to illustrate principles rather than language specifics.
What Is "Performance"?
Performance is fundamentally a time metric: completing a given workload in less time. Common proxies such as instruction count, cache hits, or CPU cycles are useful but not sufficient alone. The article emphasizes measuring actual execution time with hardware counters.
Example: Summing Values Below 128
A random array of uint32_t values (0‑255) is generated. Two approaches are compared:
Directly iterate and sum values < 128.
Sort the array first, then iterate.
The relevant code snippets are:
uint32_t i = 0;
for (; i < ARRAY_LEN; i++) {
array[i] = rand() % 256;
} #ifdef PRE_SORT
qsort(&array, ARRAY_LEN, sizeof(uint32_t), qsort_cmp);
#endif
uint64_t sum = 0;
for (uint32_t ite_count = 0; ite_count < 10000; ite_count++) {
for (i = 0; i < ARRAY_LEN; i++) {
if (array[i] < 128) {
sum += array[i];
}
}
}Benchmark Results
Compiled with default gcc:
# gcc -DPRE_SORT branch.c
# ./a.out
Sum: 20787670000, CPU cycles: 4869610134Without sorting:
# gcc branch.c
# ./a.out
sum: 20787670000, CPU cycles: 14611881138The sorted version runs roughly three times faster despite executing more instructions and branches, because branch‑prediction failures drop dramatically after sorting.
Using perf to Analyse
For the unsorted run:
5,902,900,466 instructions # 0.42 insns per cycle
1,312,563,125 branches # 359.846 M/sec
326,150,404 branch‑misses # 24.85% of all branchesFor the sorted run:
5,939,795,977 instructions # 1.26 insns per cycle
1,321,069,122 branches # 1084.608 M/sec
406,006 branch‑misses # 0.03% of all branchesThe dramatic reduction in branch‑miss rate explains the speedup.
Benchmarking Principles
Reproducibility : Run tests repeatedly, minimize system noise (other workloads, I/O, NUMA effects).
Cover Typical Execution Paths : Design inputs that reflect real‑world usage rather than corner cases.
Simplicity : Small, focused benchmarks give clear feedback and keep developers motivated.
Common Optimization Pitfalls
Optimizing before functional correctness leads to wasted effort.
Relying on intuition can be misleading; empirical measurement is essential.
Blindly applying generic tricks (prefetch, core affinity, hugepages) without understanding the workload may degrade performance.
Conclusion
The article demonstrates that performance engineering is an iterative, measurement‑driven process. By using tools like perf, understanding branch prediction, and following disciplined benchmarking practices, developers can achieve substantial speedups even with seemingly counter‑intuitive changes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
