Fundamentals 14 min read

Why Adding Unnecessary Sorting Can Triple Your x86 Code Speed – A Deep Dive into Performance Metrics

This article explores x86 performance optimization by comparing a simple sum‑of‑array loop with and without a pre‑sort step, demonstrating how branch prediction and cache behavior can make seemingly redundant code run up to three times faster, and outlines practical benchmarking principles and common pitfalls.

ITPUB

Aug 22, 2017

Why Adding Unnecessary Sorting Can Triple Your x86 Code Speed – A Deep Dive into Performance Metrics

Introduction

High‑performance software development on x86/Linux combines programming techniques, hardware architecture, operating‑system behavior, and compiler knowledge. The author uses C examples to illustrate principles rather than language specifics.

What Is "Performance"?

Performance is fundamentally a time metric: completing a given workload in less time. Common proxies such as instruction count, cache hits, or CPU cycles are useful but not sufficient alone. The article emphasizes measuring actual execution time with hardware counters.

Example: Summing Values Below 128

A random array of uint32_t values (0‑255) is generated. Two approaches are compared:

Directly iterate and sum values < 128.

Sort the array first, then iterate.

The relevant code snippets are:

uint32_t i = 0;
for (; i < ARRAY_LEN; i++) {
    array[i] = rand() % 256;
}

#ifdef PRE_SORT
    qsort(&array, ARRAY_LEN, sizeof(uint32_t), qsort_cmp);
#endif
uint64_t sum = 0;
for (uint32_t ite_count = 0; ite_count < 10000; ite_count++) {
    for (i = 0; i < ARRAY_LEN; i++) {
        if (array[i] < 128) {
            sum += array[i];
        }
    }
}

Benchmark Results

Compiled with default gcc:

# gcc -DPRE_SORT branch.c
# ./a.out
Sum: 20787670000, CPU cycles: 4869610134

Without sorting:

# gcc branch.c
# ./a.out
sum: 20787670000, CPU cycles: 14611881138

The sorted version runs roughly three times faster despite executing more instructions and branches, because branch‑prediction failures drop dramatically after sorting.

Using perf to Analyse

For the unsorted run:

5,902,900,466 instructions # 0.42 insns per cycle
1,312,563,125 branches # 359.846 M/sec
326,150,404 branch‑misses # 24.85% of all branches

For the sorted run:

5,939,795,977 instructions # 1.26 insns per cycle
1,321,069,122 branches # 1084.608 M/sec
406,006 branch‑misses # 0.03% of all branches

The dramatic reduction in branch‑miss rate explains the speedup.

Benchmarking Principles

Reproducibility : Run tests repeatedly, minimize system noise (other workloads, I/O, NUMA effects).

Cover Typical Execution Paths : Design inputs that reflect real‑world usage rather than corner cases.

Simplicity : Small, focused benchmarks give clear feedback and keep developers motivated.

Common Optimization Pitfalls

Optimizing before functional correctness leads to wasted effort.

Relying on intuition can be misleading; empirical measurement is essential.

Blindly applying generic tricks (prefetch, core affinity, hugepages) without understanding the workload may degrade performance.

Conclusion

The article demonstrates that performance engineering is an iterative, measurement‑driven process. By using tools like perf, understanding branch prediction, and following disciplined benchmarking practices, developers can achieve substantial speedups even with seemingly counter‑intuitive changes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization x86 C Programming CPU cycles Benchmarking branch prediction

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.