Why Sorting an Array Speeds Up Summation: CPU Pipeline, Hazards, and Branch Prediction Explained
The article examines a puzzling StackOverflow case where sorting a random array before summation yields a six‑fold speedup. It explains the phenomenon through the fundamentals of the classic five‑stage CPU pipeline and its structural, data, and control hazards, and shows how operand forwarding and branch prediction mitigate the performance loss.
Interesting Question
While browsing StackOverflow, the author found a curious question where a C++ program sums an array much faster after sorting it, with a six‑times performance gain.
#include &lt;algorithm&gt;
#include &lt;ctime&gt;
#include &lt;iostream&gt;

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << '\n';
    std::cout << "sum = " << sum << '\n';
}

The unsorted run took about 11.54 seconds, while the sorted run took only 1.93 seconds. A similar test in Java showed comparable results.
The Car‑Wash Analogy
The author likens the CPU pipeline to a car‑wash where cars move through stages (spray, foam, brush, wipe, dry) in a pipeline fashion, increasing throughput compared to processing each car sequentially.
CPU Internals
A CPU consists of registers, a control unit (CU), an arithmetic‑logic unit (ALU), caches, and three main buses (address, data, control) that connect it to RAM and I/O devices.
Five‑Stage Pipeline
The classic five stages are:
Instruction Fetch (IF)
Instruction Decode (ID)
Execute (EX)
Memory Access (MEM)
Write‑Back (WB)
Each stage processes a different instruction simultaneously, similar to the car‑wash pipeline.
Pipeline Hazards
Three types of hazards can stall the pipeline:
Structural Hazard : hardware resource conflict, e.g., IF and MEM both need the same memory port. Solution: separate instruction and data memories (Harvard architecture).
Data Hazard : an instruction depends on the result of an earlier one still in the pipeline, e.g.

    int a = 10;
    int b = a + 10; // depends on a
    int c = b + a;  // depends on a and b

Solution: insert NOPs (pipeline bubbles) or use operand forwarding to pass results directly between stages.
Control Hazard : branch instructions change the flow of execution. The pipeline must flush instructions fetched after a mispredicted branch, incurring a penalty of 10‑20 cycles.
Branch Prediction
Two categories exist:
Static prediction: a fixed rule, e.g. always assume not‑taken (≈50 % accuracy on hard‑to‑predict branches).
Dynamic prediction: use branch history to guess. A common 2‑bit predictor has four states (strongly/weakly taken or not‑taken) and flips its prediction only after two consecutive mispredictions, achieving >95 % accuracy in many workloads.
Connecting Back to the Original Problem
Random data plus many loop‑condition checks trigger frequent branch mispredictions, causing pipeline flushes and large slowdowns. Sorting the array makes the branch condition ( data[c] >= 128 ) highly predictable, allowing the dynamic predictor to stay in a "taken" or "not‑taken" state and dramatically reducing stalls, which explains the observed speedup.
Conclusion
The article starts from a StackOverflow performance mystery, explains it with fundamental CPU pipeline concepts, describes structural, data, and control hazards and their mitigations, and shows how branch prediction, especially the 2‑bit predictor, turns a seemingly paradoxical speedup into an expected outcome.
IT Services Circle