Fundamentals 21 min read

Unlocking CPU Speed: How Cache Hierarchy and False Sharing Impact Performance

This article explains CPU cache levels, cache line concepts, coherence protocols, and how cache‑friendly or cache‑unfriendly code patterns—including false sharing and stride access—affect program performance on modern multicore processors.

Open Source Linux

Dec 15, 2021

Unlocking CPU Speed: How Cache Hierarchy and False Sharing Impact Performance

Regardless of the code you write, the CPU executes it, so understanding cache hierarchy is essential for high‑performance programs.

Fundamental Knowledge

Modern CPUs have multiple cores and several cache levels. Old CPUs had two levels (L1 and L2); newer ones add a third level (L3). L1 is split into instruction and data caches, while L2 and L3 are unified.

L1 cache has separate instruction and data lines; L2/L3 store both.

L1 and L2 are per‑core; L3 is shared among all cores.

The closer a cache is to the CPU, the smaller and faster it is.

Typical access latencies are roughly 4 CPU cycles for L1, 11 cycles for L2, 39 cycles for L3, and about 107 cycles for main memory (RAM).

Because L1/L2 are only a few kilobytes and L3 a few megabytes, the CPU must balance capacity and speed.

Cache Hit

A cache line is the smallest unit transferred between memory and cache, usually 64 bytes (sometimes 32 or 128 bytes). A CPU loads data in whole cache lines, not byte‑by‑byte.

Cache placement uses associativity. Simple mapping lets any address map to any line (high flexibility but O(n) search). A hash‑based method computes (address mod N) * lineSize to find the line, requiring uniform address distribution. Set‑associative (N‑Way) mapping groups N lines together, reducing search cost.

For an 8‑Way L1 cache of 32 KB, there are 512 lines, 64 per way, each way holding 64 × 64 = 4096 bytes.

Tag: upper bits of the address (e.g., 24 bits) identify the line.

Index: middle bits select the set (e.g., 6 bits for 64 sets).

Offset: lower bits locate data inside the line.

This mapping is analogous to a hash table: O(1) index followed by a small linear search within the set.

Cache Consistency

Write‑back stores updates only in cache and flushes later; write‑through writes to both cache and memory immediately. Modern CPUs use write‑back for speed.

When a core updates a value, other cores must see the change. Two common hardware protocols are:

Directory protocol : a centralized controller tracks the state of each cache line in main memory.

Snoopy protocol : caches broadcast their actions on a shared bus, allowing others to “snoop” and react.

Most modern CPUs employ snoopy‑based designs because they avoid the bottleneck of a central directory.

Cache‑coherence states are often described by the MESI protocol (Modified, Exclusive, Shared, Invalid). More advanced protocols such as MOESI or MESIF add Owner or Forward states to reduce unnecessary memory traffic.

Program Performance

Example 1 – Loop Stride

const int LEN = 64*1024*1024;
int *arr = new int[LEN];
for (int i = 0; i < LEN; i += 2) arr[i] *= i;
for (int i = 0; i < LEN; i += 8) arr[i] *= i;

Both loops run at similar speed because the CPU loads 64‑byte cache lines; changing the stride does not reduce the number of cache line loads.

Example 2 – Stride vs. Cache Conflict

for (int i = 0; i < 10000000; i++) {
    for (int j = 0; j < size; j += increment) {
        memory[j] += j;
    }
}

When the stride exceeds the cache‑line size, accesses map to the same set, causing conflict misses and a sharp increase in execution time.

Example 3 – Row vs. Column Traversal

const int row = 1024;
const int col = 512;
int matrix[row][col];
int sum_row = 0;
for (int r = 0; r < row; r++)
    for (int c = 0; c < col; c++)
        sum_row += matrix[r][c];
int sum_col = 0;
for (int c = 0; c < col; c++)
    for (int r = 0; r < row; r++)
        sum_col += matrix[r][c];

Row‑major traversal stays within a cache line, while column‑major jumps to a new line each iteration, making the latter tens of times slower.

Example 4 – False Sharing

void fn(int* data) {
    for (int i = 0; i < 10*1024*1024; ++i)
        *data += rand();
}
int p[32];
int *p1 = &p[0];
int *p2 = &p[1]; // or &p[30]
thread t1(fn, p1);
thread t2(fn, p2);

When two threads write to variables that reside on the same cache line (e.g., p[0] and p[1]), the line must be constantly synchronized, causing a ~5× slowdown compared to writing to separate lines (e.g., p[0] and p[30]).

Example 5 – Reducing False Sharing with Local Variables

void thread_func(int id) {
    result[id] = 0;
    int chunk = total_size / nthread + 1;
    int start = id * chunk;
    int end = min(start + chunk, total_size);
    int c = 0; // local accumulator
    for (int i = start; i < end; ++i)
        if (test_data[i] % 2 != 0) ++c;
    result[id] = c;
}

Using a thread‑local accumulator avoids writing to the same cache line from multiple threads, dramatically improving scaling as thread count grows.

Performance Graph

The graph shows that the cache‑friendly version (orange line) scales almost linearly up to the number of physical cores, while the cache‑unfriendly version (gray line) plateaus.

Source: 程序员cxuan – https://mp.weixin.qq.com/s/s9w--YRkyAvQi4LQcenq4g

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

multithreading CPU cache false sharing cache hierarchy

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Fundamental Knowledge

Cache Hit

Cache Consistency

Program Performance

Example 1 – Loop Stride

Example 2 – Stride vs. Cache Conflict

Example 3 – Row vs. Column Traversal

Example 4 – False Sharing

Example 5 – Reducing False Sharing with Local Variables

Performance Graph

Open Source Linux

How this landed with the community

Was this worth your time?

0 Comments

Example 1 – Loop Stride

Example 2 – Stride vs. Cache Conflict

Example 3 – Row vs. Column Traversal

Example 4 – False Sharing

Example 5 – Reducing False Sharing with Local Variables