Unlocking CPU Speed: How Cache Hierarchy and False Sharing Impact Performance
This article explains CPU cache levels, cache line concepts, coherence protocols, and how cache‑friendly or cache‑unfriendly code patterns—including false sharing and stride access—affect program performance on modern multicore processors.
Regardless of the code you write, the CPU executes it, so understanding cache hierarchy is essential for high‑performance programs.
Fundamental Knowledge
Modern CPUs have multiple cores and several cache levels. Old CPUs had two levels (L1 and L2); newer ones add a third level (L3). L1 is split into instruction and data caches, while L2 and L3 are unified.
L1 cache has separate instruction and data lines; L2/L3 store both.
L1 and L2 are per‑core; L3 is shared among all cores.
The closer a cache is to the CPU, the smaller and faster it is.
Typical access latencies are roughly 4 CPU cycles for L1, 11 cycles for L2, 39 cycles for L3, and about 107 cycles for main memory (RAM).
Because L1/L2 are only a few kilobytes and L3 a few megabytes, the CPU must balance capacity and speed.
Cache Hit
A cache line is the smallest unit transferred between memory and cache, usually 64 bytes (sometimes 32 or 128 bytes). A CPU loads data in whole cache lines, not byte‑by‑byte.
Cache placement uses associativity. Simple mapping lets any address map to any line (high flexibility but O(n) search). A hash‑based method computes (address mod N) * lineSize to find the line, requiring uniform address distribution. Set‑associative (N‑Way) mapping groups N lines together, reducing search cost.
For an 8‑Way L1 cache of 32 KB, there are 512 lines, 64 per way, each way holding 64 × 64 = 4096 bytes.
Tag: upper bits of the address (e.g., 24 bits) identify the line.
Index: middle bits select the set (e.g., 6 bits for 64 sets).
Offset: lower bits locate data inside the line.
This mapping is analogous to a hash table: O(1) index followed by a small linear search within the set.
Cache Consistency
Write‑back stores updates only in cache and flushes later; write‑through writes to both cache and memory immediately. Modern CPUs use write‑back for speed.
When a core updates a value, other cores must see the change. Two common hardware protocols are:
Directory protocol : a centralized controller tracks the state of each cache line in main memory.
Snoopy protocol : caches broadcast their actions on a shared bus, allowing others to “snoop” and react.
Most modern CPUs employ snoopy‑based designs because they avoid the bottleneck of a central directory.
Cache‑coherence states are often described by the MESI protocol (Modified, Exclusive, Shared, Invalid). More advanced protocols such as MOESI or MESIF add Owner or Forward states to reduce unnecessary memory traffic.
Program Performance
Example 1 – Loop Stride
const int LEN = 64*1024*1024;
int *arr = new int[LEN];
for (int i = 0; i < LEN; i += 2) arr[i] *= i;
for (int i = 0; i < LEN; i += 8) arr[i] *= i;Both loops run at similar speed because the CPU loads 64‑byte cache lines; changing the stride does not reduce the number of cache line loads.
Example 2 – Stride vs. Cache Conflict
for (int i = 0; i < 10000000; i++) {
for (int j = 0; j < size; j += increment) {
memory[j] += j;
}
}When the stride exceeds the cache‑line size, accesses map to the same set, causing conflict misses and a sharp increase in execution time.
Example 3 – Row vs. Column Traversal
const int row = 1024;
const int col = 512;
int matrix[row][col];
int sum_row = 0;
for (int r = 0; r < row; r++)
for (int c = 0; c < col; c++)
sum_row += matrix[r][c];
int sum_col = 0;
for (int c = 0; c < col; c++)
for (int r = 0; r < row; r++)
sum_col += matrix[r][c];Row‑major traversal stays within a cache line, while column‑major jumps to a new line each iteration, making the latter tens of times slower.
Example 4 – False Sharing
void fn(int* data) {
for (int i = 0; i < 10*1024*1024; ++i)
*data += rand();
}
int p[32];
int *p1 = &p[0];
int *p2 = &p[1]; // or &p[30]
thread t1(fn, p1);
thread t2(fn, p2);When two threads write to variables that reside on the same cache line (e.g., p[0] and p[1]), the line must be constantly synchronized, causing a ~5× slowdown compared to writing to separate lines (e.g., p[0] and p[30]).
Example 5 – Reducing False Sharing with Local Variables
void thread_func(int id) {
result[id] = 0;
int chunk = total_size / nthread + 1;
int start = id * chunk;
int end = min(start + chunk, total_size);
int c = 0; // local accumulator
for (int i = start; i < end; ++i)
if (test_data[i] % 2 != 0) ++c;
result[id] = c;
}Using a thread‑local accumulator avoids writing to the same cache line from multiple threads, dramatically improving scaling as thread count grows.
Performance Graph
The graph shows that the cache‑friendly version (orange line) scales almost linearly up to the number of physical cores, while the cache‑unfriendly version (gray line) plateaus.
Source: 程序员cxuan – https://mp.weixin.qq.com/s/s9w--YRkyAvQi4LQcenq4g
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
