
Understanding CPU Cache Hierarchy, Cache Coherence, and Performance Optimization

This article explains the structure of modern CPU caches, the principles of cache lines, associativity, and coherence protocols, and demonstrates how these hardware details affect program performance through multiple code examples covering loop stride, matrix traversal, multithreading, and false sharing.

Top Architect

Basic Knowledge

Modern CPUs use multi‑level caches (L1, L2, L3) to bridge the speed gap between the processor and main memory. L1 is split into instruction and data caches, while L2 and L3 are unified. The closer a cache level is to the CPU, the smaller and faster it is.

Typical access latencies are:

L1: ~4 CPU cycles

L2: ~11 CPU cycles

L3: ~39 CPU cycles

DRAM: ~107 CPU cycles

Because L1 is roughly 27× faster than DRAM but holds only tens of kilobytes (32 KB per core is typical), the hierarchy trades capacity for speed at each level.
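The latency ladder above can be observed directly. The sketch below is our own harness (the function name and pass count are illustrative, not from the article): it times a strided pass over a working set and returns nanoseconds per access, which jumps as the working set outgrows each cache level.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Time repeated passes over a working set of `bytes` bytes and return
// average nanoseconds per access. The stride is one cache line (64 B),
// so every access touches a different line.
double ns_per_access(std::size_t bytes) {
    std::vector<std::uint8_t> buf(bytes, 1);
    volatile std::uint64_t sink = 0;   // volatile: keeps the loop from being optimized away
    const int passes = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < bytes; i += 64)
            sink += buf[i];
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (passes * (bytes / 64.0));
}
```

Comparing, say, ns_per_access(16 * 1024) (fits in L1) against ns_per_access(64 * 1024 * 1024) (spills to DRAM) should show a gap on the order of the cycle counts listed above; absolute numbers vary by machine, compiler, and prefetcher behavior.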

Cache Line and Associativity

A cache line is the smallest unit transferred between memory and cache, typically 64 bytes (16 × 32‑bit integers). The CPU maps memory addresses to cache lines using tag, index, and offset bits.

Mapping methods include:

Fully associative: any address can go to any line, but a lookup must compare the tag of every line, which is expensive in hardware.

Set‑associative (N‑Way): the address's index bits select a set, and the line can be placed in any of the N ways of that set. Example: an 8‑Way, 32 KB L1 with 64‑byte lines has 512 lines, arranged as 64 sets of 8 lines each.

Tag, set index, and byte offset are extracted from the address. For the L1 above (64‑byte lines, 64 sets): offset = address mod 64, set index = (address / 64) mod 64, and the remaining high bits form the tag.
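The decomposition can be sketched in a few lines of code. This is our own illustration (struct and function names are ours) for the 64‑byte‑line, 64‑set geometry described above:

```cpp
#include <cstdint>

// Split an address into tag / set index / byte offset for a cache with
// 64-byte lines and 64 sets (e.g. the 8-way 32 KB L1 discussed above).
struct CacheAddr {
    std::uint64_t tag;   // identifies which line occupies a way of the set
    unsigned set;        // which of the 64 sets the address maps to
    unsigned offset;     // byte position within the 64-byte line
};

CacheAddr decompose(std::uint64_t addr) {
    constexpr std::uint64_t kLineBytes = 64;  // 2^6 -> low 6 bits
    constexpr std::uint64_t kSets      = 64;  // 2^6 -> next 6 bits
    CacheAddr a;
    a.offset = static_cast<unsigned>(addr % kLineBytes);
    a.set    = static_cast<unsigned>((addr / kLineBytes) % kSets);
    a.tag    = addr / (kLineBytes * kSets);   // everything above the low 12 bits
    return a;
}
```

Note that two addresses within the same 64‑byte line always yield the same set and tag, which is why a single fetch services all of them.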

Cache Hit and Miss

Cache hit occurs when the requested data resides in the cache line identified by the tag; otherwise a miss triggers a fetch from the next level.

Cache Coherence

Two main protocols keep caches consistent across cores:

Directory protocol: a centralized directory tracks the state of each cache line.

Snoopy protocol: each core monitors (snoops) a shared bus and broadcasts its own state changes, much like broadcast messaging between micro‑services.

Common coherence state machines include MESI (Modified, Exclusive, Shared, Invalid) and its extensions MOESI and MESIF.
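To make the MESI states concrete, here is a toy transition function sketched from the protocol's standard state diagram (the enum and function names are ours; real hardware also issues bus transactions and handles the Shared-vs-Exclusive fill distinction, which are omitted here):

```cpp
// Toy MESI state machine for a single cache line, as seen by one core.
enum class Mesi { Modified, Exclusive, Shared, Invalid };
enum class Event {
    LocalRead, LocalWrite,   // this core accesses the line
    RemoteRead, RemoteWrite  // another core's access, observed via snooping
};

Mesi next_state(Mesi s, Event e) {
    switch (e) {
    case Event::LocalRead:
        // A miss fills the line; we conservatively fill as Shared
        // (hardware fills as Exclusive when no other core holds a copy).
        return s == Mesi::Invalid ? Mesi::Shared : s;
    case Event::LocalWrite:
        return Mesi::Modified;   // writing takes exclusive ownership
    case Event::RemoteRead:
        // M/E/S all downgrade to Shared (Modified also writes back first).
        return s == Mesi::Invalid ? s : Mesi::Shared;
    case Event::RemoteWrite:
        return Mesi::Invalid;    // another writer invalidates our copy
    }
    return s;
}
```

The RemoteWrite row is the one that matters for the false-sharing examples later: any write by another core invalidates our whole line, even if we never touch the bytes that core modified.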

Performance Examples

Example 1 – Loop Stride

const int LEN = 64 * 1024 * 1024;
int *arr = new int[LEN]();          // zero-initialized to avoid reading indeterminate values
// Stride 2 (8 bytes): 8 accesses per 64-byte cache line.
for (int i = 0; i < LEN; i += 2) arr[i] *= i;
// Stride 8 (32 bytes): a quarter of the iterations, but every line is still fetched.
for (int i = 0; i < LEN; i += 8) arr[i] *= i;
delete[] arr;

Both loops run in roughly the same time: although the second does a quarter of the multiplications, strides of 2 and 8 ints (8 and 32 bytes) both touch every 64‑byte cache line, so the two loops fetch exactly the same lines from memory — and that memory traffic, not the arithmetic, dominates the cost.

Example 2 – Increment Step

const int size = 1024 * 1024;        // buffer length (illustrative; the original does not give it)
const int increment = 16;            // stride under test, increased between runs
int *memory = new int[size]();
for (int i = 0; i < 10000000; i++) {
    for (int j = 0; j < size; j += increment) {
        memory[j] += j;
    }
}
delete[] memory;

Increasing the step reduces the number of accesses per pass, so the time should drop proportionally — but once the stride becomes a large power of two, successive accesses all map into the same cache sets, and the resulting conflict misses raise execution time dramatically.

Example 3 – Row vs. Column Traversal

const int row = 1024;
const int col = 512;
static int matrix[row][col];        // static: 2 MB is too large for the stack, and is zero-initialized
int sum_row = 0;
// Row-major traversal: consecutive elements share cache lines.
for (int r = 0; r < row; r++)
    for (int c = 0; c < col; c++)
        sum_row += matrix[r][c];
int sum_col = 0;
// Column-major traversal: each access jumps col * 4 = 2 KB — a new line every time.
for (int c = 0; c < col; c++)
    for (int r = 0; r < row; r++)
        sum_col += matrix[r][c];

Row‑major traversal is ~10× faster because it accesses memory sequentially, keeping cache lines hot.

Example 4 – False Sharing

#include <cstdlib>   // rand
#include <thread>

void fn(int* data) {
    for (int i = 0; i < 10 * 1024 * 1024; ++i)
        *data += rand();            // every iteration writes through the pointer
}

int p[32];                          // 128 bytes — spans two 64-byte cache lines
int *p1 = &p[0];
int *p2 = &p[1];                    // same cache line as p[0]; use &p[30] for a different line
std::thread t1(fn, p1);
std::thread t2(fn, p2);
t1.join();
t2.join();

When two threads modify variables that share the same cache line (e.g., p[0] and p[1]), execution time is ~5× slower than when they modify variables on different lines (e.g., p[0] and p[30]).
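Besides spacing the pointers apart (as with &p[30]), a more robust fix is to pad each variable to its own cache line. A minimal sketch — the struct and function names are ours, and C++17 code could use std::hardware_destructive_interference_size instead of a hard-coded 64:

```cpp
#include <cstdint>
#include <thread>

// Each counter occupies (at least) a full 64-byte cache line, so the
// two threads below never invalidate each other's line.
struct alignas(64) PaddedCounter {
    std::int64_t value = 0;
};

PaddedCounter counters[2];   // each element starts on its own line

void bump(PaddedCounter* c) {
    for (int i = 0; i < 10 * 1024 * 1024; ++i)
        c->value += i;       // hot write loop, now contention-free
}

void run() {
    std::thread t1(bump, &counters[0]);
    std::thread t2(bump, &counters[1]);
    t1.join();
    t2.join();
}
```

alignas(64) also pads sizeof(PaddedCounter) up to 64 bytes, so adjacent array elements cannot share a line — the same effect the &p[30] trick achieves by hand.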

Example 5 – Reducing False Sharing with Local Accumulator

#include <algorithm>   // std::min

const int total_size = 16 * 1024 * 1024;
int* test_data = new int[total_size]();
const int nthread = 6;              // const: required for the array bound below
int result[nthread];                // one slot per thread

void thread_func(int id) {
    int chunk = total_size / nthread + 1;
    int start = id * chunk;
    int end = std::min(start + chunk, total_size);
    int c = 0;                      // local accumulator — stays in this thread's registers/cache
    for (int i = start; i < end; ++i)
        if (test_data[i] % 2 != 0) ++c;
    result[id] = c;                 // the only write to the shared array
}

Using a thread‑local variable eliminates contention on the shared result[] cache line, allowing near‑linear scaling with the number of cores.
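A possible driver for this pattern is sketched below. It is our own self-contained version (a lambda in place of the free function, data passed in rather than global, and smaller sizes for brevity), but the chunking and local-accumulator logic match the example above:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Count odd values in `data` using `nthread` workers. Each worker
// accumulates into a local variable and writes its per-thread result
// exactly once, so there is no hot contention on shared cache lines.
int count_odds_parallel(const std::vector<int>& data, int nthread) {
    std::vector<int> result(nthread, 0);
    std::vector<std::thread> threads;
    const int total_size = static_cast<int>(data.size());
    const int chunk = total_size / nthread + 1;
    for (int id = 0; id < nthread; ++id)
        threads.emplace_back([&, id] {
            int start = id * chunk;
            int end = std::min(start + chunk, total_size);
            int c = 0;                       // local accumulator
            for (int i = start; i < end; ++i)
                if (data[i] % 2 != 0) ++c;
            result[id] = c;                  // single write per thread
        });
    for (auto& t : threads) t.join();
    int total = 0;
    for (int r : result) total += r;
    return total;
}
```

The final reduction over result[] is sequential, but it runs once per thread rather than once per element, so its cost is negligible.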

Conclusion

Understanding cache hierarchy, line size, associativity, and coherence is essential for writing high‑performance code. Small changes in data layout or access pattern can cause large differences in execution time due to cache hits, misses, and false sharing.

Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
