
Understanding CPU Cache Hierarchy, Cache Coherence, and Performance Optimization

This article explains the structure of modern CPU caches, the principles of cache lines, associativity, and coherence protocols, and demonstrates how these hardware details affect program performance through multiple code examples covering loop stride, matrix traversal, multithreading, and false sharing.

Top Architect

Basic Knowledge

Modern CPUs use multi‑level caches (L1, L2, L3) to bridge the speed gap between the processor and main memory. L1 is split into instruction and data caches, while L2 and L3 are unified. The closer a cache level is to the CPU, the smaller and faster it is.

Typical access latencies are:

L1: ~4 CPU cycles

L2: ~11 CPU cycles

L3: ~39 CPU cycles

DRAM: ~107 CPU cycles

Because L1 is roughly 27× faster than DRAM but holds only tens of kilobytes (32 KB per core is typical), the hierarchy trades capacity for speed at each level.
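The latency ladder above can be observed directly. The sketch below is our own harness (the function name and pass count are illustrative, not from the article): it times a strided pass over a working set and returns nanoseconds per access, which jumps as the working set outgrows each cache level.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Time repeated passes over a working set of `bytes` bytes and return
// average nanoseconds per access. The stride is one cache line (64 B),
// so every access touches a different line.
double ns_per_access(std::size_t bytes) {
    std::vector<std::uint8_t> buf(bytes, 1);
    volatile std::uint64_t sink = 0;   // volatile: keeps the loop from being optimized away
    const int passes = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p)
        for (std::size_t i = 0; i < bytes; i += 64)
            sink += buf[i];
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (passes * (bytes / 64.0));
}
```

Comparing, say, ns_per_access(16 * 1024) (fits in L1) against ns_per_access(64 * 1024 * 1024) (spills to DRAM) should show a gap on the order of the cycle counts listed above; absolute numbers vary by machine, compiler, and prefetcher behavior.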

Cache Line and Associativity

A cache line is the smallest unit transferred between memory and cache, typically 64 bytes (16 × 32‑bit integers). The CPU maps memory addresses to cache lines using tag, index, and offset bits.

Mapping methods include:

Fully associative: any address can go to any line, but a lookup must compare the tag of every line, which is expensive in hardware.

Set‑associative (N‑Way): the address's index bits select a set, and the line can be placed in any of the N ways of that set. Example: an 8‑Way, 32 KB L1 with 64‑byte lines has 512 lines, arranged as 64 sets of 8 lines each.

Tag, set index, and byte offset are extracted from the address. For the L1 above (64‑byte lines, 64 sets): offset = address mod 64, set index = (address / 64) mod 64, and the remaining high bits form the tag.
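The decomposition can be sketched in a few lines of code. This is our own illustration (struct and function names are ours) for the 64‑byte‑line, 64‑set geometry described above:

```cpp
#include <cstdint>

// Split an address into tag / set index / byte offset for a cache with
// 64-byte lines and 64 sets (e.g. the 8-way 32 KB L1 discussed above).
struct CacheAddr {
    std::uint64_t tag;   // identifies which line occupies a way of the set
    unsigned set;        // which of the 64 sets the address maps to
    unsigned offset;     // byte position within the 64-byte line
};

CacheAddr decompose(std::uint64_t addr) {
    constexpr std::uint64_t kLineBytes = 64;  // 2^6 -> low 6 bits
    constexpr std::uint64_t kSets      = 64;  // 2^6 -> next 6 bits
    CacheAddr a;
    a.offset = static_cast<unsigned>(addr % kLineBytes);
    a.set    = static_cast<unsigned>((addr / kLineBytes) % kSets);
    a.tag    = addr / (kLineBytes * kSets);   // everything above the low 12 bits
    return a;
}
```

Note that two addresses within the same 64‑byte line always yield the same set and tag, which is why a single fetch services all of them.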

Cache Hit and Miss

Cache hit occurs when the requested data resides in the cache line identified by the tag; otherwise a miss triggers a fetch from the next level.

Cache Coherence

Two main protocols keep caches consistent across cores:

Directory protocol: a centralized directory tracks the state of each cache line.

Snoopy protocol: each core monitors (snoops) a shared bus and broadcasts its own state changes, much like broadcast messaging between micro‑services.

Common coherence state machines include MESI (Modified, Exclusive, Shared, Invalid) and its extensions MOESI and MESIF.
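To make the MESI states concrete, here is a toy transition function sketched from the protocol's standard state diagram (the enum and function names are ours; real hardware also issues bus transactions and handles the Shared-vs-Exclusive fill distinction, which are omitted here):

```cpp
// Toy MESI state machine for a single cache line, as seen by one core.
enum class Mesi { Modified, Exclusive, Shared, Invalid };
enum class Event {
    LocalRead, LocalWrite,   // this core accesses the line
    RemoteRead, RemoteWrite  // another core's access, observed via snooping
};

Mesi next_state(Mesi s, Event e) {
    switch (e) {
    case Event::LocalRead:
        // A miss fills the line; we conservatively fill as Shared
        // (hardware fills as Exclusive when no other core holds a copy).
        return s == Mesi::Invalid ? Mesi::Shared : s;
    case Event::LocalWrite:
        return Mesi::Modified;   // writing takes exclusive ownership
    case Event::RemoteRead:
        // M/E/S all downgrade to Shared (Modified also writes back first).
        return s == Mesi::Invalid ? s : Mesi::Shared;
    case Event::RemoteWrite:
        return Mesi::Invalid;    // another writer invalidates our copy
    }
    return s;
}
```

The RemoteWrite row is the one that matters for the false-sharing examples later: any write by another core invalidates our whole line, even if we never touch the bytes that core modified.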

Performance Examples

Example 1 – Loop Stride

const int LEN = 64 * 1024 * 1024;
int *arr = new int[LEN]();          // zero-initialized to avoid reading indeterminate values
// Stride 2 (8 bytes): 8 accesses per 64-byte cache line.
for (int i = 0; i < LEN; i += 2) arr[i] *= i;
// Stride 8 (32 bytes): a quarter of the iterations, but every line is still fetched.
for (int i = 0; i < LEN; i += 8) arr[i] *= i;
delete[] arr;

Both loops run in roughly the same time: although the second does a quarter of the multiplications, strides of 2 and 8 ints (8 and 32 bytes) both touch every 64‑byte cache line, so the two loops fetch exactly the same lines from memory — and that memory traffic, not the arithmetic, dominates the cost.

Example 2 – Increment Step

const int size = 1024 * 1024;        // buffer length (illustrative; the original does not give it)
const int increment = 16;            // stride under test, increased between runs
int *memory = new int[size]();
for (int i = 0; i < 10000000; i++) {
    for (int j = 0; j < size; j += increment) {
        memory[j] += j;
    }
}
delete[] memory;

Increasing the step reduces the number of accesses per pass, so the time should drop proportionally — but once the stride becomes a large power of two, successive accesses all map into the same cache sets, and the resulting conflict misses raise execution time dramatically.

Example 3 – Row vs. Column Traversal

const int row = 1024;
const int col = 512;
static int matrix[row][col];        // static: 2 MB is too large for the stack, and is zero-initialized
int sum_row = 0;
// Row-major traversal: consecutive elements share cache lines.
for (int r = 0; r < row; r++)
    for (int c = 0; c < col; c++)
        sum_row += matrix[r][c];
int sum_col = 0;
// Column-major traversal: each access jumps col * 4 = 2 KB — a new line every time.
for (int c = 0; c < col; c++)
    for (int r = 0; r < row; r++)
        sum_col += matrix[r][c];

Row‑major traversal is ~10× faster because it accesses memory sequentially, keeping cache lines hot.

Example 4 – False Sharing

#include <cstdlib>   // rand
#include <thread>

void fn(int* data) {
    for (int i = 0; i < 10 * 1024 * 1024; ++i)
        *data += rand();            // every iteration writes through the pointer
}

int p[32];                          // 128 bytes — spans two 64-byte cache lines
int *p1 = &p[0];
int *p2 = &p[1];                    // same cache line as p[0]; use &p[30] for a different line
std::thread t1(fn, p1);
std::thread t2(fn, p2);
t1.join();
t2.join();

When two threads modify variables that share the same cache line (e.g., p[0] and p[1]), execution time is ~5× slower than when they modify variables on different lines (e.g., p[0] and p[30]).
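Besides spacing the pointers apart (as with &p[30]), a more robust fix is to pad each variable to its own cache line. A minimal sketch — the struct and function names are ours, and C++17 code could use std::hardware_destructive_interference_size instead of a hard-coded 64:

```cpp
#include <cstdint>
#include <thread>

// Each counter occupies (at least) a full 64-byte cache line, so the
// two threads below never invalidate each other's line.
struct alignas(64) PaddedCounter {
    std::int64_t value = 0;
};

PaddedCounter counters[2];   // each element starts on its own line

void bump(PaddedCounter* c) {
    for (int i = 0; i < 10 * 1024 * 1024; ++i)
        c->value += i;       // hot write loop, now contention-free
}

void run() {
    std::thread t1(bump, &counters[0]);
    std::thread t2(bump, &counters[1]);
    t1.join();
    t2.join();
}
```

alignas(64) also pads sizeof(PaddedCounter) up to 64 bytes, so adjacent array elements cannot share a line — the same effect the &p[30] trick achieves by hand.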

Example 5 – Reducing False Sharing with Local Accumulator

#include <algorithm>   // std::min

const int total_size = 16 * 1024 * 1024;
int* test_data = new int[total_size]();
const int nthread = 6;              // const: required for the array bound below
int result[nthread];                // one slot per thread

void thread_func(int id) {
    int chunk = total_size / nthread + 1;
    int start = id * chunk;
    int end = std::min(start + chunk, total_size);
    int c = 0;                      // local accumulator — stays in this thread's registers/cache
    for (int i = start; i < end; ++i)
        if (test_data[i] % 2 != 0) ++c;
    result[id] = c;                 // the only write to the shared array
}

Using a thread‑local variable eliminates contention on the shared result[] cache line, allowing near‑linear scaling with the number of cores.
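A possible driver for this pattern is sketched below. It is our own self-contained version (a lambda in place of the free function, data passed in rather than global, and smaller sizes for brevity), but the chunking and local-accumulator logic match the example above:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Count odd values in `data` using `nthread` workers. Each worker
// accumulates into a local variable and writes its per-thread result
// exactly once, so there is no hot contention on shared cache lines.
int count_odds_parallel(const std::vector<int>& data, int nthread) {
    std::vector<int> result(nthread, 0);
    std::vector<std::thread> threads;
    const int total_size = static_cast<int>(data.size());
    const int chunk = total_size / nthread + 1;
    for (int id = 0; id < nthread; ++id)
        threads.emplace_back([&, id] {
            int start = id * chunk;
            int end = std::min(start + chunk, total_size);
            int c = 0;                       // local accumulator
            for (int i = start; i < end; ++i)
                if (data[i] % 2 != 0) ++c;
            result[id] = c;                  // single write per thread
        });
    for (auto& t : threads) t.join();
    int total = 0;
    for (int r : result) total += r;
    return total;
}
```

The final reduction over result[] is sequential, but it runs once per thread rather than once per element, so its cost is negligible.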

Conclusion

Understanding cache hierarchy, line size, associativity, and coherence is essential for writing high‑performance code. Small changes in data layout or access pattern can cause large differences in execution time due to cache hits, misses, and false sharing.

Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
