Fundamentals 24 min read

Understanding CPU Cache Consistency: MESI Protocol, Performance Tips & Code Examples

Cache consistency spans icache‑dcache synchronization, multi‑CPU cache coherence, and device‑CPU interactions; the article explains the MESI protocol, demonstrates performance impacts with multithreaded code, explores prefetching, false sharing, mapping strategies, and practical tips for writing cache‑aware software.

Liangxu Linux
Liangxu Linux
Liangxu Linux
Understanding CPU Cache Consistency: MESI Protocol, Performance Tips & Code Examples

Overview

This article explains the multiple layers of CPU cache consistency, from the synchronization of a single core’s instruction and data caches to the coherence protocols required for multi‑core processors and device‑CPU interactions. It also shows how cache‑related performance issues can be measured and mitigated.

Cache Consistency Layers

Three main consistency problems are highlighted:

Synchronization between a core’s icache (instruction cache) and dcache (data cache).

Coherence among the caches of multiple CPUs.

Cache synchronization between CPUs and external devices (e.g., DMA engines).

ICache‑DCache Synchronization

When self‑modifying code (e.g., JIT‑generated functions) updates a memory location p, the new instructions are written to the dcache. If the icache still holds the old version, the CPU may execute stale code. The software must clean the dcache and invalidate the icache, which incurs noticeable overhead. Some ARM64 processors (e.g., the Neoverse N1) provide hardware‑assisted icache synchronization, reducing this cost.

Multi‑CPU Cache Coherence

In a simplified system, CPUs A and B share an L3 cache, while CPUs C and D share another L3. When one core modifies a cache line, other cores must see the updated value. This is achieved through cache‑coherence protocols such as MESI or MOESI. The article focuses on MESI, which defines four states:

M (Modified) : The line is dirty and present only in the owning cache.

E (Exclusive) : The line is clean and present only in one cache.

S (Shared) : The line is clean and may exist in multiple caches.

I (Invalid) : The line is not valid in the cache.

The state machine ensures that when a core writes to a line, other cores’ copies are invalidated, and the modified data is eventually written back to memory or shared as needed.

Performance Impact of Cache Coherence

A two‑threaded program that writes to one cache line and reads from another demonstrates the cost of coherence. Running on two cores yields a real time of 3.614 s, while forcing both threads onto a single core reduces the time to 3.299 s. Changing the second thread to read a different cache line eliminates the coherence traffic, cutting the runtime to 1.820 s.

$ time ./a.out
real  0m3.614s
user  0m7.021s
sys   0m0.004s
$ time ./b.out
real  0m1.820s
user  0m3.606s
sys   0m0.008s

Device‑CPU Cache Synchronization

When a device performs DMA, the CPU must ensure that the cache is clean and invalidated for the affected memory region. Linux provides streaming DMA APIs (e.g., dma_map_single(), dma_unmap_single(), dma_sync_single_for_cpu(), dma_sync_single_for_device()) that hide the hardware differences. If dma_alloc_coherent() is used, the buffer is already cache‑coherent, avoiding explicit synchronization.

Prefetching

Hardware prefetchers can reduce L3‑miss stalls, but software‑controlled prefetching can further improve performance. A classic binary‑search implementation wrapped with a prefetch() macro using the ARM pld instruction shows a 14 % speedup (10 s vs. 11.6 s) when prefetching is enabled.

static inline void prefetch(const void *ptr) {
    __asm__ __volatile__("pld %a0" :: "p" (ptr));
}

False Sharing

When multiple threads write to different variables that reside on the same cache line, the line bounces between cores, causing severe performance degradation. Padding or reordering struct members to place frequently written fields on separate cache lines eliminates this “false sharing.”

struct s {
    int a;
    char padding[cacheline_size - sizeof(int)];
    int b;
};

Cache Mapping Strategies

The article reviews three mapping schemes:

Direct mapping : Each memory block maps to exactly one cache line, simple but prone to conflicts.

Fully associative mapping : Any block can occupy any line, offering high flexibility but costly hardware.

Set‑associative mapping : A compromise where the cache is divided into sets; each set is fully associative. Common configurations are 2‑way, 4‑way, 8‑way, and 16‑way.

Replacement Policy

Most modern caches use the Least‑Recently‑Used (LRU) policy to evict the line that has not been accessed for the longest time.

Practical Programming Tips

To write cache‑friendly code, follow locality principles:

Temporal locality : Reuse data soon after it is loaded.

Spatial locality : Access data that is stored contiguously.

For example, iterating a matrix row‑major order yields far better performance than column‑major order because rows are stored contiguously in memory and thus better utilize the cache.

const int row = 1024;
const int col = 1024;
int matrix[row][col];

// Row‑major traversal (fast)
int sum_row = 0;
for (int r = 0; r < row; r++) {
    for (int c = 0; c < col; c++) {
        sum_row += matrix[r][c];
    }
}

// Column‑major traversal (slow)
int sum_col = 0;
for (int c = 0; c < col; c++) {
    for (int r = 0; r < row; r++) {
        sum_col += matrix[r][c];
    }
}

By respecting cache line boundaries, using appropriate padding, and employing prefetch instructions when necessary, developers can significantly reduce cache‑related stalls and improve overall program throughput.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceCacheCPUfalse sharingprefetchMESICoherence
Liangxu Linux
Written by

Liangxu Linux

Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.