
Why CPU Caches Matter: Levels, Coherence, and Memory Barriers

CPU caches, organized into L1‑L3 levels, accelerate memory access by exploiting locality, but their independent copies can cause data inconsistency across cores; coherence protocols such as MESI and memory‑barrier instructions ensure that reads and writes remain ordered and visible across all processors.

Xiaokun's Architecture Exploration Notes

1. CPU Cache

Origin of CPU cache

Every instruction cycle requires at least one memory access, to fetch the instruction itself.

Additional accesses to fetch operands or store results mean the CPU's effective speed is limited by memory latency.

Solution: place a small, fast store between the CPU and main memory, called a cache, which exploits locality of reference.

Cache Overview

Cache is divided into lines; each line is a block of 32, 64, or 128 bytes depending on architecture.

Cache holds copies of portions of physical memory.

When the CPU reads data, it first checks the cache; if present, it returns the data, otherwise it fetches from main memory.
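The lookup flow above can be sketched as a toy direct-mapped cache. The 64-byte line size, 16-line capacity, and all class names here are illustrative assumptions, not any real hardware's layout:

```java
// Toy direct-mapped cache: 16 lines of 64 bytes each (illustrative sizes).
class ToyCache {
    static final int LINE_SIZE = 64;
    static final int NUM_LINES = 16;
    final long[] tags = new long[NUM_LINES];
    final boolean[] valid = new boolean[NUM_LINES];
    int hits = 0, misses = 0;

    // Returns true on a cache hit; on a miss, "fetches" the line from memory.
    boolean access(long address) {
        long lineNumber = address / LINE_SIZE;       // which memory block
        int index = (int) (lineNumber % NUM_LINES);  // which cache line it maps to
        long tag = lineNumber / NUM_LINES;           // disambiguates blocks sharing a line
        if (valid[index] && tags[index] == tag) {
            hits++;
            return true;
        }
        misses++;             // miss: load the block from main memory
        valid[index] = true;
        tags[index] = tag;
        return false;
    }
}

public class CacheDemo {
    public static void main(String[] args) {
        ToyCache cache = new ToyCache();
        cache.access(0);    // miss: cold cache
        cache.access(8);    // hit: same 64-byte line as address 0
        cache.access(1024); // miss: maps to index 0 with a different tag, evicts it
        cache.access(0);    // miss again: the line was evicted
        System.out.println(cache.hits + " hits, " + cache.misses + " misses");
    }
}
```

Note how addresses 0 and 8 share one line (spatial locality pays off), while address 1024 conflicts with address 0 in this tiny direct-mapped design.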

Cache and Memory
Cache Levels L1‑L3

L1 Cache: first‑level cache, built into each core and split into separate instruction and data caches; typically 32‑64 KB per core, it is the smallest and fastest level.

L2 Cache: second‑level cache; historically placed outside the CPU chip, on modern processors it sits on‑die (usually per core), larger and slower than L1.

L3 Cache: third‑level cache, shared among all cores on the chip; the largest and slowest cache level, it cuts trips to main memory.

2. Cache Consistency and MESI Protocol

Read and write operations on a single‑CPU cache

Read: the CPU looks in L1, then L2, then L3; if every level misses, it fetches the data from main memory.

If only reads occur, all cache levels stay consistent with main memory.

Write: two common policies exist. Write‑through: every store updates the cache line and is immediately propagated to the next level or main memory, so memory never becomes stale. Write‑back: stores update only the cache line, which is marked dirty; the data is written to the next level or main memory later, when the line is evicted.
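The two policies can be contrasted by counting main-memory updates for a single cache line. This is a minimal sketch with invented names, not a model of any real controller:

```java
// Contrast of write policies for a single cache line (illustrative sketch).
public class WritePolicyDemo {
    int memoryWrites = 0;    // how many times main memory was updated
    int cachedValue;
    boolean dirty = false;

    // Write-through: every store updates the cache AND main memory.
    void writeThrough(int value) {
        cachedValue = value;
        memoryWrites++;      // propagate immediately
    }

    // Write-back: stores touch only the cache and mark the line dirty.
    void writeBack(int value) {
        cachedValue = value;
        dirty = true;        // main memory is now stale
    }

    // On eviction, a dirty line must be flushed before it is reused.
    void evict() {
        if (dirty) {
            memoryWrites++;  // one deferred write absorbs all earlier stores
            dirty = false;
        }
    }

    public static void main(String[] args) {
        WritePolicyDemo through = new WritePolicyDemo();
        for (int i = 0; i < 5; i++) through.writeThrough(i);

        WritePolicyDemo back = new WritePolicyDemo();
        for (int i = 0; i < 5; i++) back.writeBack(i);
        back.evict();

        System.out.println("write-through: " + through.memoryWrites
                + " memory writes, write-back: " + back.memoryWrites);
        // write-through: 5 memory writes, write-back: 1
    }
}
```

Five stores cost five memory writes under write-through but only one deferred write under write-back, which is why write-back is the common choice despite the extra bookkeeping.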

Cache consistency issues in multicore

When one core updates a memory line in its cache, other cores may still hold stale copies, leading to inconsistency.

Cache consistency protocol
To understand how this problem is solved, first note why cache data becomes inconsistent:

Each core has its own private cache, whose contents other cores cannot read directly.

Sharing a single cache among all cores would hurt performance, because every core would have to wait for exclusive access on writes.

The goal is a multi‑core cache that behaves like a single coherent cache; coherence protocols address this.

MESI Protocol
There are many cache coherence protocols; a typical one is MESI, named after the four states a cache line can be in:

Invalid (I): the cache line holds no valid data, or its copy is outdated.

Shared (S): the line is valid and consistent with main memory, and copies may exist in other caches; it may be freely read.

Exclusive (E): the line is valid, consistent with main memory, and present only in this cache; the core can write it without notifying others (moving it to M).

Modified (M): the line has been changed (dirty) and not yet written back to main memory; only this cache holds the current value.

Summary: each cache controller snoops the reads and writes of other CPUs and updates its own line states accordingly, so all cores eventually observe consistent data; the E state lets a core modify a line without first broadcasting an invalidation, because it already knows no other copy exists.
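The four states and their main transitions can be sketched as a small state machine. This is a simplified, single-line view from one core's perspective; the event names are my own labels and real protocols involve considerably more bus traffic:

```java
// Simplified MESI transitions for one cache line, seen from one core.
// Event names (localRead, remoteWrite, ...) are illustrative labels.
enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

public class MesiLine {
    MesiState state = MesiState.INVALID;

    // Local read: on a miss, load the line; E if no other cache holds it, else S.
    void localRead(boolean otherCachesHaveCopy) {
        if (state == MesiState.INVALID) {
            state = otherCachesHaveCopy ? MesiState.SHARED : MesiState.EXCLUSIVE;
        } // M, E, S: plain hit, no state change
    }

    // Local write: gain exclusive ownership (other copies invalidated) and dirty the line.
    void localWrite() {
        state = MesiState.MODIFIED;
    }

    // Another core reads this line: a Modified line is written back first, then shared.
    void remoteRead() {
        if (state == MesiState.MODIFIED || state == MesiState.EXCLUSIVE) {
            state = MesiState.SHARED;
        }
    }

    // Another core writes this line: our copy becomes stale.
    void remoteWrite() {
        state = MesiState.INVALID;
    }

    public static void main(String[] args) {
        MesiLine line = new MesiLine();
        line.localRead(false);  // cold miss, no other copies -> EXCLUSIVE
        line.localWrite();      // silent upgrade, no invalidation needed -> MODIFIED
        line.remoteRead();      // write back dirty data, then share -> SHARED
        line.remoteWrite();     // our copy is now stale -> INVALID
        System.out.println("final state: " + line.state);
    }
}
```

The `localWrite` call from the E state is the interesting case: because E guarantees no other copy exists, the core upgrades to M without any bus message, which is exactly the optimization the summary above describes.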

3. Memory Barriers

CPU optimization: runtime instruction reordering

Why reordering occurs

When a CPU issues a write and finds the target cache line held by another CPU, it may, rather than stall, execute later independent read instructions first to keep the pipeline busy.

Reordering principle

Reordering must follow the as-if-serial rule: no matter how the compiler or processor reorders instructions to increase parallelism, the result of a single‑threaded program must not change. In particular, compilers, runtimes, and processors never reorder operations that have a data dependence on each other.
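A minimal single-threaded example of the as-if-serial rule (the variable names are arbitrary):

```java
// as-if-serial: within one thread, reordering must preserve the observable result.
public class AsIfSerial {
    public static void main(String[] args) {
        int a = 1;       // (1) independent of (2): compiler/CPU may swap (1) and (2)
        int b = 2;       // (2)
        int c = a + b;   // (3) data-depends on (1) and (2): must execute after both
        System.out.println(c); // always prints 3, whatever order (1) and (2) ran in
    }
}
```

Statements (1) and (2) may legally be reordered because neither reads the other's result; statement (3) may not move above either of them, so the single-threaded outcome is fixed.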
Problems with CPU caches

Cache data and main memory are not synchronized in real time; different CPUs may see different values for the same address.

Instruction reordering that is harmless within one thread can still produce incorrect results in multithreaded programs, because the as-if-serial guarantee says nothing about what other threads observe.

Memory Barrier

Definition

A memory barrier is a synchronization instruction that forces the CPU (and compiler) to perform memory operations in a strict order: instructions before the barrier are not reordered past it, and instructions after it are not reordered ahead of it.

Barrier instructions

Write memory barrier (Store Barrier): inserted after write instructions; forces the latest cached data to be flushed to main memory so it becomes visible to other processors.

Read memory barrier (Load Barrier): inserted before read instructions; invalidates local cache lines so that subsequent reads fetch fresh data from main memory.

Full memory barrier: ensures that all memory reads and writes before the barrier are committed before any after the barrier execute.

Purpose: memory barriers solve the visibility and ordering problems that CPU caches and instruction reordering introduce.

Finally, this discussion of the memory hardware is meant to lay the groundwork for understanding the memory semantics of the synchronized keyword, covered in the next article.

Written by Xiaokun's Architecture Exploration Notes

10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
