Understanding Linux CPU Caches: From Physical Cores to Cache Coherence
This article explains Linux CPU cache architecture—from physical and logical cores, through L1/L2/L3 hierarchy and cache‑line basics, to write‑through/write‑back policies and coherence mechanisms—while demonstrating practical analysis with Valgrind and perf tools.
Physical and Logical Cores
On a dual-CPU Linux system each physical core exposes two logical cores via Hyper-Threading, so the OS reports more logical cores than physically exist. In /proc/cpuinfo the physical id field identifies the CPU package (socket) and the core id field identifies a core within that package: cat /proc/cpuinfo | grep "core id" produces two distinct values (0 and 1) because each package contains two cores, while grep "physical id" likewise shows two values (0 and 1) for the two sockets. The detailed output therefore shows two physical CPUs, each with two physical cores, for a total of four physical cores. The lscpu command reports eight logical cores because Hyper-Threading splits each physical core into two logical ones.
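As a quick cross-check from C (a minimal sketch, not taken from the original article), glibc's sysconf() reports the logical CPU count, which should match the eight CPUs that lscpu shows; it does not by itself distinguish physical from logical cores:
<code>
/* Minimal sketch: query logical CPU counts from a C program.
 * sysconf() reports logical CPUs (what lscpu calls "CPU(s)");
 * mapping them to physical cores still requires /proc/cpuinfo or lscpu. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* logical CPUs known to the kernel */
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs currently online    */
    printf("logical CPUs configured: %ld, online: %ld\n", configured, online);
    return 0;
}
</code>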
CPU Cache Hierarchy
Why caches exist
Moore's Law doubles transistor counts roughly every 18‑24 months, but memory speed improves far more slowly, creating a 200‑300× latency gap between CPU and main memory. L1‑L3 caches bridge this gap by keeping frequently accessed data close to the core.
Multi‑level cache design
L1 is the smallest and fastest (typically 32 KB data + 32 KB instruction per core) and is split into data and instruction caches. L2 is larger (e.g., 256 KB) with slightly higher latency and is private to each core. L3 is shared among all cores (e.g., 8 MB) and serves as the last‑level buffer before main memory. Approximate access latencies are 2‑4 cycles for L1, 10‑20 cycles for L2, 20‑60 cycles for L3, and 200‑300 cycles for RAM.
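For reference, glibc also exposes the cache geometry through sysconf(); the following minimal sketch assumes a Linux/glibc system and may print 0 for levels the C library cannot determine (the sysfs files under /sys/devices/system/cpu/cpu0/cache/ are the fallback):
<code>
/* Minimal sketch (glibc-specific): print the cache hierarchy parameters
 * exposed via sysconf(). Values of 0 mean "unknown" on this system. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1i size : %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2  size : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3  size : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}
</code>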
Cache line
A cache line is the basic transfer unit between memory and any cache level; on modern x86 CPUs it is 64 bytes. All cache accesses operate on whole lines, which is why data alignment matters for performance.
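One practical consequence is that hot data structures are often padded and aligned to the line size. A minimal sketch, assuming the 64-byte line size mentioned above (the structure and field names are illustrative, not from the source):
<code>
/* Minimal sketch: align a hot structure to an assumed 64-byte cache line
 * so it starts at, and exactly fills, one line; unrelated data cannot
 * share that line. Compile with -std=c11. */
#include <stdalign.h>
#include <assert.h>

#define CACHE_LINE 64  /* assumed x86-64 line size */

struct counters {
    alignas(CACHE_LINE) unsigned long hits;           /* line-aligned start        */
    unsigned long misses;
    char pad[CACHE_LINE - 2 * sizeof(unsigned long)]; /* fill the rest of the line */
};

static_assert(sizeof(struct counters) == CACHE_LINE, "one structure per line");

int main(void) { struct counters c = {0}; (void)c; return 0; }
</code>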
Cache Write Policies
Write‑through
On a store the CPU first checks whether the line is already cached. If it is, the cache line is updated and the same data is written through to main memory; if not, the store goes directly to memory. Main memory therefore always holds the latest value, but every store pays the cost of a memory write.
Write‑back
On a store the CPU updates only the cache line and marks it “dirty”. The line is written back to memory only when it is evicted, reducing memory traffic.
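The difference between the two policies can be sketched with a toy, single-entry "cache" in front of a simulated memory. This is a conceptual model only; real hardware works on whole 64-byte lines and is invisible to software:
<code>
/* Conceptual sketch: where each write policy updates main memory. */
#include <stdio.h>
#include <stdbool.h>

enum { WRITE_THROUGH, WRITE_BACK };

static int memory[16];            /* simulated main memory          */
static int cached_addr = -1;      /* address held in the "cache"    */
static int cached_val;            /* cached copy of the data        */
static bool dirty;                /* write-back: modified in cache  */

static void store(int policy, int addr, int val) {
    if (cached_addr != addr) {                 /* miss: load the line, evicting the old one */
        if (policy == WRITE_BACK && dirty && cached_addr >= 0)
            memory[cached_addr] = cached_val;  /* dirty line is written back on eviction    */
        cached_addr = addr;
        cached_val  = memory[addr];
        dirty = false;
    }
    cached_val = val;                          /* update the cached copy                    */
    if (policy == WRITE_THROUGH)
        memory[addr] = val;                    /* write-through: memory updated every store */
    else
        dirty = true;                          /* write-back: memory updated only at eviction */
}

int main(void) {
    store(WRITE_BACK, 3, 42);     /* memory[3] stays 0: only the cache was updated  */
    printf("write-back:    cache=%d memory=%d\n", cached_val, memory[3]);
    store(WRITE_BACK, 5, 9);      /* miss evicts line 3, which is written back now  */
    printf("after evict:   memory[3]=%d\n", memory[3]);
    store(WRITE_THROUGH, 5, 11);  /* hit; memory is updated immediately             */
    printf("write-through: cache=%d memory=%d\n", cached_val, memory[5]);
    return 0;
}
</code>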
Coherence challenges
With write‑back, different cores may hold stale copies of a line. The article illustrates a two‑core example where each core increments a shared variable x in its private cache; if both write back only once, the final value in memory increases by only one instead of two, demonstrating a classic cache‑coherence problem.
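A software-level analogue of that scenario is easy to reproduce. The sketch below (an illustrative example, not code from the source) has two threads increment a shared counter; on real hardware MESI keeps the caches coherent, so the lost updates visible here come from the non-atomic read-modify-write sequence, but the effect is the same one the two-core example describes: both sides believe they incremented x, yet the final value is smaller than expected. Compile with gcc -pthread.
<code>
/* Minimal sketch of the shared-counter problem: two threads increment x
 * without synchronization, so increments are frequently lost; the atomic
 * counter shows the fixed behaviour. */
#include <pthread.h>
#include <stdio.h>
#include <stdatomic.h>

#define N 1000000

static long x;                 /* plain shared counter (racy)  */
static atomic_long x_atomic;   /* fixed version using atomics  */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        x++;                              /* non-atomic read-modify-write */
        atomic_fetch_add(&x_atomic, 1);   /* atomic increment             */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("plain  x = %ld (expected %d)\n", x, 2 * N);
    printf("atomic x = %ld\n", atomic_load(&x_atomic));
    return 0;
}
</code>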
Coherence mechanisms
Bus snooping: each core's cache controller monitors the shared bus for writes to addresses it holds and invalidates or updates its own copies accordingly.
MESI protocol: cache lines transition among the Modified (M), Exclusive (E), Shared (S), and Invalid (I) states to coordinate the visibility of writes.
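The basic MESI transitions can be summarized in a few lines of C. This is a deliberately simplified model (it omits bus transactions, data forwarding, and extensions such as MESIF/MOESI):
<code>
/* Simplified sketch of per-line MESI transitions as seen by one cache. */
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

static const char *name[] = { "Modified", "Exclusive", "Shared", "Invalid" };

/* other_sharers: whether some other cache already holds the line
 * (only relevant for a local read miss). */
static mesi_t next_state(mesi_t cur, event_t ev, int other_sharers) {
    switch (ev) {
    case LOCAL_READ:
        return (cur == I) ? (other_sharers ? S : E) : cur;
    case LOCAL_WRITE:
        return M;                                   /* I/S first broadcast an invalidate (RFO) */
    case SNOOP_READ:
        return (cur == M || cur == E) ? S : cur;    /* M also supplies/writes back the data    */
    case SNOOP_WRITE:
        return I;                                   /* another core takes ownership            */
    }
    return cur;
}

int main(void) {
    mesi_t st = I;
    st = next_state(st, LOCAL_READ, 0);  printf("read miss, no sharers -> %s\n", name[st]);
    st = next_state(st, LOCAL_WRITE, 0); printf("local write           -> %s\n", name[st]);
    st = next_state(st, SNOOP_READ, 0);  printf("other core reads      -> %s\n", name[st]);
    st = next_state(st, SNOOP_WRITE, 0); printf("other core writes     -> %s\n", name[st]);
    return 0;
}
</code>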
Practical Cache Analysis
Using Valgrind Cachegrind
Install Valgrind with sudo snap install valgrind (or your distribution's package manager) and run: <code>valgrind --tool=cachegrind hostnamectl</code> Cachegrind writes its results to a file named cachegrind.out.<pid>, which can then be annotated with cg_annotate. Sample excerpts:
Instruction references (Ir): 3,282,457
I1 misses: 4,095 (0.12% miss rate)
LL instruction misses: 3,184 (0.10% miss rate)
Data references: 852,198 reads (Dr), 355,999 writes (Dw)
D1 miss rate: 3.6% (4.1% reads, 2.5% writes)
LL data miss rate: 2.1%
Function-level analysis shows that ./elf/dl-lookup.c:do_lookup_x accounts for 23% of instruction references with only 37 I1 misses, while ./elf/../sysdeps/x86_64/dl-machine.h:_dl_relocate_object has a higher D1 miss count (7,840), representing 22.58% of its data reads.
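The numbers above come from profiling a stock binary (hostnamectl). To see cache misses you can actually influence, it is more instructive to profile a small program of your own. The following sketch (an illustrative example, not from the source) sums a matrix row-by-row and then column-by-column; under valgrind --tool=cachegrind the column-major pass shows a much higher D1 read-miss rate because every access lands on a new 64-byte line:
<code>
/* Minimal sketch: row-major vs column-major traversal of the same array.
 * Build with gcc -O1, run under valgrind --tool=cachegrind ./a.out,
 * then inspect per-function misses with cg_annotate. */
#include <stdio.h>

#define N 1024

static int a[N][N];

static long sum_rows(void) {      /* sequential walk: roughly one miss per 16 ints */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

static long sum_cols(void) {      /* 4 KB stride: a new cache line on every access */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;
    printf("%ld %ld\n", sum_rows(), sum_cols());
    return 0;
}
</code>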
Cache configuration observed
I1 cache: 32 KB, 64‑byte lines, 8‑way set associative
D1 cache: 32 KB, 64‑byte lines, 8‑way set associative
LL cache (L3): 8 MB, 64‑byte lines, 16‑way set associative
Using perf
Perf can count cache events directly: <code>sudo perf stat -e l1d.replacement,l1d_pend_miss.pending_cycles,l2_lines_in.all,l2_lines_out.non_silent hostnamectl</code> The sample run reports zero for all four counters, indicating that the short execution of hostnamectl triggered no measurable L1 data replacements or L2 line traffic.
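Those zero counts are mostly a consequence of hostnamectl touching very little data. Pointing the same perf invocation at a deliberately cache-unfriendly program produces non-zero values; below is an illustrative target (the 64-byte line size and the 64 MB buffer, chosen to exceed the 8 MB L3 observed earlier, are assumptions):
<code>
/* Minimal sketch of a cache-unfriendly target for perf stat:
 * striding by one cache line through a buffer far larger than the caches
 * forces continual L1 replacements and L2/L3 line fills. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define BUF  (64UL * 1024 * 1024)   /* 64 MB: larger than the 8 MB L3 */
#define LINE 64                     /* assumed cache-line size        */

int main(void) {
    unsigned char *buf = malloc(BUF);
    if (!buf) return 1;
    memset(buf, 1, BUF);            /* fault real pages in before measuring     */
    unsigned long sum = 0;
    for (int pass = 0; pass < 8; pass++)
        for (size_t i = 0; i < BUF; i += LINE)   /* touch one byte per line     */
            sum += buf[i];
    printf("%lu\n", sum);           /* keep the loops from being optimized away */
    free(buf);
    return 0;
}
</code>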
Conclusion
The majority of instruction fetches and data reads in the examined program are served from caches, but specific functions still exhibit noticeable miss rates that can be targeted for optimization. Understanding the distinction between physical and logical cores, the multi‑level cache hierarchy, cache‑line size, write policies, and coherence protocols is essential for effective performance tuning on Linux systems.
