Understanding Linux CPU Caches: From Physical Cores to Cache Coherence
This article explains Linux CPU cache architecture—from physical and logical cores, through L1/L2/L3 hierarchy and cache‑line basics, to write‑through/write‑back policies and coherence mechanisms—while demonstrating practical analysis with Valgrind and perf tools.
Physical and Logical Cores
On a dual-CPU Linux system each physical core exposes two logical cores via Hyper-Threading, so the OS reports more logical cores than physically exist. In /proc/cpuinfo the physical id field identifies the CPU package (socket) and the core id field identifies a core within that package: cat /proc/cpuinfo | grep "core id" produces two distinct values (0 and 1) because each package contains two cores, while grep "physical id" likewise shows two values (0 and 1) for the two sockets. The detailed output therefore shows two physical CPUs, each with two physical cores, for a total of four physical cores. The lscpu command reports eight logical cores because Hyper-Threading splits each physical core into two logical ones.
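As a quick cross-check from C (a minimal sketch, not taken from the original article), glibc's sysconf() reports the logical CPU count, which should match the eight CPUs that lscpu shows; it does not by itself distinguish physical from logical cores:
<code>
/* Minimal sketch: query logical CPU counts from a C program.
 * sysconf() reports logical CPUs (what lscpu calls "CPU(s)");
 * mapping them to physical cores still requires /proc/cpuinfo or lscpu. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* logical CPUs known to the kernel */
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs currently online    */
    printf("logical CPUs configured: %ld, online: %ld\n", configured, online);
    return 0;
}
</code>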
CPU Cache Hierarchy
Why caches exist
Moore's Law doubles transistor counts roughly every 18‑24 months, but memory speed improves far more slowly, creating a 200‑300× latency gap between CPU and main memory. L1‑L3 caches bridge this gap by keeping frequently accessed data close to the core.
Multi‑level cache design
L1 is the smallest and fastest (typically 32 KB data + 32 KB instruction per core) and is split into data and instruction caches. L2 is larger (e.g., 256 KB) with slightly higher latency and is private to each core. L3 is shared among all cores (e.g., 8 MB) and serves as the last‑level buffer before main memory. Approximate access latencies are 2‑4 cycles for L1, 10‑20 cycles for L2, 20‑60 cycles for L3, and 200‑300 cycles for RAM.
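For reference, glibc also exposes the cache geometry through sysconf(); the following minimal sketch assumes a Linux/glibc system and may print 0 for levels the C library cannot determine (the sysfs files under /sys/devices/system/cpu/cpu0/cache/ are the fallback):
<code>
/* Minimal sketch (glibc-specific): print the cache hierarchy parameters
 * exposed via sysconf(). Values of 0 mean "unknown" on this system. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1d size : %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1i size : %ld bytes\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2  size : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3  size : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}
</code>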
Cache line
A cache line is the basic transfer unit between memory and any cache level; on modern x86 CPUs it is 64 bytes. All cache accesses operate on whole lines, which is why data alignment matters for performance.
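One practical consequence is that hot data structures are often padded and aligned to the line size. A minimal sketch, assuming the 64-byte line size mentioned above (the structure and field names are illustrative, not from the source):
<code>
/* Minimal sketch: align a hot structure to an assumed 64-byte cache line
 * so it starts at, and exactly fills, one line; unrelated data cannot
 * share that line. Compile with -std=c11. */
#include <stdalign.h>
#include <assert.h>

#define CACHE_LINE 64  /* assumed x86-64 line size */

struct counters {
    alignas(CACHE_LINE) unsigned long hits;           /* line-aligned start        */
    unsigned long misses;
    char pad[CACHE_LINE - 2 * sizeof(unsigned long)]; /* fill the rest of the line */
};

static_assert(sizeof(struct counters) == CACHE_LINE, "one structure per line");

int main(void) { struct counters c = {0}; (void)c; return 0; }
</code>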
Cache Write Policies
Write‑through
On a store the CPU first checks whether the line is already cached. If it is, the cache line is updated and the same data is written through to main memory; if not, the store goes directly to memory. Main memory therefore always holds the latest value, but every store pays the cost of a memory write.
Write‑back
On a store the CPU updates only the cache line and marks it “dirty”. The line is written back to memory only when it is evicted, reducing memory traffic.
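The difference between the two policies can be sketched with a toy, single-entry "cache" in front of a simulated memory. This is a conceptual model only; real hardware works on whole 64-byte lines and is invisible to software:
<code>
/* Conceptual sketch: where each write policy updates main memory. */
#include <stdio.h>
#include <stdbool.h>

enum { WRITE_THROUGH, WRITE_BACK };

static int memory[16];            /* simulated main memory          */
static int cached_addr = -1;      /* address held in the "cache"    */
static int cached_val;            /* cached copy of the data        */
static bool dirty;                /* write-back: modified in cache  */

static void store(int policy, int addr, int val) {
    if (cached_addr != addr) {                 /* miss: load the line, evicting the old one */
        if (policy == WRITE_BACK && dirty && cached_addr >= 0)
            memory[cached_addr] = cached_val;  /* dirty line is written back on eviction    */
        cached_addr = addr;
        cached_val  = memory[addr];
        dirty = false;
    }
    cached_val = val;                          /* update the cached copy                    */
    if (policy == WRITE_THROUGH)
        memory[addr] = val;                    /* write-through: memory updated every store */
    else
        dirty = true;                          /* write-back: memory updated only at eviction */
}

int main(void) {
    store(WRITE_BACK, 3, 42);     /* memory[3] stays 0: only the cache was updated  */
    printf("write-back:    cache=%d memory=%d\n", cached_val, memory[3]);
    store(WRITE_BACK, 5, 9);      /* miss evicts line 3, which is written back now  */
    printf("after evict:   memory[3]=%d\n", memory[3]);
    store(WRITE_THROUGH, 5, 11);  /* hit; memory is updated immediately             */
    printf("write-through: cache=%d memory=%d\n", cached_val, memory[5]);
    return 0;
}
</code>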
Coherence challenges
With write‑back, different cores may hold stale copies of a line. The article illustrates a two‑core example where each core increments a shared variable x in its private cache; if both write back only once, the final value in memory increases by only one instead of two, demonstrating a classic cache‑coherence problem.
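A software-level analogue of that scenario is easy to reproduce. The sketch below (an illustrative example, not code from the source) has two threads increment a shared counter; on real hardware MESI keeps the caches coherent, so the lost updates visible here come from the non-atomic read-modify-write sequence, but the effect is the same one the two-core example describes: both sides believe they incremented x, yet the final value is smaller than expected. Compile with gcc -pthread.
<code>
/* Minimal sketch of the shared-counter problem: two threads increment x
 * without synchronization, so increments are frequently lost; the atomic
 * counter shows the fixed behaviour. */
#include <pthread.h>
#include <stdio.h>
#include <stdatomic.h>

#define N 1000000

static long x;                 /* plain shared counter (racy)  */
static atomic_long x_atomic;   /* fixed version using atomics  */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        x++;                              /* non-atomic read-modify-write */
        atomic_fetch_add(&x_atomic, 1);   /* atomic increment             */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("plain  x = %ld (expected %d)\n", x, 2 * N);
    printf("atomic x = %ld\n", atomic_load(&x_atomic));
    return 0;
}
</code>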
Coherence mechanisms
Bus snooping: each core's cache controller monitors the shared bus for writes to addresses it holds and invalidates or updates its own copies accordingly.
MESI protocol: cache lines transition among the Modified (M), Exclusive (E), Shared (S), and Invalid (I) states to coordinate the visibility of writes.
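The basic MESI transitions can be summarized in a few lines of C. This is a deliberately simplified model (it omits bus transactions, data forwarding, and extensions such as MESIF/MOESI):
<code>
/* Simplified sketch of per-line MESI transitions as seen by one cache. */
#include <stdio.h>

typedef enum { M, E, S, I } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

static const char *name[] = { "Modified", "Exclusive", "Shared", "Invalid" };

/* other_sharers: whether some other cache already holds the line
 * (only relevant for a local read miss). */
static mesi_t next_state(mesi_t cur, event_t ev, int other_sharers) {
    switch (ev) {
    case LOCAL_READ:
        return (cur == I) ? (other_sharers ? S : E) : cur;
    case LOCAL_WRITE:
        return M;                                   /* I/S first broadcast an invalidate (RFO) */
    case SNOOP_READ:
        return (cur == M || cur == E) ? S : cur;    /* M also supplies/writes back the data    */
    case SNOOP_WRITE:
        return I;                                   /* another core takes ownership            */
    }
    return cur;
}

int main(void) {
    mesi_t st = I;
    st = next_state(st, LOCAL_READ, 0);  printf("read miss, no sharers -> %s\n", name[st]);
    st = next_state(st, LOCAL_WRITE, 0); printf("local write           -> %s\n", name[st]);
    st = next_state(st, SNOOP_READ, 0);  printf("other core reads      -> %s\n", name[st]);
    st = next_state(st, SNOOP_WRITE, 0); printf("other core writes     -> %s\n", name[st]);
    return 0;
}
</code>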
Practical Cache Analysis
Using Valgrind Cachegrind
Install Valgrind with sudo snap install valgrind (or your distribution's package manager) and run: <code>valgrind --tool=cachegrind hostnamectl</code> Cachegrind writes its results to a file named cachegrind.out.<pid>, which can then be annotated with cg_annotate. Sample excerpts:
Instruction references (Ir): 3,282,457
I1 misses: 4,095 (0.12% miss rate)
LL instruction misses: 3,184 (0.10% miss rate)
Data references: 852,198 reads (Dr), 355,999 writes (Dw)
D1 miss rate: 3.6% (4.1% reads, 2.5% writes)
LL data miss rate: 2.1%
Function-level analysis shows that ./elf/dl-lookup.c:do_lookup_x accounts for 23% of instruction references with only 37 I1 misses, while ./elf/../sysdeps/x86_64/dl-machine.h:_dl_relocate_object has a higher D1 miss count (7,840), representing 22.58% of its data reads.
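The numbers above come from profiling a stock binary (hostnamectl). To see cache misses you can actually influence, it is more instructive to profile a small program of your own. The following sketch (an illustrative example, not from the source) sums a matrix row-by-row and then column-by-column; under valgrind --tool=cachegrind the column-major pass shows a much higher D1 read-miss rate because every access lands on a new 64-byte line:
<code>
/* Minimal sketch: row-major vs column-major traversal of the same array.
 * Build with gcc -O1, run under valgrind --tool=cachegrind ./a.out,
 * then inspect per-function misses with cg_annotate. */
#include <stdio.h>

#define N 1024

static int a[N][N];

static long sum_rows(void) {      /* sequential walk: roughly one miss per 16 ints */
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

static long sum_cols(void) {      /* 4 KB stride: a new cache line on every access */
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;
    printf("%ld %ld\n", sum_rows(), sum_cols());
    return 0;
}
</code>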
Cache configuration observed
I1 cache: 32 KB, 64‑byte lines, 8‑way set associative
D1 cache: 32 KB, 64‑byte lines, 8‑way set associative
LL cache (L3): 8 MB, 64‑byte lines, 16‑way set associative
Using perf
Perf can count cache events directly: <code>sudo perf stat -e l1d.replacement,l1d_pend_miss.pending_cycles,l2_lines_in.all,l2_lines_out.non_silent hostnamectl</code> The sample run reports zero for all four counters, indicating that the short execution of hostnamectl triggered no measurable L1 data replacements or L2 line traffic.
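Those zero counts are mostly a consequence of hostnamectl touching very little data. Pointing the same perf invocation at a deliberately cache-unfriendly program produces non-zero values; below is an illustrative target (the 64-byte line size and the 64 MB buffer, chosen to exceed the 8 MB L3 observed earlier, are assumptions):
<code>
/* Minimal sketch of a cache-unfriendly target for perf stat:
 * striding by one cache line through a buffer far larger than the caches
 * forces continual L1 replacements and L2/L3 line fills. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define BUF  (64UL * 1024 * 1024)   /* 64 MB: larger than the 8 MB L3 */
#define LINE 64                     /* assumed cache-line size        */

int main(void) {
    unsigned char *buf = malloc(BUF);
    if (!buf) return 1;
    memset(buf, 1, BUF);            /* fault real pages in before measuring     */
    unsigned long sum = 0;
    for (int pass = 0; pass < 8; pass++)
        for (size_t i = 0; i < BUF; i += LINE)   /* touch one byte per line     */
            sum += buf[i];
    printf("%lu\n", sum);           /* keep the loops from being optimized away */
    free(buf);
    return 0;
}
</code>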
Conclusion
The majority of instruction fetches and data reads in the examined program are served from caches, but specific functions still exhibit noticeable miss rates that can be targeted for optimization. Understanding the distinction between physical and logical cores, the multi‑level cache hierarchy, cache‑line size, write policies, and coherence protocols is essential for effective performance tuning on Linux systems.
