
Why Caches Matter: Unlocking CPU Performance with Memory Hierarchy

This article explains how memory hierarchy and cache design, guided by the principle of locality, reduce the CPU‑memory speed gap, and provides practical guidelines for writing cache‑friendly code to dramatically improve program performance.


Overview

In a simple model of a computer system, memory is a linear array of bytes that the CPU can access in constant time. In reality, memory is a hierarchy of devices with varying capacity, cost, and latency. CPU registers hold the most frequently used data; small, fast caches near the CPU stage data and instructions held in main memory; main memory in turn stages data stored on large, slow local disks; and local disks can stage data held on remote machines reached over the network.

Basic Cache Model

The CPU fetches instructions and data from main memory over the bus and writes results back after computing. The speed gap between the very fast CPU and the relatively slow main memory creates a bottleneck, so a cache layer is inserted between them.

The cache holds frequently or recently accessed instructions and data. Because cache memory is expensive, its capacity is much smaller than that of main memory. Even so, adding this layer alleviates the CPU‑memory bottleneck.

Principle of Locality

You may wonder why adding a cache helps. The principle of locality—temporal and spatial—means programs tend to reuse recently accessed data or access nearby addresses. Temporal locality: a memory address accessed once is likely to be accessed again soon. Spatial locality: accessing one address makes nearby addresses likely to be accessed.

How programs exploit locality:

Data side: in a loop that sums an array, the accumulator sum is read and written on every iteration (temporal locality), while the array elements are traversed with stride 1 (spatial locality).

Instruction side: the loop body's instructions are fetched repeatedly (temporal locality), and instructions execute largely in sequence (spatial locality).

Writing code with good locality improves performance.
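
To make the effect of locality concrete, here is a small illustration in the spirit of the classic CSAPP row-versus-column traversal example. It is a sketch rather than code from the original article; the names ROWS, COLS, sum_rows, and sum_cols are illustrative.

#define ROWS 128
#define COLS 128

/* Row-major traversal: the inner loop walks memory with stride 1,
   so spatial locality is good; the accumulator sum is reused on
   every iteration (temporal locality). */
int sum_rows(int a[ROWS][COLS]) {
    int i, j, sum = 0;
    for (i = 0; i < ROWS; i++)
        for (j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: the inner loop jumps
   COLS ints at a time, so spatial locality is poor and cache misses
   rise sharply, even though the result is identical. */
int sum_cols(int a[ROWS][COLS]) {
    int i, j, sum = 0;
    for (j = 0; j < COLS; j++)
        for (i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}

Both functions compute the same sum; only the access order, and therefore the locality, differs.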

Memory Hierarchy

A typical memory hierarchy looks like this: at the top are a few CPU registers (single‑cycle access), followed by one or more levels of SRAM cache (a few cycles), a large DRAM main memory (tens to hundreds of cycles), large but slow local disks, and optionally remote disks accessed over a network (e.g., via NFS).

Each level acts as a cache for the next, slower level below it. Storage at each level is partitioned into blocks; a block from the slower level below is copied into the faster level above when it is needed. Block size may differ between adjacent levels (e.g., transfers between L1 and L2 use blocks of a few dozen bytes, while levels further down use larger blocks).

When the program needs a data object stored in a block at level k+1, the system first looks in level k. A hit returns the data quickly; a miss causes the block to be fetched from level k+1 into level k, possibly evicting a resident block according to a replacement policy such as random or LRU (least recently used).
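
As a minimal sketch of this hit/miss/evict logic (not from the original article): the structure below models a small, fully associative level k with an LRU policy. The capacity LEVEL_K_LINES, the access_block name, and the timestamp-based bookkeeping are all illustrative assumptions.

#define LEVEL_K_LINES 8               /* how many blocks level k can hold (assumed) */

struct cached_block {
    int      valid;
    unsigned block_id;                /* which level-(k+1) block is cached here */
    unsigned last_used;               /* timestamp used by the LRU policy */
};

static struct cached_block level_k[LEVEL_K_LINES];
static unsigned clock_tick;

/* Returns 1 on a hit at level k; on a miss, installs the block from
   level k+1 by evicting the least recently used (or an empty) line. */
int access_block(unsigned block_id) {
    int i, victim = 0;
    clock_tick++;
    for (i = 0; i < LEVEL_K_LINES; i++) {
        if (level_k[i].valid && level_k[i].block_id == block_id) {
            level_k[i].last_used = clock_tick;    /* hit */
            return 1;
        }
        if (level_k[i].last_used < level_k[victim].last_used)
            victim = i;                           /* track the LRU candidate */
    }
    /* miss: overwrite the victim line with the newly fetched block */
    level_k[victim].valid = 1;
    level_k[victim].block_id = block_id;
    level_k[victim].last_used = clock_tick;
    return 0;
}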

Cache Memory

Early computers had three levels: registers, DRAM main memory, and disk. As the CPU‑memory speed gap grew, designers added an SRAM L1 cache, then an SRAM L2 cache between L1 and main memory.

Assume an address space of m bits (M = 2^m addresses). A cache consists of S = 2^s sets, each containing E lines. Each line holds a block of B = 2^b bytes, a valid bit, and t = m‑(s+b) tag bits.
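
As a worked example (the concrete numbers are assumptions for illustration, not from the original): with m = 32-bit addresses, S = 1024 sets (s = 10), and 64-byte blocks (b = 6), each address splits into t = 32 - (10 + 6) = 16 tag bits, 10 set-index bits, and 6 block-offset bits.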

With E = 1 the cache is direct‑mapped. The following example illustrates a direct‑mapped cache.

When the CPU reads a word w, it checks the cache first. If the selected line is valid and its tag matches, a hit occurs; otherwise the block containing w is fetched from the next level, possibly evicting an existing block. The lookup proceeds in three steps:

Set selection – use the s set-index bits of w's address to pick one of the S sets.

Line matching – check the valid bit and compare the t tag bits of the address with the tag stored in the line.

Word extraction – use the b block-offset bits to locate the requested word within the block.

If a miss occurs, the fetched block replaces the line in the selected set (with E = 1 there is only one candidate, so the replacement policy for a direct‑mapped cache is simply to overwrite it).
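
Putting the three steps together, a read-path sketch of a direct-mapped cache might look like the following. This is illustrative only: the geometry constants, the read_byte name, and the fetch_block helper (a stand-in for the next level of the hierarchy) are assumptions, not part of the original article.

#include <stdint.h>

#define SET_BITS    4                     /* s = 4, so S = 16 sets (assumed) */
#define OFFSET_BITS 6                     /* b = 6, so B = 64-byte blocks (assumed) */
#define SETS        (1u << SET_BITS)
#define BLOCK_SIZE  (1u << OFFSET_BITS)

struct cache_line {                       /* E = 1: exactly one line per set */
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_line cache[SETS];

/* Hypothetical stand-in for fetching a block from the next level. */
void fetch_block(uint32_t block_addr, uint8_t *dst);

uint8_t read_byte(uint32_t addr) {
    uint32_t offset = addr & (BLOCK_SIZE - 1);              /* block-offset bits */
    uint32_t set    = (addr >> OFFSET_BITS) & (SETS - 1);   /* set selection */
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);     /* line matching */
    struct cache_line *line = &cache[set];

    if (!(line->valid && line->tag == tag)) {
        /* miss: overwrite the single line in this set with the new block */
        fetch_block(addr & ~(BLOCK_SIZE - 1), line->data);
        line->valid = 1;
        line->tag   = tag;
    }
    return line->data[offset];                               /* word extraction */
}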

Writing Cache‑Friendly Code

Two basic strategies:

Make the common case fast.

Minimize cache misses inside loops.

#define N 1024              /* array length; any compile-time constant works here */

/* Sums the elements of v; the stride-1 access pattern over v is what
   makes this loop cache friendly. */
int sumvec(int v[N]) {
    int i, sum = 0;
    for (i = 0; i < N; i++) {
        sum += v[i];
    }
    return sum;
}

The loop variable i and the accumulator sum have good temporal locality; the array v is accessed with stride 1, giving good spatial locality. After the miss that loads a given block of v, the remaining accesses to that block hit the cache.
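
As a rough, back-of-the-envelope calculation (the block size is assumed here, not stated in the original): with 4-byte ints and 32-byte cache blocks, each miss brings in 8 consecutive elements of v, so at most 1 access in 8 misses, a cold-cache miss rate of about 12.5% for this loop.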

Conclusion

Understanding the memory hierarchy is crucial for performance. Data in registers can be accessed in 0 cycles, in cache in 4‑75 cycles, in main memory in hundreds of cycles, and on disk in millions of cycles. By placing frequently used data higher in the hierarchy, programmers can significantly reduce latency.

References

This note is based on the University of Washington course “The Hardware/Software Interface” and the textbook CSAPP (Computer Systems: A Programmer’s Perspective).

The Hardware/Software Interface (CSE 351, University of Washington): https://courses.cs.washington.edu/courses/cse351/17wi/videos.html

Computer Systems: A Programmer's Perspective (CSAPP): https://book.douban.com/subject/26912767/

