Understanding Memory Hierarchy and Cache: Principles of Locality and Cache‑Friendly Programming
This article explains the memory hierarchy—from registers to disks—covers cache fundamentals, the principle of locality, cache organization (including direct‑mapped caches), and provides guidelines and example code for writing cache‑friendly programs to improve performance.
Overview
In a simple computer system model, the memory system can be viewed as a linear byte array that the CPU can access in constant time. In reality, memory consists of a hierarchy of devices with varying capacity, cost, and latency: fast registers, small high‑speed cache memory close to the CPU, larger main memory (DRAM), and even larger, slower disks, possibly remote over a network.
Cache Basic Model
The CPU fetches instructions and data from main memory over a bus, but the speed gap between the CPU and main memory creates a bottleneck, causing many cycles to be wasted on memory accesses. To mitigate this, a cache layer is inserted between the CPU and main memory, reducing the number of slow memory accesses.
Cache stores frequently or recently accessed instructions and data, offering faster access at the cost of smaller capacity.
Principle of Locality
Cache works well because most programs exhibit locality: they tend to reference data near other recently accessed data (spatial locality) or the same data repeatedly within a short time (temporal locality). Temporal locality means a memory address accessed once is likely to be accessed again soon; spatial locality means nearby addresses are likely to be accessed together.
Programs exploit locality by accessing loop variables (temporal) and arrays with unit stride (spatial). Example data‑side observations:
Variable sum is accessed each loop iteration (temporal locality).
Array a is accessed with stride 1 (spatial locality).
Instruction‑side observations:
Loop iterations exhibit temporal locality.
Linear instruction flow exhibits spatial locality.
Writing programs with good locality improves execution speed.
Memory Hierarchy
The hierarchy progresses from fast, small, expensive devices to slow, large, cheap ones: CPU registers (single‑cycle access), SRAM caches (few cycles), DRAM main memory (tens to hundreds of cycles), local disks (much slower), and optionally remote network storage accessed via protocols like NFS.
Each level k acts as a cache for the slower level k+1, storing a subset of its blocks. Data moves between levels in block‑sized units; block size may differ between adjacent levels (e.g., tens of bytes between L1/L2, hundreds of bytes between higher levels).
Cache Memory
Early computers had three levels: registers, DRAM main memory, and disk. As the CPU‑memory speed gap widened, designers added a small SRAM L1 cache, then a larger SRAM L2 cache.
Assume an address space of m bits (M = 2^m addresses). A cache is organized as S = 2^s sets, each containing E lines. Each line holds a block of B = 2^b bytes, a valid bit, and t = m‑(s+b) tag bits.
When E = 1 the cache is direct‑mapped. The access process consists of:
Set selection – use the s set‑index bits of the address to select one of the S sets.
Line match – check that the line's valid bit is set and that the address's t tag bits equal the line's stored tag.
Word extraction – use the b block‑offset bits to locate the requested word within the block.
If a miss occurs, the required block is fetched from the next level; since each set in a direct‑mapped cache holds exactly one line, there is no replacement choice to make: the new block simply overwrites the existing line.
Writing Cache‑Friendly Code
Two basic strategies:
Make the common case fast.
Minimize cache misses inside loops.
int sumvec(int v[N]) {
    int i, sum = 0;
    for (i = 0; i < N; i++) {
        sum += v[i];
    }
    return sum;
}

The loop variable and accumulator have good temporal locality; accessing the array with stride 1 gives good spatial locality, leading to a high cache‑hit rate.
Conclusion
Understanding the memory hierarchy is crucial for performance: data in registers is accessed in 0 cycles, in cache in ~4‑75 cycles, in main memory in hundreds of cycles, and on disk in millions of cycles. By placing frequently used data higher in the hierarchy, programmers can significantly speed up execution.
References
This article is based on the University of Washington course “The Hardware/Software Interface” (CS:APP – “Computer Systems: A Programmer’s Perspective”).
https://courses.cs.washington.edu/courses/cse351/17wi/videos.html
https://book.douban.com/subject/26912767/