Fundamentals 12 min read

Why CPU Cache Matters: Unlock Faster Code Execution

This article explains the purpose of CPU caches, their hierarchical structure and internal designs—including direct‑mapped, set‑associative, and fully‑associative caches—and demonstrates how understanding cache behavior can dramatically improve program performance, illustrated with C++ traversal benchmarks.

Open Source Linux

Aug 17, 2023

Why CPU Cache Matters: Unlock Faster Code Execution

Why CPU Cache Is Needed

Modern computer architecture uses a memory hierarchy—CPU registers, CPU cache, main memory, and disk—where speed decreases and capacity increases from registers to disk, balancing cost and performance. Because disk access is much slower than memory, components like Redis or Memcached are used to accelerate data retrieval, and the speed gap between CPU and memory led to the introduction of CPU caches.

The following table shows typical latency (in clock cycles) for each level:

Storage Type | Clock Cycles
L1 cache    | 4
L2 cache    | 11
L3 cache    | 24
Memory      | 167

Most CPUs have three cache levels (L1, L2, L3). L1 is split into data and instruction caches, L2 is private to each core, and L3 is shared among all cores. If data is not found in any cache level, it is fetched from main memory, and ultimately from disk if necessary.

Internal Structure of CPU Cache

CPU cache reads memory in blocks called cache lines , exploiting spatial locality: when a word is accessed, adjacent words are likely to be accessed soon and are therefore loaded into the cache.

Cache lines can be organized as:

Direct‑mapped cache

Set‑associative cache

Fully‑associative cache

Direct‑Mapped Cache

A memory address is divided into Tag , Index , and Offset . The Index selects a single cache line; the Tag is compared to verify a hit. If the Tag matches, the data is retrieved; otherwise a miss occurs and the line is replaced.

Example: with 8 cache lines, addresses 0x00, 0x40, and 0x80 share the same Index and map to the same line, causing cache thrashing when accessed repeatedly.

Set‑Associative Cache (Two‑Way Example)

Instead of a single line, the Index points to a set containing multiple lines (e.g., two). Both lines are examined; a hit occurs if either Tag matches. Replacement policies such as PLRU, NRU, FIFO, or round‑robin decide which line to evict on a miss.

Example: with an 8‑line, two‑way cache and address 0x19 (Tag) / index 1, the first line’s Tag (0x10) misses, while the second line’s Tag (0x19) hits.

Fully‑Associative Cache

All cache lines belong to a single set; the Index is omitted. On a lookup, every line’s Tag is compared until a match is found. This eliminates cache thrashing but requires the most hardware resources.

Writing Cache‑Friendly Code

Consider summing a 1024×1024 matrix. Two traversal orders are compared:

const int row = 1024;
const int col = 1024;
int matrix[row][col];
// Row‑major traversal
int sum_row = 0;
for (int r = 0; r < row; ++r)
    for (int c = 0; c < col; ++c)
        sum_row += matrix[r][c];
// Column‑major traversal
int sum_col = 0;
for (int c = 0; c < col; ++c)
    for (int r = 0; r < row; ++r)
        sum_col += matrix[r][c];

Benchmark results (using std::chrono) show row‑major traversal completes in ~2 ms, while column‑major takes ~8 ms.

#include <chrono>
#include <iostream>
const int row = 1024;
const int col = 1024;
int matrix[row][col];
int main() {
    for (int r = 0; r < row; ++r)
        for (int c = 0; c < col; ++c)
            matrix[r][c] = r + c;
    auto start = std::chrono::steady_clock::now();
    int sum = 0;
    for (int r = 0; r < row; ++r)
        for (int c = 0; c < col; ++c)
            sum += matrix[r][c];
    auto finish = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
    std::cout << duration.count() << "ms" << std::endl;
}

Row‑major traversal benefits from spatial locality: consecutive elements reside in the same cache line, leading to high cache‑hit rates. Column‑major traversal jumps to a different row each step, causing frequent cache misses and slower performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Tuning CPU cache Cache Optimization cache hierarchy memory locality

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.