
Understanding CPU Cache: Types, Structure, and Performance Optimization

This article explains why CPU caches are needed, describes the hierarchy and internal structure of cache lines, compares direct‑mapped, set‑associative and fully‑associative caches, and shows how cache‑aware coding can dramatically improve program execution speed.

Liangxu Linux

Why is a CPU cache needed?

Modern computer architectures use a hierarchy of components—CPU registers, CPU cache, main memory, and storage—where speed decreases and capacity increases from registers to disk. Because main memory is much slower than the CPU, a fast intermediate storage called the CPU cache is introduced to bridge the speed gap.

Typical access latencies, measured in clock cycles, are roughly: L1 cache 4, L2 cache 11, L3 cache 24, main memory 167.

Cache hierarchy and basic concepts

Most CPUs have three cache levels (L1, L2, L3). L1 is split into data and instruction caches and is private to each core. L2 is also private, while L3 is shared among all cores. When a requested address is not found in a higher‑level cache, the lookup proceeds to the next level, eventually falling back to main memory or storage.
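
As a rough illustration of this fall‑through behaviour, here is a toy model in C++ (invented for this article, not how hardware is built; the sets merely record which line addresses each level holds, and the returned costs are the cycle counts quoted above):

#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Toy model of the lookup cascade. Each level records the line
// addresses it currently holds; a miss falls through to the next level.
struct ToyHierarchy {
    std::unordered_set<uint64_t> l1, l2, l3;

    int access(uint64_t addr) {
        uint64_t line = addr / 64;            // assume 64-byte cache lines
        if (l1.count(line)) return 4;                        // L1 hit
        if (l2.count(line)) { l1.insert(line); return 11; }  // L2 hit, fill L1
        if (l3.count(line)) { l2.insert(line); l1.insert(line); return 24; }
        l1.insert(line); l2.insert(line); l3.insert(line);   // miss: fill all
        return 167;                                          // main memory
    }
};

int main() {
    ToyHierarchy h;
    std::printf("cold access: %d cycles\n", h.access(0x1000));  // 167
    std::printf("warm access: %d cycles\n", h.access(0x1000));  // 4
    return 0;
}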

Cache line and locality

Data moves between cache and memory in fixed‑size blocks called cache lines (commonly 64 bytes). Because programs exhibit spatial locality, fetching a whole line pulls adjacent data into the cache, increasing the chance that subsequent accesses hit.
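
Cache line size varies by processor; on Linux it can be queried at run time (a quick check, assuming glibc, which exposes the value as a sysconf extension):

#include <cstdio>
#include <unistd.h>

int main() {
    // _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; on systems that
    // do not expose the value, sysconf may return 0 or -1.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 data cache line size: %ld bytes\n", line);  // typically 64
    return 0;
}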

Direct‑mapped cache

In a direct‑mapped cache each memory address is divided into three fields: Tag, Index, and Offset. The Index selects exactly one cache line; the Tag stored in that line is compared with the address's Tag to determine a hit. If the Tags differ, the access is a miss and the line is replaced with the newly fetched block.

Example: with 8 cache lines (assuming 8‑byte lines, so 3 Offset bits and 3 Index bits), addresses 0x00, 0x40, and 0x80 all fall on Index 0 and map to the same line, causing repeated misses (cache thrashing) when accessed in sequence.
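
To make the field split concrete, the sketch below decodes these addresses for the cache just described (8 lines of 8 bytes, giving 3 Offset bits and 3 Index bits):

#include <cstdint>
#include <cstdio>

// Hypothetical direct-mapped cache: 8 lines x 8 bytes.
constexpr int kOffsetBits = 3;   // log2(8-byte line)
constexpr int kIndexBits  = 3;   // log2(8 lines)

int main() {
    const uint64_t addrs[] = {0x00, 0x40, 0x80};
    for (uint64_t addr : addrs) {
        uint64_t offset = addr & ((1u << kOffsetBits) - 1);
        uint64_t index  = (addr >> kOffsetBits) & ((1u << kIndexBits) - 1);
        uint64_t tag    = addr >> (kOffsetBits + kIndexBits);
        std::printf("addr 0x%02lx -> Tag 0x%lx, Index %lu, Offset %lu\n",
                    (unsigned long)addr, (unsigned long)tag,
                    (unsigned long)index, (unsigned long)offset);
    }
    // All three addresses report Index 0: they compete for one line.
    return 0;
}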

Set‑associative cache

In a set‑associative cache each Index selects a set of several cache lines (two, in a 2‑way design). On a lookup, every line in the set is checked for a matching Tag. This reduces thrashing compared with a direct‑mapped cache.

Example: with an 8‑line, 2‑way set‑associative cache, a virtual address with Tag 0x19 and Index 1 finds two candidate lines; the second line’s Tag matches, resulting in a hit.

If all lines in a set are occupied, replacement policies such as PLRU, NRU, FIFO, or round‑robin are used.
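
The following minimal sketch (a software model written for illustration, not hardware) shows a single 2‑way set with FIFO replacement:

#include <cstdint>
#include <cstdio>

// One set of a 2-way set-associative cache with FIFO replacement
// (hardware compares both ways in parallel; this loop is sequential).
struct TwoWaySet {
    uint64_t tag[2]   = {0, 0};
    bool     valid[2] = {false, false};
    int      next_victim = 0;              // FIFO: evict ways in turn

    bool access(uint64_t t) {
        for (int w = 0; w < 2; ++w)        // check both ways for a Tag match
            if (valid[w] && tag[w] == t) return true;
        tag[next_victim] = t;              // miss: evict FIFO victim, refill
        valid[next_victim] = true;
        next_victim = 1 - next_victim;
        return false;
    }
};

int main() {
    TwoWaySet set;
    set.access(0x19); set.access(0x2a);    // two conflicting Tags coexist
    std::printf("0x19 hit: %d, 0x2a hit: %d\n",
                set.access(0x19), set.access(0x2a));   // both hit: 1, 1
    return 0;
}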

Fully associative cache

In a fully associative cache any memory block can be placed in any cache line, so there is no Index field. On a lookup, the Tag of every line must be compared. This eliminates conflict misses entirely, at the cost of the most complex comparison hardware.
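
In software terms, the lookup reduces to a linear scan over every line's Tag (a sketch only; real hardware performs all comparisons in parallel, one comparator per line):

#include <cstdint>
#include <vector>

// Fully associative lookup: with no Index field, every line's Tag
// (here, simply the full line address) must be compared.
bool lookup(const std::vector<uint64_t>& line_tags, uint64_t tag) {
    for (uint64_t t : line_tags)
        if (t == tag) return true;   // any line may hold any memory block
    return false;
}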

Writing cache‑friendly code

Access patterns that traverse data sequentially in memory make better use of cache lines. The following 1024 × 1024 matrix‑sum example demonstrates the effect.

#include <cstdio>

const int row = 1024;
const int col = 1024;
// File-scope (static) storage: a 4 MiB array on the stack would overflow
// the default stack limit on most systems. It is also zero-initialized.
static int matrix[row][col];

int main() {
    // Row-major traversal (fast): walks memory in address order,
    // so consecutive accesses fall in the same cache line.
    int sum_row = 0;
    for (int r = 0; r < row; ++r) {
        for (int c = 0; c < col; ++c) {
            sum_row += matrix[r][c];
        }
    }
    // Column-major traversal (slow): each access jumps a full row
    // (4 KiB) ahead, touching a different cache line every time.
    int sum_col = 0;
    for (int c = 0; c < col; ++c) {
        for (int r = 0; r < row; ++r) {
            sum_col += matrix[r][c];
        }
    }
    // Print the sums so the optimizer cannot discard the loops.
    std::printf("sum_row=%d sum_col=%d\n", sum_row, sum_col);
    return 0;
}

Benchmarking on the same machine shows the row‑major version completing in about 2 ms, while the column‑major version takes about 8 ms, illustrating the impact of cache line utilization.
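
To reproduce such a measurement, each loop can be wrapped in a small std::chrono timer; time_ms below is a helper written for this article, and absolute numbers will vary with hardware and compiler flags:

#include <chrono>

// Helper (not from the original benchmark): times a callable in milliseconds.
template <typename F>
double time_ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();   // e.g. the row-major or column-major loop above
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}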

When iterating row‑wise, consecutive elements reside in the same cache line, so subsequent accesses hit the cache. Column‑wise iteration jumps to a different cache line for each element, causing frequent cache misses and slower execution.

Key takeaways

CPU caches dramatically reduce memory latency by storing frequently accessed data close to the processor.

Understanding cache line size, mapping schemes (direct‑mapped, set‑associative, fully associative), and replacement policies helps predict performance.

Writing code that accesses memory sequentially and respects cache line boundaries yields significant speedups.
