
How CPU Cache Works and How to Write Faster Code

Understanding CPU cache hierarchy, its speed advantages over memory, and the mechanics of cache lines, tags, and offsets reveals why code that maximizes cache hit rates—through sequential data access, branch prediction, and core affinity—can run dramatically faster on modern processors.

Liangxu Linux

Introduction

How code is written directly affects how efficiently the CPU executes it, especially for compute-intensive programs. The CPU cache is a small, very fast memory located close to the core, and it can serve data roughly two orders of magnitude faster than main memory (a few clock cycles versus a few hundred). Code that cooperates with the cache can therefore run dramatically faster.

CPU Cache Speed and Levels

Typical access latencies are:

L1 cache: 2–4 clock cycles

L2 cache: 10–20 clock cycles

L3 cache: 20–60 clock cycles

Main memory: 200–300 clock cycles

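Modern CPUs often have three cache levels. L1 and L2 are typically private to each core, while L3 is usually shared among the cores of a chip. For example, a CPU might have 32 KB of L1 and 256 KB of L2 per core, plus a 3 MB L3.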

Cache Line and Direct‑Mapped Cache

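Data is transferred between memory and the cache in fixed-size blocks called cache lines, typically 64 bytes. On Linux the size is exposed in sysfs as coherency_line_size, for example in /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.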

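In a direct-mapped cache, each memory block maps to exactly one cache line, chosen by taking the block number modulo the number of cache lines. Each cache line stores:

Tag: identifies which memory block the line currently holds

Data: the actual bytes copied from memory

Valid bit: indicates whether the line holds usable data at all

The remaining piece comes from the memory address rather than from the line:

Offset: the low-order address bits, which select the specific word within the line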

To read data from the cache the CPU performs four steps:

Compute the cache line index from the memory address.

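Check the valid bit of the selected line. If it is clear, the access is a miss and the line must first be loaded from memory.

Compare the tag stored in the line with the tag bits of the address. A mismatch is also a miss, because the line holds a different memory block.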

Use the offset to extract the required word.

Other mapping strategies such as fully associative and set‑associative caches use similar structures but with more flexible line selection.

Writing Faster Code

Improving Data‑Cache Hit Rate

Accessing memory in the order it is laid out (row‑major for C‑style arrays) allows consecutive elements to be loaded into a cache line together, yielding high spatial locality.

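Example: in a nested loop where i is the outer index and j the inner one, traversing a 2-D array as array[i][j] is much faster than array[j][i]. The former follows the memory layout, so each cache line the CPU loads (e.g., 64 bytes, which is 16 four-byte ints) is reused for the next several accesses. The latter jumps a full row ahead on every access and, for large arrays, touches a different cache line each time.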

Improving Instruction‑Cache Hit Rate

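Branch prediction works best when the outcome of a conditional is predictable. Sorting data before a loop that contains an if test can make the branch highly predictable, because the condition then stays false for one long run and true for the rest. When the prediction is correct, the CPU can fetch the instructions of the predicted path ahead of time, so they are already in the instruction cache when they are needed.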

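In C/C++ you can hint the compiler with likely() and unlikely() to mark the expected outcome of a condition. These macros are not part of the standard language. They are commonly defined on top of GCC's __builtin_expect, as in the Linux kernel. The CPU's own dynamic branch prediction is usually accurate, so use these hints only when you are confident that the hardware mispredicts a branch whose typical outcome you know.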

Improving Multi‑Core Cache Locality

When a thread migrates between cores, the data it warmed up in the old core's private L1/L2 caches does not move with it, so it starts on the new core with cold caches and a lower hit rate. Pinning a thread to a specific core (CPU affinity) keeps its working set in the same caches.

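On Linux this can be done with sched_setaffinity, where a pid of 0 means the calling thread:

#define _GNU_SOURCE // must precede the includes to expose cpu_set_t and CPU_*
#include <sched.h>
#include <stdio.h>
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask); // bind to core 0
if (sched_setaffinity(0, sizeof(mask), &mask) == -1)
    perror("sched_setaffinity"); // e.g. core 0 is not available to this process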

Conclusion

Because a main-memory access costs on the order of a hundred times more than an L1 hit, modern processors rely on a multi-level cache hierarchy to bridge the gap. Maximizing cache hit rates significantly speeds up code execution, and the main tools are accessing data sequentially, writing predictable branches, and keeping threads on the same core.
