
Understanding CPU Cache: Purpose, Multi‑Level Design, Cache Lines, and Optimization Techniques

This article explains why CPU caches are needed, the evolution to multi‑level caches, the concept of cache lines, practical experiments demonstrating their impact, and how different cache organization strategies such as fully associative, direct‑mapped, and N‑way set‑associative affect performance and eviction policies.

Qunar Tech Salon

CPU caches exist because processor clock speeds have grown far faster than DRAM speeds over recent decades, opening a latency gap of roughly two orders of magnitude between the CPU core and main memory and making direct memory access a serious bottleneck.

To bridge this gap, a relatively fast but costly SRAM layer—the CPU cache—is placed between the CPU and main memory, providing high-speed storage for frequently accessed data.

As the volume of hot data grew, a single cache level became insufficient, leading to the introduction of a second-level (L2) cache between the L1 cache and main memory, trading some speed for larger capacity at lower cost.

Modern CPUs also split L1 into separate instruction (L1i) and data (L1d) caches to handle differing access patterns.

A cache line is the smallest unit of data stored in a CPU cache, typically 64 bytes. For example, a 512‑byte L1 cache can hold eight cache lines.

To observe cache‑line effects, the following C program creates an integer array of size N (provided via command line) and accesses it one billion times, measuring total execution time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Returns the elapsed time between two clock() readings in milliseconds. */
long timediff(clock_t t1, clock_t t2) {
    return (long)(((double)t2 - t1) / CLOCKS_PER_SEC * 1000);
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <array_size>\n", argv[0]);
        return 1;
    }
    int array_size = atoi(argv[1]);
    int repeat_times = 1000000000;
    long array[array_size];
    for (int i = 0; i < array_size; i++) {
        array[i] = 0;
    }
    int j = 0, k = 0;
    volatile long c = 0;  /* volatile keeps the loop from being optimized away */
    clock_t start = clock();
    while (j++ < repeat_times) {
        if (k == array_size) k = 0;
        c = array[k++];
    }
    clock_t end = clock();
    (void)c;
    printf("%ld\n", timediff(start, end));
    return 0;
}

Once the array grows beyond a single 64-byte cache line (eight 8-byte longs), execution time shows a noticeable jump, because the sequential accesses now span multiple cache lines and trigger additional line fills.

Understanding cache lines helps developers write cache‑friendly code, such as preferring row‑major order in two‑dimensional arrays to keep accesses within the same line.

Example of loop ordering:

for(int i = 0; i < n; i++) {
    for(int j = 0; j < n; j++) {
        arr[i][j] = num; // faster
    }
}

for(int i = 0; i < n; i++) {
    for(int j = 0; j < n; j++) {
        arr[j][i] = num; // slower due to cache‑line crossing
    }
}

Cache storage designs:

Hash‑based cache: Uses a hash of the physical address as an index; simple but unrealistic for hardware due to high latency.

Fully associative: Any memory block can occupy any cache line, requiring a full scan to check presence, which is too slow for hardware.

Direct‑mapped: Each address maps to exactly one line, leading to poor utilization when many addresses share the same line (e.g., low‑indexed lines become hot while others stay idle).

N‑Way Set Associative: The cache is divided into sets of N lines that share an index; only the N lines of one set must be searched per lookup. Common designs (e.g., 8‑ or 16‑way) reduce conflict misses while keeping lookup fast, balancing speed and hardware complexity.

Set index bits are taken from the lower address bits because most accesses are spatially close; using higher bits would cause many accesses to map to the same set, increasing conflict misses.

Cache eviction policies are typically LRU, which generally yields higher hit rates than random replacement, though random can perform better for very large caches.

In summary, while CPU caches operate transparently, understanding their architecture—cache lines, multi‑level hierarchy, associativity, and eviction policies—enables developers to write more cache‑efficient programs.

References:

Gallery of Processor Cache Effects

How Misaligning Data Can Increase Performance 12x by Reducing Cache Misses

Introduction to Caches
