
Boost Code Performance by Leveraging CPU Cache Principles

This article explains how CPU caches bridge the speed gap between the processor and main memory. It covers the cache hierarchy, locality principles, write policies, and coherence protocols, and provides concrete C code examples plus practical tips, such as data alignment and loop restructuring, for improving cache hit rates and overall program speed.

Linux Kernel Journey

1. CPU Cache Overview

Modern CPUs run at several GHz while a main-memory access still takes hundreds of clock cycles, a gap as lopsided as a tortoise-and-hare race. The CPU cache, a small high-speed buffer between the CPU and memory, narrows this gap by keeping frequently accessed data close to the processor.

1.1 What is a Cache?

Cache acts as a temporary “transfer station” that speeds up data reads. When the CPU needs data, it first checks the cache; a hit allows immediate access, while a miss forces a slower memory read.

1.2 Cache Levels

Typical CPUs have three levels: L1 (fastest, ~32 KB per core, split into data and instruction caches), L2 (larger, ~1 MB per core), and L3 (shared, up to dozens of MB). Data moves from L1 → L2 → L3 before reaching memory.

2. Core Principles

2.1 Locality of Reference

Two forms drive cache effectiveness:

Temporal locality : recently accessed data is likely to be accessed again soon (e.g., a loop variable sum in a simple C loop).

Spatial locality : data near a recently accessed address will likely be accessed next (e.g., consecutive array elements).

Cache lines (typically 64 bytes) are the smallest transfer unit; when a line is fetched, neighboring data is brought in as well.

2.2 Cache Hit vs. Miss

A hit lets the CPU retrieve data directly from cache, dramatically reducing latency. A miss requires a memory read, incurring many clock cycles and possibly triggering replacement algorithms such as LRU.

2.3 Cache Line Size

Cache line size balances utilization and bandwidth. A 64‑byte line can hold 16 int values; accessing one element brings the whole line into cache, so subsequent accesses to its neighbors hit. Too small a line squanders spatial locality; too large a line wastes bandwidth and cache capacity on data that may never be used.

3. Write Policies

3.1 Write‑Through

Every write updates both cache and memory, guaranteeing consistency (useful for databases) but incurring high memory traffic and lower performance.

3.2 Write‑Back

Writes modify only the cache; the dirty line is written back to memory only when evicted. This reduces memory traffic and improves speed, but requires a dirty‑bit and may lose data on power loss unless mitigated by write buffers or logging.

4. Multi‑Core Cache Coherence

Each core has its own caches, leading to coherence problems when one core updates data that another core has cached.

4.1 Bus Snooping

All cores monitor the bus; a write request triggers invalidation or update messages to other cores. Simple but can saturate the bus as core count grows.

4.2 MESI Protocol

Cache lines can be in Modified (M), Exclusive (E), Shared (S), or Invalid (I) states. State transitions ensure that only one core holds the latest copy while others see a consistent view, reducing unnecessary bus traffic.

[Figure: MESI state diagram]

5. Practical Optimizations

5.1 Data Alignment

Aligning structure members to their natural boundaries (e.g., 4‑byte alignment for an int) keeps each field within a single cache line, reducing the number of lines fetched per access. Misaligned data may straddle two lines, turning one logical read into two line fetches.

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <time.h>

// Packed struct: members are placed back to back, so b sits at an
// odd offset and its reads can straddle a cache-line boundary
struct __attribute__((packed)) UnalignedStruct {
    char a;
    int b;
    char c;
};

// Default layout: the compiler inserts padding so b lands on a
// 4-byte boundary and never straddles a line
struct AlignedStruct {
    char a;
    int b;
    char c;
};

void testAccessTime(void *arr, size_t numElements, size_t structSize,
                    size_t offB, size_t offC) {
    volatile long sink = 0;  // keeps the reads from being optimized away
    clock_t start = clock();
    for (size_t i = 0; i < numElements; i++) {
        char *ptr = (char *)arr + i * structSize;
        sink += *ptr;                  // member a (offset 0)
        sink += *(int *)(ptr + offB);  // member b at its real offset
        sink += *(ptr + offC);         // member c at its real offset
    }
    clock_t end = clock();
    printf("Access time: %f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
}

int main(void) {
    size_t numElements = 1000000;
    struct UnalignedStruct *unalignedArr = malloc(numElements * sizeof(struct UnalignedStruct));
    struct AlignedStruct *alignedArr = malloc(numElements * sizeof(struct AlignedStruct));
    if (unalignedArr == NULL || alignedArr == NULL) {
        perror("malloc failed");
        return 1;
    }
    printf("Testing unaligned (packed) struct:\n");
    testAccessTime(unalignedArr, numElements, sizeof(struct UnalignedStruct),
                   offsetof(struct UnalignedStruct, b),
                   offsetof(struct UnalignedStruct, c));
    printf("Testing aligned struct:\n");
    testAccessTime(alignedArr, numElements, sizeof(struct AlignedStruct),
                   offsetof(struct AlignedStruct, b),
                   offsetof(struct AlignedStruct, c));
    free(unalignedArr);
    free(alignedArr);
    return 0;
}

On typical x86 hardware the aligned version runs measurably faster, because reads of b in the packed layout periodically straddle two cache lines; on architectures without hardware support for unaligned access, such reads can be far slower or even fault. This illustrates the impact of proper alignment on cache efficiency.

5.2 Loop Optimization

Loops dominate program execution; their access patterns affect cache hit rates.

Reduce nesting depth to lower data‑access complexity.

Arrange loops to follow memory layout (row‑major for C arrays) to exploit spatial locality.

for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j]; // Row‑major access – good cache behavior
    }
}

Conversely, column‑major traversal forces frequent cache line changes and increases miss rates.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Performance Tuning · Memory Hierarchy · CPU cache · cache optimization · cache coherence · MESI Protocol
Written by Linux Kernel Journey