Fundamentals 19 min read

Why Row‑Major Traversal Beats Column‑Major: Unveiling Cache, Prefetch, and False‑Sharing Secrets

This article builds a practical hardware‑mind model by benchmarking Rust code to show how cache layout, prefetching, cache associativity, false sharing, pipeline stalls, and data dependencies affect the performance of row‑major versus column‑major traversals, random accesses, and multithreaded loops, and it offers concrete fixes such as cache‑line alignment.

dbaplus Community

Dec 22, 2024

Why Row‑Major Traversal Beats Column‑Major: Unveiling Cache, Prefetch, and False‑Sharing Secrets

Cache fundamentals and benchmark goal

Modern CPUs read memory in cache lines (typically 64 bytes). When a program accesses a memory location, the CPU first tries to fetch the corresponding cache line; a miss forces a load from main memory, which is orders of magnitude slower. The article demonstrates these effects with a series of micro‑benchmarks written in Rust.

Row‑major vs. column‑major traversal

Two functions iterate over a square 2‑D vector stored as Vec<Vec<usize>>:

pub fn row_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[i][j] += j;
        }
    }
}

pub fn column_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[j][i] += j;
        }
    }
}

Because Rust’s Vec<Vec<usize>> stores rows contiguously, row_major_traversal accesses memory sequentially and stays within the same cache line, while column_major_traversal jumps to a new line on each inner iteration, causing many cache misses. Benchmarks show roughly a ten‑fold speed advantage for the row‑major version.

Random access and hardware prefetcher

A third variant picks a random column on each inner loop:

pub fn random_access(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            let ri: usize = rand::random();
            arr[j][ri % n] += j;
        }
    }
}

Random access defeats the hardware prefetcher, which tries to predict future accesses from the instruction stream. The benchmark shows random access is even slower than column‑major traversal because the CPU cannot load useful cache lines ahead of time.

Cache associativity

Three mapping schemes are relevant:

Fully associative : any memory line may occupy any cache line (flexible but slower lookup).

Direct‑mapped : each memory line maps to exactly one cache line (fast lookup but prone to conflict misses).

Set‑associative : the cache is divided into sets; a memory line maps to a specific set, and within that set any line can be used. This balances speed and flexibility.

Example: an AMD Ryzen 7 4700U has an 8‑way set‑associative L1 cache of 128 KB with 64‑byte lines, giving 2048 cache lines and 256 sets. When a stride equals a power of two that matches the number of sets, accesses may repeatedly map to the same set, increasing conflict misses.

False sharing

Four threads increment atomic counters. The first version shares a single AtomicUsize among all threads ( share); the second version uses four separate atomics ( false_share), which one would expect to be faster. However, the benchmark shows the opposite because the four atomics are placed on the same cache line, causing false sharing.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn increase(v: &AtomicUsize) {
    for _ in 0..10_000 {
        v.fetch_add(1, Ordering::Relaxed);
    }
}

pub fn share() {
    let v = AtomicUsize::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| increase(&v));
        let t2 = s.spawn(|| increase(&v));
        let t3 = s.spawn(|| increase(&v));
        let t4 = s.spawn(|| increase(&v));
        t1.join().unwrap(); t2.join().unwrap(); t3.join().unwrap(); t4.join().unwrap();
    });
}

pub fn false_share() {
    let a = AtomicUsize::new(0);
    let b = AtomicUsize::new(0);
    let c = AtomicUsize::new(0);
    let d = AtomicUsize::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| increase(&a));
        let t2 = s.spawn(|| increase(&b));
        let t3 = s.spawn(|| increase(&c));
        let t4 = s.spawn(|| increase(&d));
        t1.join().unwrap(); t2.join().unwrap(); t3.join().unwrap(); t4.join().unwrap();
    });
}

Solution: align each atomic to a separate cache line using #[repr(align(64))] (or the platform‑specific equivalent).

#[repr(align(64))]
struct AlignAtomicUsize { v: AtomicUsize }

impl AlignAtomicUsize {
    fn new(val: usize) -> Self { Self { v: AtomicUsize::new(val) } }
}

fn increase_2(v: &AlignAtomicUsize) {
    for _ in 0..10_000 {
        v.v.fetch_add(1, Ordering::Relaxed);
    }
}

pub fn non_share() {
    let a = AlignAtomicUsize::new(0);
    let b = AlignAtomicUsize::new(0);
    let c = AlignAtomicUsize::new(0);
    let d = AlignAtomicUsize::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| increase_2(&a));
        let t2 = s.spawn(|| increase_2(&b));
        let t3 = s.spawn(|| increase_2(&c));
        let t4 = s.spawn(|| increase_2(&d));
        t1.join().unwrap(); t2.join().unwrap(); t3.join().unwrap(); t4.join().unwrap();
    });
}

After alignment the benchmark shows almost a two‑fold speedup.

Pipeline, branch prediction and data dependency

Iterating over a vector of boxed trait objects when the vector is sorted versus when it is shuffled demonstrates the impact of branch prediction. The shuffled case is about 2.7× slower because each iteration incurs a mispredicted branch, which flushes the pipeline.

pub trait Get { fn get(&self) -> i32; }

struct Foo; impl Get for Foo { fn get(&self) -> i32 { 1 } }
struct Bar; impl Get for Bar { fn get(&self) -> i32 { 2 } }

let mut arr: Vec<Box<dyn Get>> = Vec::new();
for _ in 0..10_000 { arr.push(Box::new(Foo)); }
for _ in 0..10_000 { arr.push(Box::new(Bar)); }

// Sorted order
let count_sorted = arr.iter().filter(|v| v.get() > 0).count();

// Shuffle and test again
use rand::seq::SliceRandom;
arr.shuffle(&mut rand::thread_rng());
let count_shuffled = arr.iter().filter(|v| v.get() > 0).count();

Modern CPUs use a pipeline that keeps several instruction stages active simultaneously. When a branch is mispredicted, the pipeline must be cleared and refilled, incurring a large penalty.

Data‑dependency example

pub fn dependent(a: &mut Vec<i32>, b: &mut Vec<i32>, c: &Vec<i32>) {
    for i in 0..=99_998 {
        a[i] += b[i];
        b[i + 1] += c[i];
    }
    a[9_999] += b[9_999];
}

pub fn independent(a: &mut Vec<i32>, b: &mut Vec<i32>, c: &Vec<i32>) {
    a[0] += b[0];
    for i in 0..=99_998 {
        b[i + 1] += c[i];
        a[i + 1] += b[i + 1];
    }
}

In dependent each iteration depends on the result of the previous one, creating a long data‑dependency chain that stalls the pipeline and prevents SIMD vectorization. The independent version removes that chain, allowing the compiler to generate SIMD instructions (shown in the original assembly). Benchmarks report up to a three‑fold speed improvement.

The same pattern in C++ (using int and std::vector) exhibits the same effect; the compiler can emit movdqu, paddd, and other SIMD instructions for the independent version.

Key takeaways

Iterate over memory in the order it is laid out (row‑major for row‑contiguous arrays) to maximise cache line reuse.

Random or irregular access patterns defeat hardware prefetchers and dramatically increase latency.

Cache associativity determines how often conflict misses occur; set‑associative caches provide a practical compromise.

False sharing occurs when distinct variables share a cache line; aligning data to cache‑line boundaries eliminates the contention.

Branch mispredictions flush the pipeline; keeping hot loops predictable (e.g., sorted data) preserves instruction‑level parallelism.

Removing intra‑iteration data dependencies enables compiler vectorization and yields large performance gains.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance rust pipeline false sharing CPU architecture

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.