
Cache, Prefetching, False Sharing, Pipeline and Data Dependency: Performance Optimization in Rust

The article uses Rust benchmarks to show how cache layout, prefetching, associativity, false sharing, pipeline stalls, and loop data dependencies impact performance, and demonstrates practical optimizations such as row‑major traversal, proper alignment, avoiding dependent loops, and leveraging sequential access to achieve near‑optimal speed.

Tencent Cloud Developer

When striving for high‑performance code, developers often encounter bottlenecks that require a solid understanding of hardware behavior such as CPU caches, cache lines, prefetchers, and pipelines.

This article presents a series of benchmarks written in Rust to illustrate common optimization details and the underlying hardware concepts.

1. Cache and Traversal Order

Two traversal functions are compared: row_major_traversal (row‑wise) and column_major_traversal (column‑wise). Benchmarks show that row‑wise iteration is about ten times faster because successive elements lie on the same cache line, while column‑wise accesses cause frequent cache misses.

pub fn row_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[i][j] += j;
        }
    }
}

pub fn column_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[j][i] += j;
        }
    }
}

2. Prefetcher

When the access pattern is random, both random_access and column‑wise traversal suffer similar cache‑miss rates, confirming that hardware prefetchers cannot predict the accesses.

pub fn random_access(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        for j in 0..n {
            let ri: usize = rand::random();
            arr[j][ri % n] += j;
        }
    }
}

Hardware prefetching works well for sequential streams, while software prefetch instructions can be inserted explicitly by the compiler or the programmer.
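As a rough std-only illustration (not the article's benchmark), the sketch below compares a stride-1 sum with a dependent pointer chase over the same amount of data. In the chase, each address comes from the previous load, so the prefetcher cannot run ahead; the index permutation and the `3_000_017` stride are choices made here for the sketch:

```rust
use std::time::Instant;

// next[i] = (i + stride) % n; with gcd(stride, n) == 1 this forms one cycle
// that visits every index exactly once.
fn build_cycle(n: usize, stride: usize) -> Vec<usize> {
    (0..n).map(|i| (i + stride) % n).collect()
}

// Dependent pointer chase: the next address is only known after the current
// load completes, so the hardware prefetcher cannot run ahead of the program.
fn chase(next: &[usize]) -> u64 {
    let mut idx = 0usize;
    let mut sum = 0u64;
    for _ in 0..next.len() {
        idx = next[idx];
        sum = sum.wrapping_add(idx as u64);
    }
    sum
}

fn main() {
    const N: usize = 1 << 22; // ~32 MiB of u64s, well beyond any cache level

    // Stride-1 walk: trivial for the hardware prefetcher to predict.
    let data: Vec<u64> = (0..N as u64).collect();
    let t = Instant::now();
    let seq_sum: u64 = data.iter().sum();
    let seq_time = t.elapsed();

    // 3_000_017 is odd, hence coprime with the power-of-two N.
    let next = build_cycle(N, 3_000_017);
    let t = Instant::now();
    let chase_sum = chase(&next);
    let chase_time = t.elapsed();

    // Both walks sum every index exactly once, so the results agree.
    assert_eq!(seq_sum, chase_sum);
    println!("sequential: {:?}, chase: {:?}", seq_time, chase_time);
}
```

The two loops compute the same sum; only the access pattern differs, which isolates the prefetcher's contribution.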

3. Cache Associativity

The article explains the three mapping schemes: fully‑associative, direct‑mapped, and n‑way set‑associative. Using an AMD Ryzen 7 4700U with a 4 × 32 KB L1 cache (8‑way set associative, 64‑byte line), benchmarks reveal performance spikes when the iteration step size aligns with the number of sets, causing more cache‑set conflicts.
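A hedged sketch of the conflict effect (not the article's benchmark): touch a buffer much larger than L1 at a fixed byte stride. On the geometry the article assumes (32 KB L1d, 8-way, 64-byte lines, hence 64 sets and a 4096-byte set stride), a 4096-byte step maps every access to the same set, so at most 8 lines stay resident; nudging the step to 4160 bytes spreads accesses across sets again. The specific step values are assumptions tied to that geometry:

```rust
use std::time::Instant;

// Touch the buffer `accesses` times at a fixed byte stride, wrapping at the
// end. Returns a checksum so the work cannot be optimized away.
fn strided_touch(buf: &mut [u8], step: usize, accesses: usize) -> u64 {
    let mut sum = 0u64;
    let mut i = 0usize;
    for _ in 0..accesses {
        buf[i] = buf[i].wrapping_add(1);
        sum += buf[i] as u64;
        i = (i + step) % buf.len();
    }
    sum
}

fn main() {
    let mut buf = vec![0u8; 1 << 24]; // 16 MiB, far larger than L1
    // 64: one access per line; 4096: every access lands in the same L1 set
    // (on the assumed geometry); 4160: same distance plus one line, which
    // spreads accesses across sets again.
    for step in [64, 4096, 4160] {
        let t = Instant::now();
        let s = strided_touch(&mut buf, step, 1 << 22);
        println!("step {:5}: {:?} (checksum {})", step, t.elapsed(), s);
    }
}
```

Holding the number of accesses constant keeps the comparison fair across step sizes; only the set-mapping of the addresses changes.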

4. False Sharing

Four threads increment separate atomic variables. Without proper alignment, the variables share a cache line, leading to severe false‑sharing penalties. Aligning each variable to a 64‑byte boundary ( #[repr(align(64))] ) eliminates the contention and nearly doubles performance.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn increase(v: &AtomicUsize) {
    for _ in 0..10000 {
        v.fetch_add(1, Ordering::Relaxed);
    }
}

pub fn share() {
    let v = AtomicUsize::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| increase(&v));
        let t2 = s.spawn(|| increase(&v));
        let t3 = s.spawn(|| increase(&v));
        let t4 = s.spawn(|| increase(&v));
        t1.join().unwrap();
        t2.join().unwrap();
        t3.join().unwrap();
        t4.join().unwrap();
    });
}

#[repr(align(64))]
struct AlignAtomicUsize { v: AtomicUsize }

impl AlignAtomicUsize {
    fn new(v: usize) -> Self {
        AlignAtomicUsize { v: AtomicUsize::new(v) }
    }
}

pub fn non_share() {
    let a = AlignAtomicUsize::new(0);
    let b = AlignAtomicUsize::new(0);
    let c = AlignAtomicUsize::new(0);
    let d = AlignAtomicUsize::new(0);
    thread::scope(|s| {
        let t1 = s.spawn(|| increase(&a.v));
        let t2 = s.spawn(|| increase(&b.v));
        let t3 = s.spawn(|| increase(&c.v));
        let t4 = s.spawn(|| increase(&d.v));
        t1.join().unwrap();
        t2.join().unwrap();
        t3.join().unwrap();
        t4.join().unwrap();
    });
}
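The effect of the attribute can be sanity-checked directly; a minimal std-only check:

```rust
use std::sync::atomic::AtomicUsize;

#[repr(align(64))]
struct AlignAtomicUsize { v: AtomicUsize }

fn main() {
    // #[repr(align(64))] rounds both alignment and size up to 64 bytes, so
    // two neighboring instances can never share a 64-byte cache line.
    assert_eq!(std::mem::align_of::<AlignAtomicUsize>(), 64);
    assert_eq!(std::mem::size_of::<AlignAtomicUsize>(), 64);
    let a = AlignAtomicUsize { v: AtomicUsize::new(0) };
    println!("align = 64, size = 64, v = {:?}", a.v);
}
```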

5. Pipeline and Branch Prediction

Modern CPUs execute instructions in a pipeline. When a branch prediction fails, the pipeline must be flushed, causing a noticeable slowdown. A benchmark that shuffles a vector of trait objects demonstrates a 2.67× slowdown due to frequent mispredictions.

// Foo and Bar implement a common trait (assumed here to be `Get`, exposing a
// `get` method), so each call through the Box is a dynamic dispatch.
let mut arr: Vec<Box<dyn Get>> = vec![];
for _ in 0..10000 { arr.push(Box::new(Foo::new())); }
for _ in 0..10000 { arr.push(Box::new(Bar::new())); }
let sorted = arr.iter().filter(|v| v.get() > 0).count();
arr.shuffle(&mut rand::thread_rng());
let unsorted = arr.iter().filter(|v| v.get() > 0).count();
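A self-contained version of the same experiment can be sketched without external crates. The trait `Get` and the return values of `Foo` and `Bar` are assumptions (the article does not show them), and a xorshift-based Fisher-Yates shuffle stands in for the `rand` crate:

```rust
use std::time::Instant;

// Assumed shapes: `Get`, `Foo`, and `Bar` are stand-ins for the article's
// unshown types. Foo reports a positive value, Bar a negative one.
trait Get { fn get(&self) -> i32; }
struct Foo;
struct Bar;
impl Get for Foo { fn get(&self) -> i32 { 1 } }
impl Get for Bar { fn get(&self) -> i32 { -1 } }

// Fisher-Yates with a xorshift generator, so no `rand` crate is needed.
fn shuffle<T>(v: &mut [T], mut seed: u64) {
    for i in (1..v.len()).rev() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        v.swap(i, (seed as usize) % (i + 1));
    }
}

fn count_positive(arr: &[Box<dyn Get>]) -> usize {
    arr.iter().filter(|v| v.get() > 0).count()
}

fn main() {
    let mut arr: Vec<Box<dyn Get>> = Vec::new();
    for _ in 0..1_000_000 { arr.push(Box::new(Foo)); }
    for _ in 0..1_000_000 { arr.push(Box::new(Bar)); }

    // Grouped layout: the vtable target and the `get() > 0` branch are
    // stable within each half, so predictions almost always hit.
    let t = Instant::now();
    let grouped = count_positive(&arr);
    let grouped_time = t.elapsed();

    // Shuffled layout: both outcomes are effectively random per element.
    shuffle(&mut arr, 0x9E3779B97F4A7C15);
    let t = Instant::now();
    let shuffled = count_positive(&arr);
    let shuffled_time = t.elapsed();

    assert_eq!(grouped, shuffled); // same count, typically very different times
    println!("grouped: {:?}, shuffled: {:?}", grouped_time, shuffled_time);
}
```

Since both passes compute the same count over the same elements, any timing gap is attributable to prediction behavior, not to the work itself.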

6. Data Dependency

Two functions, dependent and independent , illustrate how loop‑carried dependencies prevent vectorization and reduce pipeline efficiency. Benchmarks show the dependent version is roughly three times slower, and the independent version can be auto‑vectorized by the compiler.

pub fn dependent(a: &mut Vec<usize>, b: &mut Vec<usize>, c: &Vec<usize>) {
    for i in 0..=99998 {
        a[i] += b[i];
        b[i + 1] += c[i];
    }
    a[99999] += b[99999];
}

pub fn independent(a: &mut Vec<usize>, b: &mut Vec<usize>, c: &Vec<usize>) {
    a[0] += b[0];
    for i in 0..=99998 {
        b[i + 1] += c[i];
        a[i + 1] += b[i + 1];
    }
}

Overall, the article demonstrates how low‑level hardware characteristics—cache layout, prefetching, associativity, false sharing, pipeline behavior, and data dependencies—affect the performance of Rust programs and provides practical techniques (proper data layout, alignment, avoiding dependent loops) to achieve near‑optimal speed.

Tags: performance optimization, Rust, pipeline, CPU cache, data dependency, false sharing, prefetching
Written by Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.