Boost Go Performance: Harness CPU Cache Locality with Practical Tips

This article explains the CPU cache locality principle, shows how to restructure Go data access patterns—including data structures, field ordering, memory allocation, and false sharing avoidance—and demonstrates measurable performance gains with a matrix‑multiplication benchmark.

Ops Development & AI Practice

May 16, 2024

Modern computer systems rely on CPU caches to improve program performance. The cache works on the locality principle: programs tend to access the same data or instructions repeatedly within a short time span (temporal locality) and to access neighboring data (spatial locality). Leveraging these properties can significantly reduce memory latency.

What Is the Locality Principle?

Temporal locality: If a datum is accessed once, it is likely to be accessed again soon.

Spatial locality: If a datum is accessed, nearby data are also likely to be accessed.

Programs designed to exploit both forms of locality let the CPU serve most accesses from cache instead of main memory, which lowers access delays and speeds up execution.
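As a rough illustration of spatial locality, the sketch below sums the same two-dimensional array in two traversal orders. The names `size`, `fill`, `sumRowMajor`, and `sumColumnMajor` are hypothetical and chosen only for this example. Go lays out a fixed-size 2D array row by row, so the row-major loop touches neighboring addresses while the column-major loop jumps a full row between accesses.

```go
package main

import "fmt"

// size is an arbitrary dimension picked for this sketch.
const size = 1024

// fill stores i+j in every cell so both traversals have a known total.
func fill(g *[size][size]int32) {
	for i := range g {
		for j := range g[i] {
			g[i][j] = int32(i + j)
		}
	}
}

// sumRowMajor visits elements in the order they are laid out in memory.
func sumRowMajor(g *[size][size]int32) int64 {
	var total int64
	for i := 0; i < size; i++ {
		for j := 0; j < size; j++ {
			total += int64(g[i][j])
		}
	}
	return total
}

// sumColumnMajor moves size*4 bytes between consecutive accesses,
// so neighboring iterations rarely share a cache line.
func sumColumnMajor(g *[size][size]int32) int64 {
	var total int64
	for j := 0; j < size; j++ {
		for i := 0; i < size; i++ {
			total += int64(g[i][j])
		}
	}
	return total
}

func main() {
	g := new([size][size]int32)
	fill(g)
	// Both orders compute the same sum; only the memory access pattern differs.
	fmt.Println(sumRowMajor(g) == sumColumnMajor(g)) // prints "true"
}
```

Both functions return the same value, so any timing difference you measure between them comes from the access pattern rather than from the arithmetic.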

Optimizing Data Access Patterns in Go

Several strategies can improve cache utilization in Go programs:

1. Data‑Structure Optimization

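Prefer contiguous structures such as arrays and slices over linked lists. Their elements sit back to back in memory, so one cache line fetch brings in several neighbors, whereas every linked-list node is a separate heap object reached through a pointer.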

// Not recommended: linked list
type Node struct {
    Value int
    Next  *Node
}

// Recommended: slice backed by a contiguous array
func sumArray(arr []int) int {
    sum := 0
    for _, v := range arr {
        sum += v
    }
    return sum
}
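To make the comparison concrete, here is a minimal sketch that sums the same values through a linked list and through a slice. `buildList`, `sumList`, and `sumSlice` are hypothetical helpers introduced for this example, not part of the original article.

```go
package main

import "fmt"

// Node is one element of a singly linked list; each node is its own heap object.
type Node struct {
	Value int
	Next  *Node
}

// buildList creates a linked list holding the given values in order.
func buildList(values []int) *Node {
	var head *Node
	for i := len(values) - 1; i >= 0; i-- {
		head = &Node{Value: values[i], Next: head}
	}
	return head
}

// sumList follows Next pointers, so each step may land on a different cache line.
func sumList(head *Node) int {
	sum := 0
	for n := head; n != nil; n = n.Next {
		sum += n.Value
	}
	return sum
}

// sumSlice walks a contiguous backing array from start to end.
func sumSlice(values []int) int {
	sum := 0
	for _, v := range values {
		sum += v
	}
	return sum
}

func main() {
	values := []int{1, 2, 3, 4, 5}
	fmt.Println(sumList(buildList(values)), sumSlice(values)) // prints "15 15"
}
```

With only five elements the two versions are indistinguishable in practice; the layout difference tends to matter once the data no longer fits in cache.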

2. Field Ordering (Data Layout)

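Place fields that are frequently accessed together next to each other in a struct, so that they are more likely to share a cache line. In the example below the reordering does not change the size of the struct; it only moves the two integer fields side by side.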

// Not recommended
type Person struct {
    ID      int
    Name    string
    Age     int
    Address string
    Phone   string
    Email   string
}

// Recommended: ID and Age (assumed here to be read together) are adjacent
type Person struct {
    ID      int
    Age     int
    Name    string
    Phone   string
    Email   string
    Address string
}
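One way to check what a reordering actually does is to ask the compiler for field offsets. The sketch below copies the two layouts under the hypothetical names `personOriginal` and `personReordered` (so both can live in one file) and reports where `Age` ends up. It assumes nothing about the platform beyond what `unsafe.Offsetof` and `unsafe.Sizeof` report.

```go
package main

import (
	"fmt"
	"unsafe"
)

// personOriginal mirrors the first layout: Name sits between ID and Age.
type personOriginal struct {
	ID      int
	Name    string
	Age     int
	Address string
	Phone   string
	Email   string
}

// personReordered mirrors the second layout: ID and Age are adjacent.
type personReordered struct {
	ID      int
	Age     int
	Name    string
	Phone   string
	Email   string
	Address string
}

// ageOffsetOriginal returns the byte offset of Age in the first layout.
func ageOffsetOriginal() uintptr {
	return unsafe.Offsetof(personOriginal{}.Age)
}

// ageOffsetReordered returns the byte offset of Age in the second layout.
func ageOffsetReordered() uintptr {
	return unsafe.Offsetof(personReordered{}.Age)
}

func main() {
	fmt.Println("Age offset, original: ", ageOffsetOriginal())
	fmt.Println("Age offset, reordered:", ageOffsetReordered())
	// The two structs hold the same fields, so their sizes match.
	fmt.Println("sizes:", unsafe.Sizeof(personOriginal{}), unsafe.Sizeof(personReordered{}))
}
```

On 64-bit platforms, where an `int` takes 8 bytes and a string header takes 16, `Age` moves from offset 24 to offset 8 while the struct size stays the same. The benefit therefore depends on which fields the hot code path really reads together.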

3. Reduce Memory Allocations

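Frequent heap allocations scatter short‑lived objects across memory and add garbage‑collector work, both of which can increase cache misses. Pre‑allocate slices or use object pools to minimize allocation overhead. Go's escape analysis may already keep a small buffer that never leaves the function on the stack, so confirm with a benchmark that the allocation is actually the bottleneck.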

// Not recommended: allocate inside the loop
func process(data []int) {
    for _, v := range data {
        temp := make([]int, 100)
        temp[0] = v
    }
}

// Recommended: allocate once
func process(data []int) {
    temp := make([]int, 100)
    for _, v := range data {
        temp[0] = v
    }
}
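For the object-pool option mentioned above, the standard library provides `sync.Pool`. The following is a minimal sketch, not a tuned implementation: `bufPool` and `sumSquares` are hypothetical names, and the 100-element buffer size simply mirrors the example above.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable scratch buffers. Storing *[]int rather than
// []int avoids allocating a new slice header each time a buffer is returned.
var bufPool = sync.Pool{
	New: func() any {
		buf := make([]int, 100)
		return &buf
	},
}

// sumSquares is a hypothetical workload that needs a scratch buffer per call.
// It processes data in chunks that fit the pooled buffer.
func sumSquares(data []int) int {
	bufPtr := bufPool.Get().(*[]int)
	defer bufPool.Put(bufPtr)
	buf := *bufPtr

	total := 0
	for start := 0; start < len(data); start += len(buf) {
		n := copy(buf, data[start:]) // copies at most len(buf) elements
		for i := 0; i < n; i++ {
			buf[i] *= buf[i]
			total += buf[i]
		}
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3})) // prints "14"
}
```

A pool pays off mainly when many goroutines need short-lived buffers at a high rate. For a single loop, hoisting the allocation out of the loop as shown above is simpler.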

4. Avoid False Sharing

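False sharing occurs when goroutines running on different cores write to different variables that reside on the same cache line: each write invalidates the other core's copy of the line, even though the variables are logically independent. Padding keeps such variables on separate cache lines. The example below assumes 64‑byte cache lines and an 8‑byte int, so Value2 starts exactly 64 bytes after Value1.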

type Data struct {
    Value1 int
    _      [56]byte // padding to avoid false sharing
    Value2 int
}
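The padding idea extends naturally to per-goroutine counters. In this sketch each worker updates only its own padded slot. The 64-byte `cacheLineSize` is an assumption about the target CPU, and `paddedCounter` and `countInParallel` are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cacheLineSize is an assumption: 64 bytes is common, but not universal.
const cacheLineSize = 64

// paddedCounter occupies a full (assumed) cache line, so neighboring
// counters in a slice never share one.
type paddedCounter struct {
	value uint64
	_     [cacheLineSize - 8]byte
}

// countInParallel starts one goroutine per counter; each goroutine
// increments only its own slot, then the final values are returned.
func countInParallel(workers, iterations int) []uint64 {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(c *paddedCounter) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				// The atomic add forces a real memory write on every iteration.
				atomic.AddUint64(&c.value, 1)
			}
		}(&counters[w])
	}
	wg.Wait()

	out := make([]uint64, workers)
	for i := range counters {
		out[i] = counters[i].value
	}
	return out
}

func main() {
	fmt.Println(countInParallel(4, 1000)) // prints "[1000 1000 1000 1000]"
}
```

Removing the padding field leaves the results unchanged but packs eight counters into one 64-byte line, which is the situation where false sharing can appear under heavy write load.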

Practical Case Study: Matrix Multiplication

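Go stores each row of a [][]int contiguously, so iterating over rows in the outer loop keeps consecutive accesses close together in memory. The two implementations below differ only in the order of the two outer loops; the inner loop over k is identical. The first version finishes one row of the result, reusing row a[i], before moving on, while the second switches to a different row of a and result on every iteration of its middle loop.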

Cache‑Friendly Implementation

func multiplyMatrices(a, b [][]int) [][]int {
    n := len(a)
    result := make([][]int, n)
    for i := range result {
        result[i] = make([]int, n)
    }
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            sum := 0
            for k := 0; k < n; k++ {
                sum += a[i][k] * b[k][j]
            }
            result[i][j] = sum
        }
    }
    return result
}

Non‑Cache‑Friendly Implementation

func multiplyMatricesNoCache(a, b [][]int) [][]int {
    n := len(a)
    result := make([][]int, n)
    for i := range result {
        result[i] = make([]int, n)
    }
    for j := 0; j < n; j++ {
        for i := 0; i < n; i++ {
            sum := 0
            for k := 0; k < n; k++ {
                sum += a[i][k] * b[k][j]
            }
            result[i][j] = sum
        }
    }
    return result
}
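Both functions above read `b[k][j]` down a column in their inner loop. A commonly used alternative is to reorder the loops to i, k, j so that the inner loop walks along a row of `b` and a row of the result. The sketch below shows that variant under the hypothetical name `multiplyMatricesIKJ`. It is not part of the original benchmark, so no timing is claimed for it here, and like the functions above it assumes square n×n inputs.

```go
package main

import "fmt"

// multiplyMatricesIKJ multiplies two square matrices using i-k-j loop order.
// The inner loop reads b[k] and writes result[i] sequentially.
func multiplyMatricesIKJ(a, b [][]int) [][]int {
	n := len(a)
	result := make([][]int, n)
	for i := range result {
		result[i] = make([]int, n)
	}
	for i := 0; i < n; i++ {
		rowOut := result[i]
		for k := 0; k < n; k++ {
			aik := a[i][k] // reused for the whole inner loop
			rowB := b[k]
			for j := 0; j < n; j++ {
				rowOut[j] += aik * rowB[j]
			}
		}
	}
	return result
}

func main() {
	a := [][]int{{1, 2}, {3, 4}}
	b := [][]int{{5, 6}, {7, 8}}
	fmt.Println(multiplyMatricesIKJ(a, b)) // prints "[[19 22] [43 50]]"
}
```

To compare it with the two versions above, add a third benchmark function alongside them and measure on your own hardware.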

Benchmark Setup

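The following benchmark generates random 100×100 matrices and measures both implementations. Save it in a file whose name ends in _test.go, in the same package as the two functions, so that go test picks it up.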

package main

import (
    "math/rand"
    "testing"
)

func generateRandomMatrix(n int) [][]int {
    matrix := make([][]int, n)
    for i := range matrix {
        matrix[i] = make([]int, n)
        for j := range matrix[i] {
            matrix[i][j] = rand.Intn(100)
        }
    }
    return matrix
}

func BenchmarkMultiplyMatrices(b *testing.B) {
    n := 100
    a := generateRandomMatrix(n)
    bMatrix := generateRandomMatrix(n)
    b.ResetTimer() // exclude matrix generation from the measurement
    for i := 0; i < b.N; i++ {
        multiplyMatrices(a, bMatrix)
    }
}

func BenchmarkMultiplyMatricesNoCache(b *testing.B) {
    n := 100
    a := generateRandomMatrix(n)
    bMatrix := generateRandomMatrix(n)
    b.ResetTimer() // exclude matrix generation from the measurement
    for i := 0; i < b.N; i++ {
        multiplyMatricesNoCache(a, bMatrix)
    }
}

Run the benchmarks with:

go test -bench=BenchmarkMultiplyMatrices

Results

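In the reported run, the cache‑friendly version (BenchmarkMultiplyMatrices) averages about 1.04 ms per operation, while the non‑cache‑friendly version (BenchmarkMultiplyMatricesNoCache) averages about 1.31 ms, roughly 26% longer. The only difference between the two is the order of the outer loops, so the gap comes from the memory access pattern.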

Conclusion

Understanding and applying CPU cache locality principles can markedly improve Go program performance. Key tactics include choosing cache‑friendly data structures, arranging struct fields for better spatial locality, minimizing allocations, and preventing false sharing. Tailoring these strategies to real‑world code yields faster, more efficient applications.
