Boost Go Performance: Harness CPU Cache Locality with Practical Tips
This article explains the CPU cache locality principle, shows how to restructure Go data access patterns—including data structures, field ordering, memory allocation, and false sharing avoidance—and demonstrates measurable performance gains with a matrix‑multiplication benchmark.
Modern computer systems rely on CPU caches to improve program performance. The cache works on the locality principle: programs tend to access the same data or instructions repeatedly within a short time span (temporal locality) and to access neighboring data (spatial locality). Leveraging these properties can significantly reduce memory latency.
What Is the Locality Principle?
Temporal locality: If a datum is accessed once, it is likely to be accessed again soon.
Spatial locality: If a datum is accessed, nearby data are also likely to be accessed.
Programs designed to exploit both forms of locality let the CPU cache serve most memory accesses, which lowers access latency and speeds up execution.
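To make spatial locality concrete, here is a minimal sketch that sums the same two-dimensional array in two different orders. The names `sumRowMajor` and `sumColMajor` and the size `n` are illustrative assumptions, not part of the original article. Both functions return the same value; only the memory access pattern differs.

```go
package main

import "fmt"

// n is an arbitrary illustrative size; the grid holds n*n int32 values.
const n = 1024

// sumRowMajor visits elements in the order they are laid out in memory,
// so consecutive reads usually hit the cache line that was just fetched.
func sumRowMajor(grid *[n][n]int32) int64 {
	var sum int64
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += int64(grid[i][j])
		}
	}
	return sum
}

// sumColMajor walks down columns, jumping a whole row ahead on every
// step, so consecutive reads land on different cache lines.
func sumColMajor(grid *[n][n]int32) int64 {
	var sum int64
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum += int64(grid[i][j])
		}
	}
	return sum
}

func main() {
	grid := new([n][n]int32)
	for i := range grid {
		for j := range grid[i] {
			grid[i][j] = int32(i + j)
		}
	}
	// Same elements, same total; only the traversal order differs.
	fmt.Println(sumRowMajor(grid) == sumColMajor(grid)) // prints true
}
```

Timing the two functions with the `testing` package is one way to see how much the traversal order matters on a given machine.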
Optimizing Data Access Patterns in Go
Several strategies can improve cache utilization in Go programs:
1. Data‑Structure Optimization
Prefer contiguous structures such as arrays over linked lists, because arrays provide better spatial locality.
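As a rough illustration of the difference, the sketch below sums the same values once through a linked list and once through a slice. The lowercase names `node`, `buildList`, `sumList`, and `sumSlice` are hypothetical helpers introduced only for this example.

```go
package main

import "fmt"

// node is a hypothetical singly linked list element. Each node is a
// separate heap object, so neighbours in the list need not be
// neighbours in memory.
type node struct {
	value int
	next  *node
}

// buildList creates a list holding the given values in the same order.
func buildList(values []int) *node {
	var head *node
	for i := len(values) - 1; i >= 0; i-- {
		head = &node{value: values[i], next: head}
	}
	return head
}

// sumList follows one pointer per element.
func sumList(head *node) int {
	sum := 0
	for n := head; n != nil; n = n.next {
		sum += n.value
	}
	return sum
}

// sumSlice reads a contiguous block of memory from start to end.
func sumSlice(values []int) int {
	sum := 0
	for _, v := range values {
		sum += v
	}
	return sum
}

func main() {
	values := []int{1, 2, 3, 4}
	fmt.Println(sumList(buildList(values)), sumSlice(values)) // prints 10 10
}
```

Both functions do the same arithmetic; the slice version simply gives the hardware a predictable, sequential access pattern.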
// Not recommended: linked list
type Node struct {
	Value int
	Next  *Node
}
// Recommended: slice backed by a contiguous array
func sumArray(arr []int) int {
	sum := 0
	for _, v := range arr {
		sum += v
	}
	return sum
}
2. Field Ordering (Data Layout)
Place fields that are frequently accessed together next to each other in a struct, so that they are more likely to fall on the same cache line.
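One way to check a layout is to ask the compiler for field offsets. The sketch below is an assumption-laden illustration: the types `scattered` and `grouped` are hypothetical, and it assumes `ID` and `Age` are the fields read together. It measures how many bytes the pair spans in each layout; the exact numbers depend on the platform.

```go
package main

import (
	"fmt"
	"unsafe"
)

// scattered is a hypothetical layout where a string sits between the
// two fields assumed to be read together (ID and Age).
type scattered struct {
	ID      int
	Name    string
	Age     int
	Address string
}

// grouped keeps ID and Age side by side.
type grouped struct {
	ID      int
	Age     int
	Name    string
	Address string
}

// scatteredHotSpan returns the distance in bytes from the start of ID
// to the end of Age in the scattered layout.
func scatteredHotSpan() uintptr {
	var s scattered
	return unsafe.Offsetof(s.Age) + unsafe.Sizeof(s.Age) - unsafe.Offsetof(s.ID)
}

// groupedHotSpan does the same for the grouped layout.
func groupedHotSpan() uintptr {
	var g grouped
	return unsafe.Offsetof(g.Age) + unsafe.Sizeof(g.Age) - unsafe.Offsetof(g.ID)
}

func main() {
	fmt.Println("scattered hot span:", scatteredHotSpan(), "bytes")
	fmt.Println("grouped hot span:", groupedHotSpan(), "bytes")
}
```

A smaller span means the two fields are less likely to straddle a cache line boundary, which matters most when many such structs are scanned in a loop.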
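// Not recommended: ID and Age (assumed here to be read together) are separated by Name
type PersonScattered struct {
	ID      int
	Name    string
	Age     int
	Address string
	Phone   string
	Email   string
}
// Recommended: fields that are read together sit side by side
type Person struct {
	ID      int
	Age     int
	Name    string
	Phone   string
	Email   string
	Address string
}
3. Reduce Memory Allocations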
Frequent allocations add allocator and garbage-collector work and tend to scatter data across the heap, which increases cache misses. Pre-allocate slices or use object pools to minimize this overhead.
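For the object-pool option, the standard library's `sync.Pool` is one possible tool. The following is a minimal sketch, not a tuned implementation; `bufPool` and `formatValue` are hypothetical names chosen for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that formatValue does not have
// to allocate a fresh bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// formatValue renders v using a pooled buffer and returns the buffer
// to the pool when done.
func formatValue(v int) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold earlier contents
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "value=%d", v)
	return buf.String() // String copies the bytes, so reuse is safe
}

func main() {
	for _, v := range []int{1, 2, 3} {
		fmt.Println(formatValue(v))
	}
}
```

A pool pays off mainly when the objects are large or created very often; for small, short-lived values a plain local variable is usually simpler.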
// Not recommended: allocate inside the loop
func processPerIteration(data []int) {
	for _, v := range data {
		temp := make([]int, 100) // a fresh buffer on every iteration
		temp[0] = v
	}
}
// Recommended: allocate once and reuse the buffer
func process(data []int) {
	temp := make([]int, 100)
	for _, v := range data {
		temp[0] = v
	}
}
4. Avoid False Sharing
False sharing occurs when goroutines running on different cores write to different variables that reside on the same cache line, so each write invalidates that line in the other core's cache. Padding places the variables on separate cache lines.
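The sketch below shows one way padding might be applied to per-goroutine counters. It assumes a 64-byte cache line, which is common but not universal; `cacheLineSize`, `paddedCounter`, and `countParallel` are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cacheLineSize is an assumption: 64 bytes is typical on x86-64, but
// other CPUs use different sizes.
const cacheLineSize = 64

// paddedCounter fills a whole assumed cache line, so adjacent counters
// in a slice never share one.
type paddedCounter struct {
	n int64
	_ [cacheLineSize - 8]byte
}

// countParallel starts one goroutine per counter. Each goroutine
// increments only its own counter.
func countParallel(workers, iterations int) []int64 {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(c *paddedCounter) {
			defer wg.Done()
			for i := 0; i < iterations; i++ {
				atomic.AddInt64(&c.n, 1)
			}
		}(&counters[w])
	}
	wg.Wait()

	totals := make([]int64, workers)
	for i := range counters {
		totals[i] = counters[i].n
	}
	return totals
}

func main() {
	fmt.Println(countParallel(4, 1000)) // prints [1000 1000 1000 1000]
}
```

Removing the padding field does not change the result, only how often the cores contend for the same cache line, so the effect shows up in benchmarks rather than in output.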
type Data struct {
	Value1 int
	_      [56]byte // padding to avoid false sharing; assumes 8-byte int and 64-byte cache lines
	Value2 int
}
Practical Case Study: Matrix Multiplication
A cache-friendly matrix multiplication iterates over rows in its outer loop, so the current row of a stays in cache and each row of the result is written to adjacent memory locations.
Cache‑Friendly Implementation
func multiplyMatrices(a, b [][]int) [][]int {
	n := len(a)
	result := make([][]int, n)
	for i := range result {
		result[i] = make([]int, n)
	}
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum := 0
			for k := 0; k < n; k++ {
				sum += a[i][k] * b[k][j]
			}
			result[i][j] = sum
		}
	}
	return result
}
Non-Cache-Friendly Implementation
func multiplyMatricesNoCache(a, b [][]int) [][]int {
	n := len(a)
	result := make([][]int, n)
	for i := range result {
		result[i] = make([]int, n)
	}
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum := 0
			for k := 0; k < n; k++ {
				sum += a[i][k] * b[k][j]
			}
			result[i][j] = sum
		}
	}
	return result
}
Benchmark Setup
The following benchmark generates random matrices and measures both implementations.
// Save in a file ending in _test.go, in the same package as the two implementations.
package main
import (
	"math/rand"
	"testing"
)
func generateRandomMatrix(n int) [][]int {
	matrix := make([][]int, n)
	for i := range matrix {
		matrix[i] = make([]int, n)
		for j := range matrix[i] {
			matrix[i][j] = rand.Intn(100)
		}
	}
	return matrix
}
func BenchmarkMultiplyMatrices(b *testing.B) {
	n := 100
	a := generateRandomMatrix(n)
	bMatrix := generateRandomMatrix(n)
	for i := 0; i < b.N; i++ {
		multiplyMatrices(a, bMatrix)
	}
}
func BenchmarkMultiplyMatricesNoCache(b *testing.B) {
	n := 100
	a := generateRandomMatrix(n)
	bMatrix := generateRandomMatrix(n)
	for i := 0; i < b.N; i++ {
		multiplyMatricesNoCache(a, bMatrix)
	}
}
Run the benchmarks with (the -bench flag takes a regular expression, and this one matches both benchmark names):
go test -bench=MultiplyMatrices
Results
The cache-friendly version (BenchmarkMultiplyMatrices) averages about 1.04 ms per operation, while the non-cache-friendly version (BenchmarkMultiplyMatricesNoCache) averages about 1.31 ms, roughly 26% slower, which shows a measurable advantage when locality is respected.
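A further variation, not measured in the original benchmark, is worth sketching. In both implementations above the innermost loop reads b[k][j] with k changing, which walks down a column of b. Reordering the loops to i, k, j makes the innermost loop scan a row of b and a row of the result instead. The function name `multiplyMatricesIKJ` is a hypothetical addition, and no timing is claimed for it here.

```go
package main

import "fmt"

// multiplyMatricesIKJ multiplies two square matrices using the i-k-j
// loop order, so the innermost loop moves along rows of both b and
// the result.
func multiplyMatricesIKJ(a, b [][]int) [][]int {
	n := len(a)
	result := make([][]int, n)
	for i := range result {
		result[i] = make([]int, n)
	}
	for i := 0; i < n; i++ {
		rowOut := result[i]
		for k := 0; k < n; k++ {
			aik := a[i][k] // reused for the whole inner loop
			rowB := b[k]
			for j := 0; j < n; j++ {
				rowOut[j] += aik * rowB[j]
			}
		}
	}
	return result
}

func main() {
	a := [][]int{{1, 2}, {3, 4}}
	b := [][]int{{5, 6}, {7, 8}}
	fmt.Println(multiplyMatricesIKJ(a, b)) // prints [[19 22] [43 50]]
}
```

Adding a third benchmark function for this variant alongside the two existing ones would show whether the reordering helps on a particular machine and matrix size.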
Conclusion
Understanding and applying CPU cache locality principles can markedly improve Go program performance. Key tactics include choosing cache‑friendly data structures, arranging struct fields for better spatial locality, minimizing allocations, and preventing false sharing. Tailoring these strategies to real‑world code yields faster, more efficient applications.
Ops Development & AI Practice
DevSecOps engineer sharing experiences and insights on AI, Web3, and Claude code development. Aims to help solve technical challenges, improve development efficiency, and grow through community interaction. Feel free to comment and discuss.