Understanding False Sharing and Cache Padding in Go
This article explains the concept of false sharing caused by CPU cache line interactions, demonstrates how cache padding can mitigate the performance penalty, and provides Go benchmark code and results to illustrate the impact on multi‑core concurrency.
Before discussing false sharing, it helps to briefly review how the CPU cache works. The smallest unit in a CPU cache is a cache line (typically 64 bytes), so when a core reads a variable from memory it also loads the variables stored next to it.
When core 1 reads variable a, it also brings the neighboring variable b into its cache line, thanks to the principle of spatial locality. If two variables used by different cores happen to reside on the same cache line, an update to one variable forces the other core to invalidate its copy of the line, even though the second variable was never modified.
This phenomenon is called false sharing: an update by one core forces other cores to reload the entire cache line, which hurts performance badly because cache reads are far faster than memory reads.
A common solution is cache padding: inserting unused filler fields between frequently updated variables so that each one occupies its own cache line, preventing an update by one core from invalidating the line another core is using.
Below is a Go example illustrating false sharing. The first struct, NoPad, places three uint64 fields consecutively, while the second struct, Pad, inserts padding arrays _p1, _p2, and _p3 (each [8]uint64, i.e. 64 bytes) to separate the fields:
type NoPad struct {
	a uint64
	b uint64
	c uint64
}

func (myatomic *NoPad) IncreaseAllEles() {
	atomic.AddUint64(&myatomic.a, 1)
	atomic.AddUint64(&myatomic.b, 1)
	atomic.AddUint64(&myatomic.c, 1)
}

type Pad struct {
	a   uint64
	_p1 [8]uint64
	b   uint64
	_p2 [8]uint64
	c   uint64
	_p3 [8]uint64
}

func (myatomic *Pad) IncreaseAllEles() {
	atomic.AddUint64(&myatomic.a, 1)
	atomic.AddUint64(&myatomic.b, 1)
	atomic.AddUint64(&myatomic.c, 1)
}

A benchmark driver runs many goroutines that repeatedly call IncreaseAllEles on each struct:
type MyAtomic interface {
	IncreaseAllEles()
}

func testAtomicIncrease(myatomic MyAtomic) {
	paraNum := 1000
	addTimes := 1000
	var wg sync.WaitGroup
	wg.Add(paraNum)
	for i := 0; i < paraNum; i++ {
		go func() {
			for j := 0; j < addTimes; j++ {
				myatomic.IncreaseAllEles()
			}
			wg.Done()
		}()
	}
	wg.Wait()
}
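As a sanity check (a hypothetical addition, not part of the original code), the driver can be verified outside the benchmark harness: after one run, every field should equal paraNum * addTimes = 1,000,000.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type MyAtomic interface {
	IncreaseAllEles()
}

type NoPad struct {
	a uint64
	b uint64
	c uint64
}

func (m *NoPad) IncreaseAllEles() {
	atomic.AddUint64(&m.a, 1)
	atomic.AddUint64(&m.b, 1)
	atomic.AddUint64(&m.c, 1)
}

// 1000 goroutines each increment every field 1000 times.
func testAtomicIncrease(myatomic MyAtomic) {
	paraNum := 1000
	addTimes := 1000
	var wg sync.WaitGroup
	wg.Add(paraNum)
	for i := 0; i < paraNum; i++ {
		go func() {
			for j := 0; j < addTimes; j++ {
				myatomic.IncreaseAllEles()
			}
			wg.Done()
		}()
	}
	wg.Wait()
}

func main() {
	n := &NoPad{}
	testAtomicIncrease(n)
	fmt.Println(n.a, n.b, n.c) // 1000000 1000000 1000000
}
```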
func BenchmarkNoPad(b *testing.B) {
	myatomic := &NoPad{}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		testAtomicIncrease(myatomic)
	}
}

func BenchmarkPad(b *testing.B) {
	myatomic := &Pad{}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		testAtomicIncrease(myatomic)
	}
}

On a 2014 MacBook Air, the original benchmark reported an improvement from 0.07 ns/op without padding to 0.02 ns/op with padding (the original benchmark bodies ran the workload once regardless of b.N, which is why those figures are sub-nanosecond). On a 2022 M2 MacBook Air, however, the author observed the opposite result: the padded version was slower.
A further experiment compares a version without padding to one with explicit cache padding (a _ [8]uint64 filler field) in a parallel benchmark. There the padded version shows a dramatic improvement, from 22.09 ns/op down to 1.075 ns/op, because each goroutine updates a value that resides on its own cache line, eliminating false sharing.
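A minimal sketch of such a parallel benchmark, assuming one counter slot per worker; the slot count, helper names, and the use of testing.Benchmark from main are illustrative, not the author's exact code:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// One counter per goroutine. Without padding, neighboring slots share
// cache lines, so independent updates still invalidate each other.
type slotNoPad struct {
	v uint64
}

// 64 bytes of filler pushes every counter onto its own cache line.
type slotPad struct {
	v uint64
	_ [8]uint64
}

func benchNoPadSlots(b *testing.B) {
	slots := make([]slotNoPad, 64)
	var next uint64
	b.RunParallel(func(pb *testing.PB) {
		// Hand each worker its own slot.
		i := (atomic.AddUint64(&next, 1) - 1) % uint64(len(slots))
		for pb.Next() {
			atomic.AddUint64(&slots[i].v, 1)
		}
	})
}

func benchPadSlots(b *testing.B) {
	slots := make([]slotPad, 64)
	var next uint64
	b.RunParallel(func(pb *testing.PB) {
		i := (atomic.AddUint64(&next, 1) - 1) % uint64(len(slots))
		for pb.Next() {
			atomic.AddUint64(&slots[i].v, 1)
		}
	})
}

func main() {
	// testing.Benchmark runs a benchmark outside `go test`.
	fmt.Println("no pad:", testing.Benchmark(benchNoPadSlots))
	fmt.Println("pad:   ", testing.Benchmark(benchPadSlots))
}
```

On a multi-core machine the padded variant should report markedly lower ns/op; the exact numbers depend on the CPU.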
Before applying cache padding in production, two points deserve attention: (1) know the cache-line size of the target CPU so you can choose an appropriate padding size, and (2) padding increases memory consumption, so benchmark to make sure the performance gain justifies the extra memory.
All example code is available on GitHub, and readers are encouraged to run their own benchmarks to verify the effect of false sharing and cache padding.
Go Programming World
Mobile version of tech blog https://jianghushinian.cn/, covering Golang, Docker, Kubernetes and beyond.