Why Go Channels Slow Down on More CPUs and How to Fix It

A Fastly engineer discovered that increasing CPU cores can degrade Go channel performance due to lock contention, and this article reproduces the benchmarks, explains why goroutine count—not CPU count—is the real culprit, and offers practical optimization techniques.

BirdNest Tech Talk

Aug 28, 2024

During a recent GopherCon UK talk, a Fastly engineer reported that moving a server to more CPU cores unexpectedly reduced Go channel throughput. Profiling showed that the slowdown came from increased contention on the channel lock, which the presenter attributed to the number of OS threads running Go code (the GOMAXPROCS setting) rather than to the underlying CPU count.

Four practical mitigations

Reduce the GOMAXPROCS setting (e.g., use a quarter of the physical cores) so that the number of threads competing for the channel lock stays limited.

Apply a timeout when sending; if a message cannot be placed into the channel within a certain period, drop it or retry later.

Shard the workload across multiple independent channels, often by hashing a key and selecting a channel via modulo.

Introduce a buffer that batches elements before pushing them into the channel, thereby decreasing the number of items the channel must handle at any moment.

The presenter’s headline "Go Channels slow down as CPU count increases" is misleading because the real factor is the number of goroutines contending for the channel lock, not the raw CPU count.

Reproducing the original benchmark

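// Requires the "runtime", "strconv", and "testing" packages.

// BenchmarkChannel runs the same channel benchmark under increasing
// GOMAXPROCS values.
func BenchmarkChannel(b *testing.B) {
    ps := []int{1, 2, 4, 8, 16, 32, 64, 128}
    for _, p := range ps {
        b.Run("P="+strconv.Itoa(p), func(b *testing.B) {
            benchmarkChannel_WithP(b, p)
        })
    }
}

func benchmarkChannel_WithP(b *testing.B, p int) {
    // Set GOMAXPROCS for this run and restore the previous value afterwards.
    n := runtime.GOMAXPROCS(p)
    defer runtime.GOMAXPROCS(n)

    // The buffer holds one slot per worker, so a send never blocks.
    ch := make(chan int, p)
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            ch <- 1
            <-ch
        }
    })
}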

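The benchmark varies GOMAXPROCS and drives the channel through b.RunParallel. Because b.RunParallel starts GOMAXPROCS goroutines by default, raising P also raises the number of goroutines contending for the channel. The result is therefore easily mistaken for a direct CPU‑count effect, especially since the benchmark names mention only "P".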

Extended experiment: varying goroutine count

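// Requires the "runtime", "sync", and "testing" packages.

// BenchmarkChannelPC varies GOMAXPROCS (P) and goroutine count (C) independently.
func BenchmarkChannelPC(b *testing.B) {
    b.Run("P=1, C=1", func(b *testing.B) { benchmarkChannel_WithPC(b, 1, 1) })
    b.Run("P=1, C=128", func(b *testing.B) { benchmarkChannel_WithPC(b, 1, 128) })
    // This case runs a single goroutine (c=1); the recorded output below
    // shows it under the label "C=4".
    b.Run("P=128, C=1", func(b *testing.B) { benchmarkChannel_WithPC(b, 128, 1) })
    b.Run("P=128, C=128", func(b *testing.B) { benchmarkChannel_WithPC(b, 128, 128) })
}

func benchmarkChannel_WithPC(b *testing.B, p, c int) {
    // Set GOMAXPROCS for this run and restore the previous value afterwards.
    n := runtime.GOMAXPROCS(p)
    defer runtime.GOMAXPROCS(n)

    ch := make(chan int, 1024)
    var wg sync.WaitGroup
    wg.Add(c)
    b.ResetTimer()
    // Start c goroutines; each one performs b.N send/receive pairs,
    // so one reported op covers c pairs in total.
    for i := 0; i < c; i++ {
        go func() {
            defer wg.Done()
            for i := 0; i < b.N; i++ {
                ch <- 1
                <-ch
            }
        }()
    }
    wg.Wait()
}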

Running this on an Apple M2 (darwin/arm64) produced the following results:

goos: darwin
goarch: arm64
pkg: github.com/smallnest/study/benchtest
cpu: Apple M2
BenchmarkChannelPC
BenchmarkChannelPC/P=1,_C=1-8          38275527          31.20 ns/op        0 B/op        0 allocs/op
BenchmarkChannelPC/P=1,_C=128-8        298050           3982 ns/op        0 B/op        0 allocs/op
BenchmarkChannelPC/P=128,_C=4-8      38419054          31.11 ns/op        0 B/op        0 allocs/op
BenchmarkChannelPC/P=128,_C=128-8    93849            11442 ns/op        0 B/op        0 allocs/op

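The data show that ns/op rises sharply when the number of goroutines (C) is large, at either GOMAXPROCS value (P). With a single goroutine the benchmark runs at ~31 ns/op (this includes the row labeled P=128, C=4, whose benchmark call passes c=1), while with 128 goroutines it rises to several microseconds. Note that every goroutine runs b.N iterations, so one reported op covers C send/receive pairs rather than one. The author reads these results as evidence that contention on the channel lock is driven by goroutine count.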

Further scaling with fixed P and varying C

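// Requires the "runtime", "strconv", "sync", and "testing" packages.

// BenchmarkChannelC holds GOMAXPROCS at 4 and varies only the goroutine count.
func BenchmarkChannelC(b *testing.B) {
    cs := []int{1, 2, 4, 8, 16, 32, 64, 128}
    for _, c := range cs {
        b.Run("P=4, C="+strconv.Itoa(c), func(b *testing.B) {
            benchmarkChannel_WithPC(b, 4, c)
        })
    }
}

// Updated version of benchmarkChannel_WithPC that also reports throughput;
// it replaces the earlier definition rather than sitting alongside it.
func benchmarkChannel_WithPC(b *testing.B, p, c int) {
    n := runtime.GOMAXPROCS(p)
    defer runtime.GOMAXPROCS(n)

    ch := make(chan int, 1024)
    var wg sync.WaitGroup
    wg.Add(c)
    b.ResetTimer()
    for i := 0; i < c; i++ {
        go func() {
            defer wg.Done()
            for i := 0; i < b.N; i++ {
                ch <- 1
                <-ch
            }
        }()
    }
    wg.Wait()
    // Total send/receive pairs per millisecond. Fractional milliseconds are
    // used because Duration.Milliseconds truncates and can be 0 for very
    // short runs. b.Elapsed requires Go 1.20 or later.
    b.ReportMetric(float64(b.N*c)/(b.Elapsed().Seconds()*1000), "count/ms")
}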

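This experiment holds GOMAXPROCS at 4 and varies only the goroutine count, reporting throughput as count/ms so that runs with different C values can be compared directly. The author reports that it reinforces the conclusion: channel performance degrades as the number of goroutines increases, even though P stays fixed.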

Takeaways

The slowdown is caused by goroutine‑level lock contention, not by the raw number of CPU cores.

Limiting GOMAXPROCS can reduce contention but should be balanced against the need for parallelism.

Sharding channels, buffering batches, and applying timeouts are effective mitigation strategies.

When designing high‑throughput Go services, measure both P (threads) and C (goroutine count) to pinpoint the real bottleneck.

Note that the code in the presenter’s slides differs from the benchmark reproduced here: the speaker varied concurrency with the -test.cpu flag (the -cpu option of go test), while the reproduced code calls runtime.GOMAXPROCS directly.

Finally, the author notes that a single‑goroutine channel is fast, but real workloads require multiple goroutines to process logs or other data in parallel, so the goal is to balance concurrency with channel contention.

References

Grant Stephens, Fastly. "Go Channels slow down with more CPUs". YouTube. https://www.youtube.com/watch?v=VrNmkRAuF9s

"Go Channels 随着 CPU 的增加而变慢" (Go Channels slow down as the number of CPUs increases). Bilibili. https://www.bilibili.com/video/BV17fWheME71

Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
