Why My Go Service Slowed Down on a 128‑Core Server

A 128‑core, 256‑thread server should boost Go microservice performance, but the author explains how NUMA architecture, Go's scheduler affinity loss during GC pauses, and non‑NUMA‑aware memory allocation cause cache misses, remote memory penalties, and higher latency, preventing linear scaling.

TonyBai

Go Scheduler’s "Intermittent Amnesia"

On machines with up to about 32 cores, Go’s GMP scheduler (G for goroutine, M for machine, meaning an OS thread, and P for processor) tends to keep a goroutine on the same P and OS thread, preserving L1/L2 cache locality. When the hardware jumps to 128 cores with 256 hardware threads (runtime.NumCPU() returns 256), this affinity breaks down.
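To see how wide the scheduler is on a given machine, the two numbers mentioned above can be queried directly. This is a minimal sketch; schedulerWidth is a hypothetical helper name, and the number of Ps usually defaults to the logical CPU count unless GOMAXPROCS or a container limit lowers it.

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerWidth reports how many logical CPUs the runtime sees and how many
// Ps (scheduler processors) it will use to run goroutines.
func schedulerWidth() (logicalCPUs, procs int) {
	// GOMAXPROCS(0) queries the current setting without changing it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := schedulerWidth()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS (number of Ps): %d\n", cpus, procs)
}
```

On the server described in the article, the first value would be 256.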

During a GC stop‑the‑world (STW) pause, all Ps are frozen. After the pause, the scheduler may reassign a revived P to any idle M, which the author likens to being moved from a well‑stocked desk to a random one, causing severe cache misses.

Enabling the Execution Trace shows goroutines hopping between many CPUs within a few milliseconds, creating a performance black hole.

The NUMA Penalty of Cross‑Node Memory Traffic

On a 128‑core server, memory is divided among several NUMA nodes, each serving 16 to 64 cores. The cost of a memory access depends on which node owns the memory:

Local access : a CPU reading memory attached to its own node is very fast.

Remote access : a CPU reading memory attached to another node sees latency rise to 2× or more.
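On Linux, the node layout can be inspected through sysfs, where each NUMA node lists its CPUs in "cpulist" format such as 0-15,64-79. The following sketch expands that format and prints the CPUs of each node; parseCPUList is a hypothetical helper, and the sysfs path is Linux-specific, so the program simply reports nothing useful elsewhere.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCPUList expands a Linux cpulist string such as "0-3,8-9" into
// individual CPU IDs.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad cpulist entry %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad cpulist entry %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("bad cpulist range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	// Linux exposes one directory per NUMA node under this sysfs path.
	files, _ := filepath.Glob("/sys/devices/system/node/node*/cpulist")
	if len(files) == 0 {
		fmt.Println("no NUMA topology found in sysfs (is this Linux?)")
		return
	}
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			continue
		}
		cpus, err := parseCPUList(string(data))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d CPUs %v\n", filepath.Base(filepath.Dir(f)), len(cpus), cpus)
	}
}
```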

Go is currently not NUMA‑aware. When a goroutine running on Node 0 allocates with new(T), the global free list may hand it memory that physically resides on Node 1, forcing every subsequent read and write of that object to pay remote‑node latency.

The work‑stealing algorithm, once a strength, now becomes a liability: a stealing CPU executes a task whose data remains on the original NUMA node, analogous to stealing bricks but having to transport them across a city each time.
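A back-of-the-envelope model shows how quickly remote accesses add up. The sketch below blends local and remote latency by the fraction of accesses that cross nodes. The 2× penalty comes from the figure quoted above; the 100 ns local latency and the access fractions are placeholders chosen for illustration, not measurements, and effectiveLatency is a hypothetical helper.

```go
package main

import "fmt"

// effectiveLatency returns the average memory latency when remoteFraction of
// accesses go to a remote NUMA node that is `penalty` times slower than local.
func effectiveLatency(localNs, penalty, remoteFraction float64) float64 {
	return localNs * ((1 - remoteFraction) + remoteFraction*penalty)
}

func main() {
	const localNs = 100.0 // placeholder value, not a measurement
	const penalty = 2.0   // "2x or more" per the article

	for _, r := range []float64{0, 0.25, 0.5, 1} {
		fmt.Printf("remote fraction %.2f -> average latency %.0f ns\n",
			r, effectiveLatency(localNs, penalty, r))
	}
}
```

Under these assumptions, a stolen task whose data lives entirely on another node runs every access at the full remote cost, which is the situation the brick analogy describes.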

2026: Go Team’s Breakthrough Plan

In Go issues #65694 and #78044, core member Michael Pratt states that eliminating performance bottlenecks on ultra‑high core counts and NUMA is a top priority for the year.

Planned improvements include:

Fixing the "amnesia": CL 714801 makes the runtime try to re‑bind a P to the same M after STW, preserving cache affinity.

Taming GC preemption: new scheduling logic keeps GC workers from evicting running goroutines, so execution environments stay coherent.

Exploring NUMA‑aware memory allocation: future versions aim to prefer local‑node memory and to prioritize work stealing within the same NUMA node.
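The idea of prioritizing same-node stealing can be illustrated with a toy victim-ordering function. This is purely a sketch of the concept and not the Go runtime's code or the planned design: stealOrder and the nodeOf mapping (index is a P, value is its NUMA node) are hypothetical.

```go
package main

import "fmt"

// stealOrder returns the Ps a thief should try to steal from, listing Ps on
// the thief's own NUMA node before Ps on remote nodes.
// nodeOf[p] is the (assumed known) NUMA node of P number p.
func stealOrder(thief int, nodeOf []int) []int {
	var local, remote []int
	for p, node := range nodeOf {
		if p == thief {
			continue // a P never steals from itself
		}
		if node == nodeOf[thief] {
			local = append(local, p)
		} else {
			remote = append(remote, p)
		}
	}
	return append(local, remote...)
}

func main() {
	// Four Ps: P0 and P1 on node 0, P2 and P3 on node 1.
	nodeOf := []int{0, 0, 1, 1}
	for thief := range nodeOf {
		fmt.Printf("P%d tries victims in order %v\n", thief, stealOrder(thief, nodeOf))
	}
}
```

Remote Ps stay in the list as a fallback, so an idle P can still find work when its own node has none.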

Practical Guidance for Cloud‑Native Developers

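Before rewriting code, measure first: numastat reports per‑node hit and miss counters, and top can display the CPU each thread last ran on. For latency‑critical workloads, consider confining the process to a single NUMA node, for example through cgroup pinning. Note that runtime.LockOSThread() on its own only wires a goroutine to its current OS thread; it does not pin that thread to a CPU or node, so it helps only in combination with an OS‑level affinity setting.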
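The same counters that numastat prints are available per node on Linux in /sys/devices/system/node/node*/numastat, as lines of the form "numa_hit 12345". The sketch below parses them and computes a simple hit rate as numa_hit / (numa_hit + numa_miss). parseNumastat and hitRate are hypothetical helpers, and this ratio is one reasonable definition rather than an official metric; the counters are system-wide per node, not per process.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNumastat parses "name value" lines from a per-node numastat file.
func parseNumastat(text string) (map[string]uint64, error) {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(text, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in line %q: %w", line, err)
		}
		stats[fields[0]] = v
	}
	return stats, nil
}

// hitRate returns numa_hit / (numa_hit + numa_miss); ok is false when both
// counters are zero or absent.
func hitRate(stats map[string]uint64) (rate float64, ok bool) {
	hit, miss := stats["numa_hit"], stats["numa_miss"]
	if hit+miss == 0 {
		return 0, false
	}
	return float64(hit) / float64(hit+miss), true
}

func main() {
	files, _ := filepath.Glob("/sys/devices/system/node/node*/numastat")
	if len(files) == 0 {
		fmt.Println("no per-node numastat files found (is this Linux?)")
		return
	}
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			continue
		}
		stats, err := parseNumastat(string(data))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if rate, ok := hitRate(stats); ok {
			fmt.Printf("%s: hit rate %.2f%%\n", filepath.Base(filepath.Dir(f)), rate*100)
		}
	}
}
```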

Understanding the physical footprint of code on modern hardware is essential as single‑core performance plateaus and core counts soar toward 256‑512 cores.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

TonyBai

Tony Bai's tech world (tonybai.com). Not satisfied with just "knowing how", we strive for mastery. Focused on Go language internals, high-quality engineering practices, and cloud‑native architecture, exploring cutting‑edge intersections of Go and AI. Gophers who pursue technology are welcome—follow me and evolve with Go.
