Unlock 95% CPU Utilization in Go: 7 Scheduler Pitfalls and Real‑World Fixes
This article examines why Go programs often suffer from low CPU usage, explores seven common scheduler pitfalls through real production cases, and provides concrete techniques—such as separating I/O from CPU work, tuning GOMAXPROCS, and using worker pools—to boost utilization from 30% to 95% and dramatically improve latency.
Why CPU Utilization May Stay Low
On a 16‑core server we observed 15% CPU usage, more than 120 000 goroutines, and request latencies above 8 s. The root cause: large numbers of goroutines were performing blocking I/O (e.g., os.ReadFile). Each blocking call ties up an OS thread (M) and forces the runtime to hand off or idle the associated P, so runnable goroutines piled up in the run queues and CPU‑bound work was rarely scheduled.
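The shape of the problem looked roughly like the sketch below (handleFiles and parse are hypothetical names, not the production code): one goroutine per file, each making a blocking read.
// Hypothetical reconstruction of the anti‑pattern: a goroutine per file,
// each blocking an OS thread inside os.ReadFile while CPU work starves.
func handleFiles(paths []string) {
    for _, p := range paths {
        go func(p string) {
            data, err := os.ReadFile(p) // blocking system call ties up an M
            if err != nil {
                return
            }
            parse(data) // CPU-bound work rarely gets scheduled
        }(p)
    }
}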
GMP Scheduler Model
Core Entities
G (goroutine): user‑level thread. Starts with a 2 KB stack that grows and shrinks automatically. Creation cost ≈ 2 µs, context‑switch cost ≈ 0.2 µs.
M (machine): OS thread that actually executes code. Default hard limit of 10 000 threads (adjustable via debug.SetMaxThreads). Creation cost ≈ 1 ms, switch cost ≈ 1 µs.
P (processor): logical processor that owns a local run queue (max 256 Gs) and an mcache. The number of Ps defaults to runtime.NumCPU() and can be changed with runtime.GOMAXPROCS.
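The knobs mentioned above can be inspected (and, if truly necessary, adjusted) from code; the snippet below is only a small illustration of those standard‑library calls.
import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func printSchedulerConfig() {
    fmt.Println("CPUs:", runtime.NumCPU())            // cores visible to the Go runtime
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 queries without changing it
    fmt.Println("Goroutines:", runtime.NumGoroutine())
    debug.SetMaxThreads(20000) // raises the default 10 000 OS-thread hard limit
}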
Why a Separate P?
Early Go (1.0) let Ms pull work directly from a single global queue, causing heavy lock contention and poor cache locality. The modern G‑M‑P design gives each P a private run queue, eliminating the global lock for most scheduling operations and improving performance roughly tenfold (e.g., creating one million goroutines took ~800 ms on Go 1.0 vs ~80 ms on Go 1.1).
When Does a Goroutine Switch?
Explicit Yield
func worker() {
for i := 0; i < 1_000_000; i++ {
compute()
runtime.Gosched() // give other Gs a chance
}
}
Blocking Operations
func producer() { ch <- data }                        // blocks if the channel is full
func critical() { mu.Lock(); /* ... */; mu.Unlock() } // blocks if the mutex is held elsewhere
func handler(conn net.Conn) { conn.Read(buf) }        // G parks on the netpoller until data arrives
func delayTask() { time.Sleep(time.Second) }          // timer parks the G until it fires
Preemptive Scheduling (Go 1.14+)
The sysmon thread checks roughly every 10 ms; a goroutine that has run longer than that is sent an asynchronous preemption signal, placed back on a run queue, and another goroutine gets the P.
func main() {
runtime.GOMAXPROCS(1)
go func() { for {} }() // before Go 1.14 this tight loop could never be preempted and the program would hang
done := make(chan bool, 1)
go func() { time.Sleep(100 * time.Millisecond); fmt.Println("I can run now!"); done <- true }()
<-done
}System Call Switching
// Blocking I/O
func loadFile(path string) []byte {
data, _ := os.ReadFile(path) // blocks the M
return data
}
// Non‑blocking network I/O (netpoller)
func handler(conn net.Conn) {
buf := make([]byte, 1024)
conn.Read(buf) // G is parked on netpoller, M runs other Gs
}
Work Stealing
When a P’s local queue is empty, the scheduler first re‑checks the local queue, the global queue, and the netpoller; only then does it try to steal half of a randomly chosen victim P’s queue, making up to four passes over the other Ps before giving up.
// Simplified sketch of the runtime's scheduling loop (not the actual source)
func schedule() {
gp := runqget(_p_)
if gp == nil {
gp = findrunnable()
}
execute(gp)
}
func findrunnable() *g {
// 1. Re‑check local queue
if gp := runqget(_p_); gp != nil { return gp }
// 2. Global queue (checked every 61 schedules)
if sched.runqsize > 0 {
if gp := globrunqget(_p_, sched.runqsize/gomaxprocs+1); gp != nil { return gp }
}
// 3. Netpoller
if netpollinited() {
if gps := netpoll(0); len(gps) > 0 { return gps[0] }
}
// 4. Work stealing from other Ps
for i := 0; i < 4; i++ {
for _, p := range allp {
if p == _p_ { continue }
if gp := runqsteal(_p_, p); gp != nil { return gp }
}
}
// 5‑7. Fallback checks, then spin
return nil
}
Key stealing details: the thief takes half of the victim’s runnable goroutines in one batch, and victims are probed in random order so no single P becomes a hot spot.
Seven Common Scheduler Traps
1. Goroutine Leak
func handleRequest(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
defer cancel()
resultCh := make(chan []byte, 1)
go func() {
data := fetchFromDB()
select {
case resultCh <- data:
case <-ctx.Done():
}
}()
select {
case data := <-resultCh:
w.Write(data)
case <-ctx.Done():
http.Error(w, "timeout", http.StatusGatewayTimeout)
}
}
The buffered result channel plus the select on ctx.Done() lets the worker goroutine exit even when the request times out; with an unbuffered channel and no select, every timed‑out request would leave a goroutine blocked forever. Test for leaks by comparing runtime.NumGoroutine() before and after the workload.
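A minimal sketch of such a check, written as a test against the handler above using net/http/httptest (imports omitted like the other snippets; the iteration count and slack threshold are arbitrary):
func TestHandleRequestDoesNotLeak(t *testing.T) {
    srv := httptest.NewServer(http.HandlerFunc(handleRequest))
    defer srv.Close()

    before := runtime.NumGoroutine()
    for i := 0; i < 100; i++ {
        resp, err := http.Get(srv.URL)
        if err == nil {
            resp.Body.Close()
        }
    }
    time.Sleep(500 * time.Millisecond) // give timed-out workers a chance to finish and exit
    after := runtime.NumGoroutine()
    if after > before+10 { // small slack for the test server's own goroutines
        t.Fatalf("possible goroutine leak: before=%d after=%d", before, after)
    }
}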
2. Blocking System Calls Occupying P
Blocking calls such as os.ReadFile hold an OS thread for their full duration, so cap how many may run concurrently with a simple semaphore:
type FileLoader struct { semaphore chan struct{} }
func NewFileLoader(concurrency int) *FileLoader {
return &FileLoader{semaphore: make(chan struct{}, concurrency)}
}
func (fl *FileLoader) Load(path string) []byte {
fl.semaphore <- struct{}{}
defer func() { <-fl.semaphore }()
data, _ := os.ReadFile(path)
return data
}
loader := NewFileLoader(10)
for _, p := range files {
go func(p string) {
data := loader.Load(p)
process(data)
}(p)
}
3. Improper GOMAXPROCS Settings
// Too low – forces single‑core execution
runtime.GOMAXPROCS(1)
// Too high – creates many Ps, causing excess context switches
runtime.GOMAXPROCS(128) // on a 16‑core machine
Recommended: leave it unset (it defaults to the CPU count) or use go.uber.org/automaxprocs in containers, where the default otherwise reflects the host’s cores rather than the container’s CPU quota.
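A minimal sketch of the container case, assuming go.uber.org/automaxprocs has been added to go.mod; its blank import adjusts GOMAXPROCS to the cgroup CPU quota at startup.
import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the container's CPU limit at init
)

func main() {
    // Reports the quota-derived value instead of the host's core count.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}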
4. Global Queue Starvation
Local queues are checked on every scheduling round, while the global queue is polled only once every 61 rounds (and whenever a local queue runs dry). Critical tasks parked in the global queue can therefore wait a long time while the local queues stay busy.
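One way to observe this (a diagnostic, not a fix) is the scheduler trace: run the binary with GODEBUG=schedtrace=1000 and watch the runqueue field, which reports the global run queue length. The sample output below is illustrative.
// GODEBUG=schedtrace=1000 ./yourserver
// SCHED 2005ms: gomaxprocs=16 idleprocs=0 threads=34 spinningthreads=1 idlethreads=12 runqueue=87 [12 3 0 45 ...]
//   runqueue=87  -> goroutines waiting in the global queue
//   [12 3 0 ...] -> per-P local queue lengths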
5. Ignoring sysmon
sysmon is a special runtime thread, not attached to any P, that preempts long‑running goroutines, retakes Ps stuck in system calls, wakes the netpoller, and triggers GC when needed. Combining runtime.LockOSThread() with a long‑running tight loop is a trap: the locked M is dedicated to that single goroutine and can never run anything else, so each such loop permanently ties up an OS thread (and before Go 1.14’s asynchronous preemption it could not be interrupted at all).
func cpuIntensiveTask() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
for {
// heavy computation
}
}
6. Spinning Waste
When a P has no runnable G, its M spins briefly before going to sleep, trading a little CPU for lower wake‑up latency. The runtime limits the number of spinning threads on its own; manual tuning is rarely needed, though lowering GOMAXPROCS can reduce spin overhead in very lightly loaded services.
7. Misusing runtime.LockOSThread
// Bad – permanent lock per request leads to M explosion
func handler(w http.ResponseWriter, r *http.Request) {
runtime.LockOSThread()
// …
}
Correct usage is limited to CGO calls, OpenGL/Metal rendering loops, and short critical sections that require thread‑local context:
func callC() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
C.some_function()
}
func renderLoop() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
for { render() }
}
Performance Optimizations in Practice
Case 1 – Image Processing Pipeline (CPU 30 % → 95 %)
type Pipeline struct {
loadCh chan string
processCh chan image.Image
saveCh chan image.Image
}
func NewPipeline() *Pipeline {
p := &Pipeline{
loadCh: make(chan string, 100),
processCh: make(chan image.Image, 100),
saveCh: make(chan image.Image, 100),
}
// I/O workers (few)
for i := 0; i < 10; i++ {
go p.loadWorker()
go p.saveWorker()
}
// CPU workers (one per core)
for i := 0; i < runtime.NumCPU(); i++ {
go p.processWorker()
}
return p
}
func (p *Pipeline) loadWorker() {
for path := range p.loadCh {
img := loadImage(path)
p.processCh <- img
}
}
func (p *Pipeline) processWorker() {
for img := range p.processCh {
processed := process(img)
p.saveCh <- processed
}
}
func (p *Pipeline) saveWorker() {
for img := range p.saveCh {
save(img)
}
}
// Usage
pipeline := NewPipeline()
for _, path := range paths {
pipeline.loadCh <- path
}
close(pipeline.loadCh)
Result: goroutine count drops from ~100 000 to ~50, CPU utilization rises to 95%, and throughput triples.
Case 2 – Gateway Concurrency Control (P99 500 ms → 80 ms)
var queryPool = make(chan struct{}, 1000) // limit concurrent queries
func gateway(w http.ResponseWriter, r *http.Request) {
results := make([]string, 100)
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func(idx int) {
defer wg.Done()
queryPool <- struct{}{}
results[idx] = queryService(idx)
<-queryPool
}(i)
}
wg.Wait()
w.Write([]byte(mergeResults(results)))
}
Goroutine count was reduced from ~100 000 to ~10 000 under peak load, and P99 latency fell from 500 ms to 80 ms.
Debugging & Monitoring Tools
Runtime Helpers
import "runtime"
func monitorGoroutines() {
ticker := time.NewTicker(10 * time.Second)
for range ticker.C {
fmt.Printf("Goroutines: %d
", runtime.NumGoroutine())
}
}
func checkGOMAXPROCS() {
fmt.Printf("GOMAXPROCS: %d
", runtime.GOMAXPROCS(0))
}
func forceGC() { runtime.GC() }
func memStats() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Alloc: %d MB
", m.Alloc/1024/1024)
fmt.Printf("TotalAlloc: %d MB
", m.TotalAlloc/1024/1024)
fmt.Printf("NumGC: %d
", m.NumGC)
}
pprof Profiling
import _ "net/http/pprof"
func main() {
go func() { http.ListenAndServe("localhost:6060", nil) }()
// business logic …
}
// CPU profile: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// Goroutine dump: go tool pprof http://localhost:6060/debug/pprof/goroutine
// Block profile: runtime.SetBlockProfileRate(1); go tool pprof http://localhost:6060/debug/pprof/block
// Mutex profile: runtime.SetMutexProfileFraction(1); go tool pprof http://localhost:6060/debug/pprof/mutex
Trace Visualization
import "runtime/trace"
func main() {
f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
runServer()
}
// Then: go tool trace trace.out
The trace UI shows per‑P activity, goroutine states, netpoller events, and system‑call blocking, helping pinpoint why a goroutine spent seconds “waiting to be scheduled”.
Best‑Practice Checklist
Keep total goroutine count reasonable (< 10 000). Use pools or semaphores for massive parallelism.
Detect leaks with runtime.NumGoroutine() before/after a workload.
Set GOMAXPROCS appropriately: default on bare metal, automaxprocs in containers, or tune for CPU‑bound vs I/O‑bound workloads.
Avoid long‑running blocking system calls in goroutines; wrap them in a limited‑concurrency worker pool.
Monitor CPU utilization: near 100% on a CPU‑bound service indicates good saturation; near 0% with high latency usually means goroutines are blocked while Ps sit idle.
Check lock contention with go tool pprof -mutex and block profiling.
Watch memory allocation hot spots with go tool pprof -alloc_space.
Use go tool trace to measure scheduling latency.
References
Go Scheduler Design Doc – https://golang.org/s/go11sched
Go 1.14 Async Preemption – https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption.md
Scalable Go Scheduler – https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw
Code Wrench
Focuses on code debugging, performance optimization, and real-world engineering, sharing efficient development tips and pitfall guides. We break down technical challenges in a down-to-earth style, helping you craft handy tools so every line of code becomes a problem‑solving weapon. 🔧💻