Unlock 95% CPU Utilization in Go: 7 Scheduler Pitfalls and Real‑World Fixes
This article examines why Go programs often suffer from low CPU usage, explores seven common scheduler pitfalls through real production cases, and provides concrete techniques—such as separating I/O from CPU work, tuning GOMAXPROCS, and using worker pools—to boost utilization from 30% to 95% and dramatically improve latency.
Why CPU Utilization May Stay Low
On a 16‑core server we observed 15% CPU usage, more than 120 000 goroutines, and request latencies above 8 s. The root cause: large numbers of goroutines were performing blocking I/O (e.g., os.ReadFile). Each blocking call ties up an OS thread (M) and forces the runtime to hand off or idle the associated P, so runnable goroutines piled up in the run queues and CPU‑bound work was rarely scheduled.
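The shape of the problem looked roughly like the sketch below (handleFiles and parse are hypothetical names, not the production code): one goroutine per file, each making a blocking read.
// Hypothetical reconstruction of the anti‑pattern: a goroutine per file,
// each blocking an OS thread inside os.ReadFile while CPU work starves.
func handleFiles(paths []string) {
    for _, p := range paths {
        go func(p string) {
            data, err := os.ReadFile(p) // blocking system call ties up an M
            if err != nil {
                return
            }
            parse(data) // CPU-bound work rarely gets scheduled
        }(p)
    }
}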
GMP Scheduler Model
Core Entities
G (goroutine): user‑level thread. Starts with a 2 KB stack that grows and shrinks automatically. Creation cost ≈ 2 µs, context‑switch cost ≈ 0.2 µs.
M (machine): OS thread that actually executes code. Default hard limit of 10 000 threads (adjustable via debug.SetMaxThreads). Creation cost ≈ 1 ms, switch cost ≈ 1 µs.
P (processor): logical processor that owns a local run queue (max 256 Gs) and an mcache. The number of Ps defaults to runtime.NumCPU() and can be changed with runtime.GOMAXPROCS.
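The knobs mentioned above can be inspected (and, if truly necessary, adjusted) from code; the snippet below is only a small illustration of those standard‑library calls.
import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func printSchedulerConfig() {
    fmt.Println("CPUs:", runtime.NumCPU())            // cores visible to the Go runtime
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // passing 0 queries without changing it
    fmt.Println("Goroutines:", runtime.NumGoroutine())
    debug.SetMaxThreads(20000) // raises the default 10 000 OS-thread hard limit
}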
Why a Separate P?
Early Go (1.0) let Ms pull work directly from a single global queue, causing heavy lock contention and poor cache locality. The modern G‑M‑P design gives each P a private run queue, eliminating the global lock for most scheduling operations and improving performance roughly tenfold (e.g., creating one million goroutines took ~800 ms on Go 1.0 vs ~80 ms on Go 1.1).
When Does a Goroutine Switch?
Explicit Yield
func worker() {
for i := 0; i < 1_000_000; i++ {
compute()
runtime.Gosched() // give other Gs a chance
}
}
Blocking Operations
func producer() { ch <- data }                        // blocks if the channel is full
func critical() { mu.Lock(); /* ... */; mu.Unlock() } // blocks if the mutex is held elsewhere
func handler(conn net.Conn) { conn.Read(buf) }        // G parks on the netpoller until data arrives
func delayTask() { time.Sleep(time.Second) }          // timer parks the G until it fires
Preemptive Scheduling (Go 1.14+)
The sysmon thread checks roughly every 10 ms; a goroutine that has run longer than that is sent an asynchronous preemption signal, placed back on a run queue, and another goroutine gets the P.
func main() {
runtime.GOMAXPROCS(1)
go func() { for {} }() // before Go 1.14 this tight loop could never be preempted and the program would hang
done := make(chan bool, 1)
go func() { time.Sleep(100 * time.Millisecond); fmt.Println("I can run now!"); done <- true }()
<-done
}System Call Switching
// Blocking I/O
func loadFile(path string) []byte {
data, _ := os.ReadFile(path) // blocks the M
return data
}
// Non‑blocking network I/O (netpoller)
func handler(conn net.Conn) {
buf := make([]byte, 1024)
conn.Read(buf) // G is parked on netpoller, M runs other Gs
}
Work Stealing
When a P’s local queue is empty, the scheduler first re‑checks the local queue, the global queue, and the netpoller; only then does it try to steal half of a randomly chosen victim P’s queue, making up to four passes over the other Ps before giving up.
// Simplified sketch of the runtime's scheduling loop (not the actual source)
func schedule() {
gp := runqget(_p_)
if gp == nil {
gp = findrunnable()
}
execute(gp)
}
func findrunnable() *g {
// 1. Re‑check local queue
if gp := runqget(_p_); gp != nil { return gp }
// 2. Global queue (checked every 61 schedules)
if sched.runqsize > 0 {
if gp := globrunqget(_p_, sched.runqsize/gomaxprocs+1); gp != nil { return gp }
}
// 3. Netpoller
if netpollinited() {
if gps := netpoll(0); len(gps) > 0 { return gps[0] }
}
// 4. Work stealing from other Ps
for i := 0; i < 4; i++ {
for _, p := range allp {
if p == _p_ { continue }
if gp := runqsteal(_p_, p); gp != nil { return gp }
}
}
// 5‑7. Fallback checks, then spin
return nil
}
Key stealing details: the thief takes half of the victim’s runnable goroutines in one batch, and victims are probed in random order so no single P becomes a hot spot.
Seven Common Scheduler Traps
1. Goroutine Leak
func handleRequest(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
defer cancel()
resultCh := make(chan []byte, 1)
go func() {
data := fetchFromDB()
select {
case resultCh <- data:
case <-ctx.Done():
}
}()
select {
case data := <-resultCh:
w.Write(data)
case <-ctx.Done():
http.Error(w, "timeout", http.StatusGatewayTimeout)
}
}
The buffered result channel plus the select on ctx.Done() lets the worker goroutine exit even when the request times out; with an unbuffered channel and no select, every timed‑out request would leave a goroutine blocked forever. Test for leaks by comparing runtime.NumGoroutine() before and after the workload.
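A minimal sketch of such a check, written as a test against the handler above using net/http/httptest (imports omitted like the other snippets; the iteration count and slack threshold are arbitrary):
func TestHandleRequestDoesNotLeak(t *testing.T) {
    srv := httptest.NewServer(http.HandlerFunc(handleRequest))
    defer srv.Close()

    before := runtime.NumGoroutine()
    for i := 0; i < 100; i++ {
        resp, err := http.Get(srv.URL)
        if err == nil {
            resp.Body.Close()
        }
    }
    time.Sleep(500 * time.Millisecond) // give timed-out workers a chance to finish and exit
    after := runtime.NumGoroutine()
    if after > before+10 { // small slack for the test server's own goroutines
        t.Fatalf("possible goroutine leak: before=%d after=%d", before, after)
    }
}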
2. Blocking System Calls Occupying P
Blocking calls such as os.ReadFile hold an OS thread for their full duration, so cap how many may run concurrently with a simple semaphore:
type FileLoader struct { semaphore chan struct{} }
func NewFileLoader(concurrency int) *FileLoader {
return &FileLoader{semaphore: make(chan struct{}, concurrency)}
}
func (fl *FileLoader) Load(path string) []byte {
fl.semaphore <- struct{}{}
defer func() { <-fl.semaphore }()
data, _ := os.ReadFile(path)
return data
}
loader := NewFileLoader(10)
for _, p := range files {
go func(p string) {
data := loader.Load(p)
process(data)
}(p)
}
3. Improper GOMAXPROCS Settings
// Too low – forces single‑core execution
runtime.GOMAXPROCS(1)
// Too high – creates many Ps, causing excess context switches
runtime.GOMAXPROCS(128) // on a 16‑core machine
Recommended: leave it unset (it defaults to the CPU count) or use go.uber.org/automaxprocs in containers, where the default otherwise reflects the host’s cores rather than the container’s CPU quota.
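A minimal sketch of the container case, assuming go.uber.org/automaxprocs has been added to go.mod; its blank import adjusts GOMAXPROCS to the cgroup CPU quota at startup.
import (
    "fmt"
    "runtime"

    _ "go.uber.org/automaxprocs" // sets GOMAXPROCS from the container's CPU limit at init
)

func main() {
    // Reports the quota-derived value instead of the host's core count.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}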
4. Global Queue Starvation
Local queues are checked on every scheduling round, while the global queue is polled only once every 61 rounds (and whenever a local queue runs dry). Critical tasks parked in the global queue can therefore wait a long time while the local queues stay busy.
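One way to observe this (a diagnostic, not a fix) is the scheduler trace: run the binary with GODEBUG=schedtrace=1000 and watch the runqueue field, which reports the global run queue length. The sample output below is illustrative.
// GODEBUG=schedtrace=1000 ./yourserver
// SCHED 2005ms: gomaxprocs=16 idleprocs=0 threads=34 spinningthreads=1 idlethreads=12 runqueue=87 [12 3 0 45 ...]
//   runqueue=87  -> goroutines waiting in the global queue
//   [12 3 0 ...] -> per-P local queue lengths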
5. Ignoring sysmon
sysmon is a special runtime thread, not attached to any P, that preempts long‑running goroutines, retakes Ps stuck in system calls, wakes the netpoller, and triggers GC when needed. Combining runtime.LockOSThread() with a long‑running tight loop is a trap: the locked M is dedicated to that single goroutine and can never run anything else, so each such loop permanently ties up an OS thread (and before Go 1.14’s asynchronous preemption it could not be interrupted at all).
func cpuIntensiveTask() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
for {
// heavy computation
}
}
6. Spinning Waste
When a P has no runnable G, its M spins briefly before going to sleep, trading a little CPU for lower wake‑up latency. The runtime limits the number of spinning threads on its own; manual tuning is rarely needed, though lowering GOMAXPROCS can reduce spin overhead in very lightly loaded services.
7. Misusing runtime.LockOSThread
// Bad – permanent lock per request leads to M explosion
func handler(w http.ResponseWriter, r *http.Request) {
runtime.LockOSThread()
// …
}
Correct usage is limited to CGO calls, OpenGL/Metal rendering loops, and short critical sections that require thread‑local context:
func callC() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
C.some_function()
}
func renderLoop() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
for { render() }
}
Performance Optimizations in Practice
Case 1 – Image Processing Pipeline (CPU 30 % → 95 %)
type Pipeline struct {
loadCh chan string
processCh chan image.Image
saveCh chan image.Image
}
func NewPipeline() *Pipeline {
p := &Pipeline{
loadCh: make(chan string, 100),
processCh: make(chan image.Image, 100),
saveCh: make(chan image.Image, 100),
}
// I/O workers (few)
for i := 0; i < 10; i++ {
go p.loadWorker()
go p.saveWorker()
}
// CPU workers (one per core)
for i := 0; i < runtime.NumCPU(); i++ {
go p.processWorker()
}
return p
}
func (p *Pipeline) loadWorker() {
for path := range p.loadCh {
img := loadImage(path)
p.processCh <- img
}
}
func (p *Pipeline) processWorker() {
for img := range p.processCh {
processed := process(img)
p.saveCh <- processed
}
}
func (p *Pipeline) saveWorker() {
for img := range p.saveCh {
save(img)
}
}
// Usage
pipeline := NewPipeline()
for _, path := range paths {
pipeline.loadCh <- path
}
close(pipeline.loadCh)
Result: goroutine count drops from ~100 000 to ~50, CPU utilization rises to 95%, and throughput triples.
Case 2 – Gateway Concurrency Control (P99 500 ms → 80 ms)
var queryPool = make(chan struct{}, 1000) // limit concurrent queries
func gateway(w http.ResponseWriter, r *http.Request) {
results := make([]string, 100)
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
wg.Add(1)
go func(idx int) {
defer wg.Done()
queryPool <- struct{}{}
results[idx] = queryService(idx)
<-queryPool
}(i)
}
wg.Wait()
w.Write([]byte(mergeResults(results)))
}
Goroutine count was reduced from ~100 000 to ~10 000 under peak load, and P99 latency fell from 500 ms to 80 ms.
Debugging & Monitoring Tools
Runtime Helpers
import "runtime"
func monitorGoroutines() {
ticker := time.NewTicker(10 * time.Second)
for range ticker.C {
fmt.Printf("Goroutines: %d
", runtime.NumGoroutine())
}
}
func checkGOMAXPROCS() {
fmt.Printf("GOMAXPROCS: %d
", runtime.GOMAXPROCS(0))
}
func forceGC() { runtime.GC() }
func memStats() {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("Alloc: %d MB
", m.Alloc/1024/1024)
fmt.Printf("TotalAlloc: %d MB
", m.TotalAlloc/1024/1024)
fmt.Printf("NumGC: %d
", m.NumGC)
}
pprof Profiling
import _ "net/http/pprof"
func main() {
go func() { http.ListenAndServe("localhost:6060", nil) }()
// business logic …
}
// CPU profile: go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// Goroutine dump: go tool pprof http://localhost:6060/debug/pprof/goroutine
// Block profile: runtime.SetBlockProfileRate(1); go tool pprof http://localhost:6060/debug/pprof/block
// Mutex profile: runtime.SetMutexProfileFraction(1); go tool pprof http://localhost:6060/debug/pprof/mutex
Trace Visualization
import "runtime/trace"
func main() {
f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
runServer()
}
// Then: go tool trace trace.out
The trace UI shows per‑P activity, goroutine states, netpoller events, and system‑call blocking, helping pinpoint why a goroutine spent seconds “waiting to be scheduled”.
Best‑Practice Checklist
Keep total goroutine count reasonable (< 10 000). Use pools or semaphores for massive parallelism.
Detect leaks with runtime.NumGoroutine() before/after a workload.
Set GOMAXPROCS appropriately: default on bare metal, automaxprocs in containers, or tune for CPU‑bound vs I/O‑bound workloads.
Avoid long‑running blocking system calls in goroutines; wrap them in a limited‑concurrency worker pool.
Monitor CPU utilization: near 100% on a CPU‑bound service indicates good saturation; near 0% with high latency usually means goroutines are blocked while Ps sit idle.
Check lock contention with go tool pprof -mutex and block profiling.
Watch memory allocation hot spots with go tool pprof -alloc_space.
Use go tool trace to measure scheduling latency.
References
Go Scheduler Design Doc – https://golang.org/s/go11sched
Go 1.14 Async Preemption – https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption.md
Scalable Go Scheduler – https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw
Code Wrench
Focuses on code debugging, performance optimization, and real-world engineering, sharing efficient development tips and pitfall guides. We break down technical challenges in a down-to-earth style, helping you craft handy tools so every line of code becomes a problem‑solving weapon. 🔧💻