Fundamentals 24 min read

Unveiling Go’s runq: Lock‑Free Queues Behind the Scheduler

This article dissects Go's runtime GPM model and the lock‑free runq data structure, detailing its fields, core operations such as runqput, runqget, runqgrab, and their atomic implementations, while also comparing local and global queues and illustrating the code paths with concrete examples.

BirdNest Tech Talk

Oct 20, 2024

Unveiling Go’s runq: Lock‑Free Queues Behind the Scheduler

We begin by revisiting Go's runtime GPM model, where G represents a goroutine, P an abstract processor (usually matching the number of CPU cores), and M an OS thread. The model connects G to M via P, enabling user‑space scheduling and reducing system‑call overhead.

G represents a goroutine that occupies only a few kilobytes of memory and includes its stack, instruction pointer, and other metadata such as the waiting queue for a blocked channel. P represents a processor, an abstract CPU core whose default count equals the actual core count but can be adjusted via environment variables; it maintains a local goroutine queue and executes goroutines. M represents a machine, i.e., an OS thread that must be bound to a P to run a goroutine. When an M blocks, the runtime creates or reuses another M to keep the number of Ps equal to GOMAXPROCS , fully utilizing CPU resources.

The scheduler uses a work‑stealing algorithm: each P has a local run queue ( runq) and a global run queue. When a P's local queue is empty, it steals half of another P's queue, balancing load dynamically.

runq Data Structure

The runq is a lock‑free circular queue of fixed length 256, stored inside the p struct. Important fields include:

type p struct {
    id          int32
    status      uint32 // one of pidle/prunning/...
    runqhead    uint32
    runqtail    uint32
    runq       [256]guintptr
    runnext     guintptr
    // ... other fields omitted for brevity
}

runqhead

points to the head of the queue, runqtail to the tail, and each element of runq holds a guintptr (an alias of uintptr) that stores a pointer to a g (goroutine).

runqput

runqput

inserts a goroutine into the local run queue without blocking. The implementation first checks whether the runnext slot should be used; if so, the goroutine is placed there, otherwise it is appended to the tail of runq. If the queue is full, the slow path runqputslow moves half of the local queue to the global queue.

// runqput tries to place g on the local runnable queue.
func runqput(pp *p, gp *g, next bool) {
    if !haveSysmon && next {
        next = false // avoid runnext when sysmon is absent
    }
    if randomizeScheduler && next && randn(2) == 0 {
        next = false
    }
    if next {
        // fast path for runnext
        retryNext:
        oldnext := pp.runnext
        if !pp.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
            goto retryNext
        }
        if oldnext == 0 {
            return
        }
        gp = oldnext.ptr()
    }
    // main insertion into runq
    retry:
    h := atomic.LoadAcq(&pp.runqhead) // ① load head
    t := pp.runqtail
    if t-h < uint32(len(pp.runq)) { // ② queue not full
        pp.runq[t%uint32(len(pp.runq))].set(gp) // ③ store g
        atomic.StoreRel(&pp.runqtail, t+1)      // ④ advance tail
        return
    }
    if runqputslow(pp, gp, h, t) { // ⑤ slow path when full
        return
    }
    goto retry
}

The comments ①‑⑤ correspond to the author's original annotations, showing how the head is loaded, the fullness check, the store, the tail update, and the fallback to runqputslow.

runqputslow

If the local queue is full, runqputslow moves roughly half of the elements (including the new goroutine) to the global queue, which is protected by a lock, making this the performance bottleneck.

runqputbatch

runqputbatch

batches insertion of multiple goroutines, typically stolen from another P. It loops while the local queue has space, stores each goroutine, and updates the tail atomically. If the queue fills, it acquires sched.lock and pushes the remaining batch to the global queue.

func runqputbatch(pp *p, q *gQueue, qsize int) {
    h := atomic.LoadAcq(&pp.runqhead) // ① load head
    t := pp.runqtail
    n := uint32(0)
    for !q.empty() && t-h < uint32(len(pp.runq)) { // ② fill while possible
        gp := q.pop()
        pp.runq[t%uint32(len(pp.runq))].set(gp)
        t++
        n++
    }
    qsize -= int(n)
    if randomizeScheduler {
        // shuffle the newly added entries
        off := func(o uint32) uint32 { return (pp.runqtail + o) % uint32(len(pp.runq)) }
        for i := uint32(1); i < n; i++ {
            j := cheaprandn(i + 1)
            pp.runq[off(i)], pp.runq[off(j)] = pp.runq[off(j)], pp.runq[off(i)]
        }
    }
    atomic.StoreRel(&pp.runqtail, t) // ④ update tail
    if !q.empty() {
        lock(&sched.lock)
        globrunqputbatch(q, int32(qsize))
        unlock(&sched.lock)
    }
}

The author highlights why atomic.LoadAcq is used for the head but not for the tail in the subsequent blockquote.

It’s worth pondering why the next line fetches the tail without atomic.LoadAcq .

runqget

runqget

removes a goroutine from the local queue. It first checks runnext; if present, it atomically swaps it out. Otherwise it loads the head, verifies the queue is non‑empty, reads the goroutine at head % len(runq), and uses atomic.CasRel to advance the head.

func runqget(pp *p) (gp *g, inheritTime bool) {
    next := pp.runnext
    if next != 0 && pp.runnext.cas(next, 0) { // ① fast path for runnext
        return next.ptr(), true
    }
    for {
        h := atomic.LoadAcq(&pp.runqhead) // ② load head
        t := pp.runqtail
        if t == h { // ③ empty
            return nil, false
        }
        gp := pp.runq[h%uint32(len(pp.runq))].ptr() // ④ read g
        if atomic.CasRel(&pp.runqhead, h, h+1) { // ⑤ advance head
            return gp, false
        }
    }
}

runqdrain

runqdrain

extracts all goroutines from a P's local queue, returning them in a gQueue. It first moves runnext if set, then atomically swaps the head and tail to obtain the range of pending entries, and finally pushes each element into the drain queue.

func runqdrain(pp *p) (drainQ gQueue, n uint32) {
    oldNext := pp.runnext
    if oldNext != 0 && pp.runnext.cas(oldNext, 0) {
        drainQ.pushBack(oldNext.ptr())
        n++
    }
retry:
    h := atomic.LoadAcq(&pp.runqhead) // ② load head
    t := pp.runqtail
    qn := t - h
    if qn == 0 {
        return
    }
    if qn > uint32(len(pp.runq)) { // ③ sanity check
        goto retry
    }
    if !atomic.CasRel(&pp.runqhead, h, h+qn) { // ④ update head
        goto retry
    }
    for i := uint32(0); i < qn; i++ { // ⑤ move to drainQ
        gp := pp.runq[(h+i)%uint32(len(pp.runq))].ptr()
        drainQ.pushBack(gp)
        n++
    }
    return
}

runqgrab

runqgrab

is used by the stealing logic to take roughly half of a P's local queue. It calculates the number to steal, optionally steals runnext, copies the selected goroutines into a provided batch buffer, and atomically advances the head.

func runqgrab(pp *p, batch *[256]guintptr, batchHead uint32, stealRunNextG bool) uint32 {
    for {
        h := atomic.LoadAcq(&pp.runqhead)
        t := atomic.LoadAcq(&pp.runqtail)
        n := t - h
        n = n - n/2 // ① take half
        if n == 0 {
            if stealRunNextG {
                if next := pp.runnext; next != 0 {
                    if pp.status == _prunning {
                        if !osHasLowResTimer {
                            usleep(3)
                        } else {
                            osyield()
                        }
                    }
                    if !pp.runnext.cas(next, 0) {
                        continue
                    }
                    batch[batchHead%uint32(len(batch))] = next
                    return 1
                }
            }
            return 0
        }
        if n > uint32(len(pp.runq)/2) { // ③ avoid stealing too many
            continue
        }
        for i := uint32(0); i < n; i++ {
            g := pp.runq[(h+i)%uint32(len(pp.runq))]
            batch[(batchHead+i)%uint32(len(batch))] = g
        }
        if atomic.CasRel(&pp.runqhead, h, h+n) { // ⑤ update head
            return n
        }
    }
}

runqsteal

runqsteal

steals a half‑filled batch from another P using runqgrab, then extracts one goroutine from the stolen batch and returns it. If the batch is empty, it returns nil. The function also updates the tail of the stealing P.

func runqsteal(pp, p2 *p, stealRunNextG bool) *g {
    t := pp.runqtail
    n := runqgrab(p2, &pp.runq, t, stealRunNextG) // ① steal half from p2
    if n == 0 {
        return nil
    }
    n--
    gp := pp.runq[(t+n)%uint32(len(pp.runq))].ptr() // ② get one g
    if n == 0 {
        return gp
    }
    h := atomic.LoadAcq(&pp.runqhead) // ③ load head
    if t-h+n >= uint32(len(pp.runq)) { // ④ overflow check
        throw("runqsteal: runq overflow")
    }
    atomic.StoreRel(&pp.runqtail, t+n) // ⑤ update tail
    return gp
}

gQueue and gList

Go runtime uses two queue types to store goroutines: gQueue: a double‑ended queue with head and tail pointers, supporting push, pushBack, pushBackAll, pop, and popList. gList: a singly‑linked list using the schedlink field of g, offering push, pushAll, and pop.

type gQueue struct {
    head guintptr
    tail guintptr
}
func (q *gQueue) empty() bool { return q.head == 0 }
func (q *gQueue) push(gp *g) { gp.schedlink = q.head; q.head.set(gp); if q.tail == 0 { q.tail.set(gp) } }
func (q *gQueue) pushBack(gp *g) { gp.schedlink = 0; if q.tail != 0 { q.tail.ptr().schedlink.set(gp) } else { q.head.set(gp) }; q.tail.set(gp) }
func (q *gQueue) pushBackAll(q2 gQueue) { if q2.tail == 0 { return }; q2.tail.ptr().schedlink = 0; if q.tail != 0 { q.tail.ptr().schedlink = q2.head } else { q.head = q2.head }; q.tail = q2.tail }
func (q *gQueue) pop() *g { gp := q.head.ptr(); if gp != nil { q.head = gp.schedlink; if q.head == 0 { q.tail = 0 } }; return gp }
func (q *gQueue) popList() gList { stack := gList{q.head}; *q = gQueue{}; return stack }

type gList struct { head guintptr }
func (l *gList) empty() bool { return l.head == 0 }
func (l *gList) push(gp *g) { gp.schedlink = l.head; l.head.set(gp) }
func (l *gList) pushAll(q gQueue) { if !q.empty() { q.tail.ptr().schedlink = l.head; l.head = q.head } }
func (l *gList) pop() *g { gp := l.head.ptr(); if gp != nil { l.head = gp.schedlink }; return gp }

These implementations can be compared with textbook lock‑free queue designs to see how Go adapts them for the scheduler.

Global runq

The global runnable queue ( sched.runq) handles overflow when a P's local queue is exhausted. It is protected by sched.lock, making it simpler but slower than the lock‑free local queues.

var (
    sched schedt
)

type schedt struct {
    // ... other fields omitted
    runq     gQueue
    runqsize int32
}

func globrunqput(gp *g) {
    assertLockHeld(&sched.lock)
    sched.runq.pushBack(gp)
    sched.runqsize++
}

func globrunqputhead(gp *g) {
    assertLockHeld(&sched.lock)
    sched.runq.push(gp)
    sched.runqsize++
}

func globrunqputbatch(batch *gQueue, n int32) {
    assertLockHeld(&sched.lock)
    sched.runq.pushBackAll(*batch)
    sched.runqsize += n
    *batch = gQueue{}
}

func globrunqget(pp *p, max int32) *g {
    assertLockHeld(&sched.lock)
    if sched.runqsize == 0 { return nil }
    n := sched.runqsize/gomaxprocs + 1
    if n > sched.runqsize { n = sched.runqsize }
    if max > 0 && n > max { n = max }
    if n > int32(len(pp.runq))/2 { n = int32(len(pp.runq))/2 }
    sched.runqsize -= n
    gp := sched.runq.pop()
    n--
    for ; n > 0; n-- {
        gp1 := sched.runq.pop()
        runqput(pp, gp1, false)
    }
    return gp
}

The author explains the calculation of n (how many goroutines to pull from the global queue) and the guard against stealing more than half of a local queue, preserving fairness.

Overall, the article walks through the complete reasoning chain—from the high‑level GPM model, through the concrete lock‑free runq implementation, to the interactions with the global queue—providing code‑level insight into why Go can achieve high concurrency with minimal contention.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

concurrency Go Scheduler Runtime Data Structures lock‑free runq

Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.