Unveiling Go’s runq: Lock‑Free Queues Behind the Scheduler
This article dissects Go's runtime GPM model and the lock‑free runq data structure, detailing its fields, core operations such as runqput, runqget, runqgrab, and their atomic implementations, while also comparing local and global queues and illustrating the code paths with concrete examples.
We begin by revisiting Go's runtime GPM model, where G represents a goroutine, P an abstract processor (usually matching the number of CPU cores), and M an OS thread. The model connects G to M via P, enabling user‑space scheduling and reducing system‑call overhead.
G represents a goroutine that occupies only a few kilobytes of memory and includes its stack, instruction pointer, and other metadata such as the waiting queue for a blocked channel. P represents a processor, an abstract CPU core whose default count equals the actual core count but can be adjusted via environment variables; it maintains a local goroutine queue and executes goroutines. M represents a machine, i.e., an OS thread that must be bound to a P to run a goroutine. When an M blocks, the runtime creates or reuses another M to keep the number of Ps equal to GOMAXPROCS , fully utilizing CPU resources.
The scheduler uses a work‑stealing algorithm: each P has a local run queue ( runq) and a global run queue. When a P's local queue is empty, it steals half of another P's queue, balancing load dynamically.
runq Data Structure
The runq is a lock‑free circular queue of fixed length 256, stored inside the p struct. Important fields include:
type p struct {
id int32
status uint32 // one of pidle/prunning/...
runqhead uint32
runqtail uint32
runq [256]guintptr
runnext guintptr
// ... other fields omitted for brevity
} runqheadpoints to the head of the queue, runqtail to the tail, and each element of runq holds a guintptr (an alias of uintptr) that stores a pointer to a g (goroutine).
runqput
runqputinserts a goroutine into the local run queue without blocking. The implementation first checks whether the runnext slot should be used; if so, the goroutine is placed there, otherwise it is appended to the tail of runq. If the queue is full, the slow path runqputslow moves half of the local queue to the global queue.
// runqput tries to place g on the local runnable queue.
func runqput(pp *p, gp *g, next bool) {
if !haveSysmon && next {
next = false // avoid runnext when sysmon is absent
}
if randomizeScheduler && next && randn(2) == 0 {
next = false
}
if next {
// fast path for runnext
retryNext:
oldnext := pp.runnext
if !pp.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
goto retryNext
}
if oldnext == 0 {
return
}
gp = oldnext.ptr()
}
// main insertion into runq
retry:
h := atomic.LoadAcq(&pp.runqhead) // ① load head
t := pp.runqtail
if t-h < uint32(len(pp.runq)) { // ② queue not full
pp.runq[t%uint32(len(pp.runq))].set(gp) // ③ store g
atomic.StoreRel(&pp.runqtail, t+1) // ④ advance tail
return
}
if runqputslow(pp, gp, h, t) { // ⑤ slow path when full
return
}
goto retry
}The comments ①‑⑤ correspond to the author's original annotations, showing how the head is loaded, the fullness check, the store, the tail update, and the fallback to runqputslow.
runqputslow
If the local queue is full, runqputslow moves roughly half of the elements (including the new goroutine) to the global queue, which is protected by a lock, making this the performance bottleneck.
runqputbatch
runqputbatchbatches insertion of multiple goroutines, typically stolen from another P. It loops while the local queue has space, stores each goroutine, and updates the tail atomically. If the queue fills, it acquires sched.lock and pushes the remaining batch to the global queue.
func runqputbatch(pp *p, q *gQueue, qsize int) {
h := atomic.LoadAcq(&pp.runqhead) // ① load head
t := pp.runqtail
n := uint32(0)
for !q.empty() && t-h < uint32(len(pp.runq)) { // ② fill while possible
gp := q.pop()
pp.runq[t%uint32(len(pp.runq))].set(gp)
t++
n++
}
qsize -= int(n)
if randomizeScheduler {
// shuffle the newly added entries
off := func(o uint32) uint32 { return (pp.runqtail + o) % uint32(len(pp.runq)) }
for i := uint32(1); i < n; i++ {
j := cheaprandn(i + 1)
pp.runq[off(i)], pp.runq[off(j)] = pp.runq[off(j)], pp.runq[off(i)]
}
}
atomic.StoreRel(&pp.runqtail, t) // ④ update tail
if !q.empty() {
lock(&sched.lock)
globrunqputbatch(q, int32(qsize))
unlock(&sched.lock)
}
}The author highlights why atomic.LoadAcq is used for the head but not for the tail in the subsequent blockquote.
It’s worth pondering why the next line fetches the tail without atomic.LoadAcq .
runqget
runqgetremoves a goroutine from the local queue. It first checks runnext; if present, it atomically swaps it out. Otherwise it loads the head, verifies the queue is non‑empty, reads the goroutine at head % len(runq), and uses atomic.CasRel to advance the head.
func runqget(pp *p) (gp *g, inheritTime bool) {
next := pp.runnext
if next != 0 && pp.runnext.cas(next, 0) { // ① fast path for runnext
return next.ptr(), true
}
for {
h := atomic.LoadAcq(&pp.runqhead) // ② load head
t := pp.runqtail
if t == h { // ③ empty
return nil, false
}
gp := pp.runq[h%uint32(len(pp.runq))].ptr() // ④ read g
if atomic.CasRel(&pp.runqhead, h, h+1) { // ⑤ advance head
return gp, false
}
}
}runqdrain
runqdrainextracts all goroutines from a P's local queue, returning them in a gQueue. It first moves runnext if set, then atomically swaps the head and tail to obtain the range of pending entries, and finally pushes each element into the drain queue.
func runqdrain(pp *p) (drainQ gQueue, n uint32) {
oldNext := pp.runnext
if oldNext != 0 && pp.runnext.cas(oldNext, 0) {
drainQ.pushBack(oldNext.ptr())
n++
}
retry:
h := atomic.LoadAcq(&pp.runqhead) // ② load head
t := pp.runqtail
qn := t - h
if qn == 0 {
return
}
if qn > uint32(len(pp.runq)) { // ③ sanity check
goto retry
}
if !atomic.CasRel(&pp.runqhead, h, h+qn) { // ④ update head
goto retry
}
for i := uint32(0); i < qn; i++ { // ⑤ move to drainQ
gp := pp.runq[(h+i)%uint32(len(pp.runq))].ptr()
drainQ.pushBack(gp)
n++
}
return
}runqgrab
runqgrabis used by the stealing logic to take roughly half of a P's local queue. It calculates the number to steal, optionally steals runnext, copies the selected goroutines into a provided batch buffer, and atomically advances the head.
func runqgrab(pp *p, batch *[256]guintptr, batchHead uint32, stealRunNextG bool) uint32 {
for {
h := atomic.LoadAcq(&pp.runqhead)
t := atomic.LoadAcq(&pp.runqtail)
n := t - h
n = n - n/2 // ① take half
if n == 0 {
if stealRunNextG {
if next := pp.runnext; next != 0 {
if pp.status == _prunning {
if !osHasLowResTimer {
usleep(3)
} else {
osyield()
}
}
if !pp.runnext.cas(next, 0) {
continue
}
batch[batchHead%uint32(len(batch))] = next
return 1
}
}
return 0
}
if n > uint32(len(pp.runq)/2) { // ③ avoid stealing too many
continue
}
for i := uint32(0); i < n; i++ {
g := pp.runq[(h+i)%uint32(len(pp.runq))]
batch[(batchHead+i)%uint32(len(batch))] = g
}
if atomic.CasRel(&pp.runqhead, h, h+n) { // ⑤ update head
return n
}
}
}runqsteal
runqstealsteals a half‑filled batch from another P using runqgrab, then extracts one goroutine from the stolen batch and returns it. If the batch is empty, it returns nil. The function also updates the tail of the stealing P.
func runqsteal(pp, p2 *p, stealRunNextG bool) *g {
t := pp.runqtail
n := runqgrab(p2, &pp.runq, t, stealRunNextG) // ① steal half from p2
if n == 0 {
return nil
}
n--
gp := pp.runq[(t+n)%uint32(len(pp.runq))].ptr() // ② get one g
if n == 0 {
return gp
}
h := atomic.LoadAcq(&pp.runqhead) // ③ load head
if t-h+n >= uint32(len(pp.runq)) { // ④ overflow check
throw("runqsteal: runq overflow")
}
atomic.StoreRel(&pp.runqtail, t+n) // ⑤ update tail
return gp
}gQueue and gList
Go runtime uses two queue types to store goroutines: gQueue: a double‑ended queue with head and tail pointers, supporting push, pushBack, pushBackAll, pop, and popList. gList: a singly‑linked list using the schedlink field of g, offering push, pushAll, and pop.
type gQueue struct {
head guintptr
tail guintptr
}
func (q *gQueue) empty() bool { return q.head == 0 }
func (q *gQueue) push(gp *g) { gp.schedlink = q.head; q.head.set(gp); if q.tail == 0 { q.tail.set(gp) } }
func (q *gQueue) pushBack(gp *g) { gp.schedlink = 0; if q.tail != 0 { q.tail.ptr().schedlink.set(gp) } else { q.head.set(gp) }; q.tail.set(gp) }
func (q *gQueue) pushBackAll(q2 gQueue) { if q2.tail == 0 { return }; q2.tail.ptr().schedlink = 0; if q.tail != 0 { q.tail.ptr().schedlink = q2.head } else { q.head = q2.head }; q.tail = q2.tail }
func (q *gQueue) pop() *g { gp := q.head.ptr(); if gp != nil { q.head = gp.schedlink; if q.head == 0 { q.tail = 0 } }; return gp }
func (q *gQueue) popList() gList { stack := gList{q.head}; *q = gQueue{}; return stack }
type gList struct { head guintptr }
func (l *gList) empty() bool { return l.head == 0 }
func (l *gList) push(gp *g) { gp.schedlink = l.head; l.head.set(gp) }
func (l *gList) pushAll(q gQueue) { if !q.empty() { q.tail.ptr().schedlink = l.head; l.head = q.head } }
func (l *gList) pop() *g { gp := l.head.ptr(); if gp != nil { l.head = gp.schedlink }; return gp }These implementations can be compared with textbook lock‑free queue designs to see how Go adapts them for the scheduler.
Global runq
The global runnable queue ( sched.runq) handles overflow when a P's local queue is exhausted. It is protected by sched.lock, making it simpler but slower than the lock‑free local queues.
var (
sched schedt
)
type schedt struct {
// ... other fields omitted
runq gQueue
runqsize int32
}
func globrunqput(gp *g) {
assertLockHeld(&sched.lock)
sched.runq.pushBack(gp)
sched.runqsize++
}
func globrunqputhead(gp *g) {
assertLockHeld(&sched.lock)
sched.runq.push(gp)
sched.runqsize++
}
func globrunqputbatch(batch *gQueue, n int32) {
assertLockHeld(&sched.lock)
sched.runq.pushBackAll(*batch)
sched.runqsize += n
*batch = gQueue{}
}
func globrunqget(pp *p, max int32) *g {
assertLockHeld(&sched.lock)
if sched.runqsize == 0 { return nil }
n := sched.runqsize/gomaxprocs + 1
if n > sched.runqsize { n = sched.runqsize }
if max > 0 && n > max { n = max }
if n > int32(len(pp.runq))/2 { n = int32(len(pp.runq))/2 }
sched.runqsize -= n
gp := sched.runq.pop()
n--
for ; n > 0; n-- {
gp1 := sched.runq.pop()
runqput(pp, gp1, false)
}
return gp
}The author explains the calculation of n (how many goroutines to pull from the global queue) and the guard against stealing more than half of a local queue, preserving fairness.
Overall, the article walks through the complete reasoning chain—from the high‑level GPM model, through the concrete lock‑free runq implementation, to the interactions with the global queue—providing code‑level insight into why Go can achieve high concurrency with minimal contention.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
BirdNest Tech Talk
Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
