Why Go’s lock‑free PoolDequeue outperforms channels by 10×

This article examines Go’s internal lock‑free single‑producer multi‑consumer queues—PoolDequeue and its dynamic extension PoolChain—detailing their design, atomic operations, and benchmark comparisons that show they can be up to ten times faster than standard channels in a producer‑consumer workload.

BirdNest Tech Talk
BirdNest Tech Talk
BirdNest Tech Talk
Why Go’s lock‑free PoolDequeue outperforms channels by 10×

Producer‑Consumer Modes

The classic producer‑consumer pattern can be classified into four modes:

SPSC – single‑producer single‑consumer

SPMC – single‑producer multiple‑consumer

MPSC – multiple‑producer single‑consumer

MPMC – multiple‑producer multiple‑consumer

Go's channel implements the MPMC mode, while sync.Pool uses a lock‑free structure optimized for the SPMC case.

PoolDequeue

PoolDequeue

is a fixed‑size lock‑free queue built on a ring buffer. It requires exactly one producer; any number of consumers may operate concurrently.

Internal representation:

type poolDequeue struct {
    headTail atomic.Uint64 // high 32 bits = head, low 32 bits = tail
    vals     []eface       // length must be a power of two
}

The headTail field packs the head index in the high 32 bits and the tail index in the low 32 bits. Because len(vals) is a power of two, modulo operations are performed with a bitwise & instead of the slower %.

Producer operation – pushHead

func (d *poolDequeue) pushHead(val any) bool {
    ptrs := d.headTail.Load()
    head, tail := d.unpack(ptrs)
    // full when advancing head would meet tail
    if (tail+uint32(len(d.vals)))&(1<<dequeueBits-1) == head {
        return false // queue full
    }
    slot := &d.vals[head&uint32(len(d.vals)-1)]
    // slot must be empty (type pointer nil) before we write
    if typ := atomic.LoadPointer(&slot.typ); typ != nil {
        return false // another goroutine is cleaning tail
    }
    if val == nil {
        val = dequeueNil(nil) // special sentinel for nil
    }
    *(*any)(unsafe.Pointer(slot)) = val // ① write value
    d.headTail.Add(1 << dequeueBits)    // store barrier, advance head
    return true
}

Only the single producer writes to a slot, so the write at ① is safe without additional synchronization.

Consumer operation – popTail

func (d *poolDequeue) popTail() (any, bool) {
    var slot *eface
    for {
        ptrs := d.headTail.Load()
        head, tail := d.unpack(ptrs)
        if tail == head {
            return nil, false // empty queue
        }
        // attempt to claim the next tail position
        ptrs2 := d.pack(head, tail+1)
        if d.headTail.CompareAndSwap(ptrs, ptrs2) {
            slot = &d.vals[tail&uint32(len(d.vals)-1)]
            break
        }
        // spin‑loop (②) – may become a bottleneck under heavy contention
    }
    val := *(*any)(unsafe.Pointer(slot))
    if val == dequeueNil(nil) {
        val = nil
    }
    slot.val = nil                     // ③ release value
    atomic.StorePointer(&slot.typ, nil) // ④ release type pointer
    return val, true
}

The CAS in the loop atomically increments the tail; after a successful claim the consumer reads the value and clears the slot (③, ④) so the producer can reuse it.

Producer‑only operation – popHead

popHead

mirrors popTail but decrements the head index, allowing the producer to retrieve an element it has pushed but that has not yet been consumed.

PoolChain

PoolChain

builds an unbounded queue by chaining multiple PoolDequeue instances. The head side is accessed only by the producer, the tail side only by consumers, and the linking pointers are updated atomically.

type poolChain struct {
    head *poolChainElt               // producer‑only, no synchronization
    tail atomic.Pointer[poolChainElt] // consumer‑only, atomic ops
}

type poolChainElt struct {
    poolDequeue                     // embedded fixed‑size deque
    next, prev atomic.Pointer[poolChainElt] // next written by producer, read by consumers; prev written by consumers, read by producer
}

When the consumer's current poolDequeue becomes empty, it moves to the next element in the chain. When the producer's head deque fills, a new poolDequeue is allocated and linked, giving the overall structure unlimited capacity.

Full implementation can be inspected at:

https://github.com/golang/go/blob/master/src/sync/poolqueue.go#L220-L302

Benchmark comparison with channel

Benchmarks compare a single producer against ten concurrent consumers for three queue types: PoolDequeue, PoolChain, and a buffered channel. The benchmark code:

func BenchmarkPoolDequeue(b *testing.B) {
    const size = 1024
    pd := NewPoolDequeue(size)
    var wg sync.WaitGroup
    wg.Add(11) // 1 producer + 10 consumers
    go func() {
        for i := 0; i < b.N; i++ {
            pd.PushHead(i)
        }
        wg.Done()
    }()
    for i := 0; i < 10; i++ {
        go func() {
            for {
                if _, ok := pd.PopTail(); !ok {
                    break
                }
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

func BenchmarkPoolChain(b *testing.B) {
    pc := NewPoolChain()
    var wg sync.WaitGroup
    wg.Add(11)
    go func() {
        for i := 0; i < b.N; i++ {
            pc.PushHead(i)
        }
        wg.Done()
    }()
    for i := 0; i < 10; i++ {
        go func() {
            for {
                if _, ok := pc.PopTail(); !ok {
                    break
                }
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

func BenchmarkChannel(b *testing.B) {
    ch := make(chan interface{}, 1024)
    var wg sync.WaitGroup
    wg.Add(11)
    go func() {
        for i := 0; i < b.N; i++ {
            ch <- i
        }
        close(ch)
        wg.Done()
    }()
    for i := 0; i < 10; i++ {
        go func() {
            for range ch {
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

Results (averaged over multiple runs) show: PoolDequeue ≈ 10× faster than the buffered channel. PoolChain ≈ 5× faster than the channel but about half the throughput of PoolDequeue because of the overhead of managing the linked list of deques.

The accompanying chart visualises the speedup:

Performance comparison chart
Performance comparison chart

These measurements confirm that Go's internal lock‑free queues provide a substantial latency and throughput advantage for the single‑producer multi‑consumer pattern, and that PoolChain offers a scalable alternative when unbounded capacity is required.

For experimentation, the author duplicated the implementation in a separate repository with unit tests and additional benchmarks:

https://github.com/smallnest/exp/blob/master/gods/poolqueue.go

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceconcurrencyGoBenchmarklock‑freesync.PoolQueue
BirdNest Tech Talk
Written by

BirdNest Tech Talk

Author of the rpcx microservice framework, original book author, and chair of Baidu's Go CMC committee.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.