Fundamentals 15 min read

Inside Linux’s Queued Spinlock: Architecture, Data Structures, and Implementation

This article provides a detailed analysis of Linux kernel spinlocks, focusing on the evolution from ticket‑based locks to the modern queued spinlock (qspinlock), explaining its code hierarchy, data structures, state machine, and the acquisition/release algorithms with accompanying diagrams.

OPPO Kernel Craftsman
OPPO Kernel Craftsman
OPPO Kernel Craftsman
Inside Linux’s Queued Spinlock: Architecture, Data Structures, and Implementation

Linux kernels use spinlocks as a lightweight mutual‑exclusion primitive that busy‑waits instead of sleeping, making them suitable for interrupt contexts. Modern kernels have replaced the older test‑and‑set (TAS) and ticket‑based spinlocks with the queued spinlock (qspinlock) to improve fairness and reduce cache‑line bouncing.

Code hierarchy

The spinlock implementation is organized in three layers:

Top layer : architecture‑independent generic API providing both spinlock and raw_spinlock interfaces. When PREEMPT_RT is enabled, spinlock maps to a priority‑inherited sleeping lock, while raw_spinlock retains the classic spin‑only behavior.

Middle layer : separates SMP and UP implementations. On UP systems the lock collapses to preempt_disable()/preempt_enable(); on SMP platforms the lock contains real spinning logic that depends on the CPU architecture.

Bottom layer : architecture‑specific code. On ARM64 the implementation is the qspinlock, abstracted in kernel/locking/qspinlock.c.

Spinlock code hierarchy
Spinlock code hierarchy

Qspinlock data structures

A qspinlock occupies four bytes and can be interpreted in several ways. Its layout consists of: locked (bit 0): indicates the lock is held. pending (bit 1): used during hand‑off and while threads are queued. tail (16 bits on systems with ≤16 k CPUs, larger on bigger systems): encodes a queue index and the CPU identifier of the last waiter.

When the number of CPUs exceeds 16 k, the tail field expands and the pending bit is compressed to a single bit, leaving the remaining bits for the extended tail.

Qspinlock byte layout
Qspinlock byte layout

To reduce cache‑line bouncing, each CPU hosts up to four per‑CPU MCS lock nodes, one for each possible nesting context (thread, soft‑irq, hard‑irq, NMI). These nodes form a linked list that serializes waiters without causing false sharing.

MCS lock per‑CPU nodes
MCS lock per‑CPU nodes

State representation

The qspinlock state is a triple (locked, pending, tail). Typical states include:

Only locked set – a single owner. pending set without a tail – hand‑off mode where the owner is about to transfer the lock. pending plus a non‑zero tail – one or more waiters queued in the per‑CPU MCS list.

Qspinlock state diagram
Qspinlock state diagram

Acquisition algorithm

Acquiring a qspinlock follows two paths:

Fast path : If the 4‑byte value val is zero, the thread atomically writes 1 (setting locked) and obtains the lock.

Slow path : When the fast path fails, the algorithm examines pending and tail to decide whether to join the MCS queue or spin on the lock.

If only pending is set, the lock is in hand‑off mode; the thread retries a bounded number of times ( _Q_PENDING_LOOPS) before falling back to the queue.

If other waiters are already present ( tail non‑zero) or pending is set, the thread prepares its per‑CPU MCS node, computes a combined tail value (CPU ID + context index), and enqueues itself.

Key helper functions used in the slow path are: queued_fetch_set_pending_acquire() – atomically sets the pending bit and returns the previous lock value. atomic_cond_read_acquire() – spins on the locked bit with a conditional read.

WFE‑based wait – puts the CPU into a low‑power state until the owner issues an event.

Acquire fast path
Acquire fast path

MCS lock integration

If a thread cannot acquire the lock via the pending bit, it enqueues its MCS node:

Obtain the per‑CPU base address of the MCS node array.

Select the node corresponding to the current nesting context (thread, soft‑irq, hard‑irq, NMI).

Compose the new tail value from cpu_id+1 (zero indicates an empty queue) and the context index.

Atomically exchange the lock’s tail with the new value, receiving the previous tail as the predecessor.

If a predecessor exists, link the new node to it and spin on the predecessor’s locked flag; otherwise the thread becomes the first waiter and proceeds to spin on the lock’s locked bit.

MCS node enqueue
MCS node enqueue

Release algorithm

Releasing a qspinlock is performed by queued_spin_unlock(), which clears the locked bit. If a waiter is present in the MCS queue, the unlock operation also wakes the next node by setting its locked flag.

Unlock
Unlock

Reference implementation

The full source code for the qspinlock implementation can be examined in the Linux 5.10.123 tree:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/?h=v5.10.123

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

concurrencyKernelLinuxSynchronizationSpinlockMCS lockqspinlock
OPPO Kernel Craftsman
Written by

OPPO Kernel Craftsman

Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.