Understanding Linux rwsem: Mechanisms, Data Structures, and Optimistic Spinning
Linux rwsem is a read‑write semaphore that lets multiple readers or a single writer hold the lock, using a count field, owner pointer, optimistic‑spin queue (osq), and protected wait list; it provides fast, medium, and slow acquisition paths with handoff and optimistic spinning to reduce sleep latency compared to mutexes and spinlocks.
Recently, many performance issues caused by D-state stalls were traced back to waiting on a mutex or rwsem lock in the kernel. While mutexes are simple (only one thread can hold the lock), rwsems are more complex because multiple threads may hold the lock simultaneously, leading to long sleep times for waiting threads. Moreover, rwsem lock hand‑off is only possible when a writer holds the lock; readers cannot hand off the lock.
The Linux kernel provides several synchronization primitives—semaphore, spinlock, mutex, RCU, atomic operations—to resolve resource contention in a preemptive, multi‑tasking environment. Each mechanism has its own implementation details and performance characteristics, so choosing the right one for a given scenario is crucial.
Mutex (mutual exclusion lock): a) Only one thread can hold the lock at any moment. b) If a thread fails to acquire the lock, it sleeps.
Spinlock : a) Only one thread can hold the lock at any moment. b) If acquisition fails, the thread spins (busy‑waits) instead of sleeping. Spinlocks avoid the scheduling overhead of sleep/wakeup but are unsuitable for long critical sections because they disable preemption or interrupts while holding the lock.
In interrupt contexts, sleeping is prohibited, so spinlocks must be used instead of mutexes.
rwsem (read‑write semaphore) behaves like a mutex but allows multiple concurrent readers while ensuring exclusive access for writers. Its key characteristics are:
a) Multiple readers can hold the lock simultaneously.
b) Only one writer can hold the lock, and writers are mutually exclusive.
c) Readers and writers cannot hold the lock at the same time.
d) On lock acquisition failure, the task sleeps.
The rwsem data structure (kernel 5.10) consists of four main fields:
count : Encodes the rwsem state using several bits (e.g., reader count, writer flag, waiters flag). Macros manipulate these bits.
owner : Records the task that currently holds the lock (meaningful only for writers). The low three bits of the task pointer are repurposed for flags such as RWSEM_FLAG_WAITERS .
osq (optimistic spin queue): When CONFIG_RWSEM_SPIN_ON_OWNER is enabled, rwsem supports optimistic spinning. osq holds an MCS lock queue; a non‑zero tail indicates pending optimistic spin tasks.
wait_list and wait_lock : When a thread cannot acquire the lock and optimistic spinning is not applicable, it is placed on wait_list . wait_lock (a spinlock) protects this list.
Optimistic Spinning (osq) allows a task to spin on the current owner instead of sleeping, reducing wake‑up latency. The osq works as a double‑linked list of per‑CPU MCS lock nodes. When the head of the osq acquires the lock, it hands the lock to the next node, preventing cache‑line bouncing.
Read‑lock acquisition paths :
Fast path : If no writer holds the lock and the NONSPINNABLE flag is clear, the reader acquires the lock immediately.
Medium path : Handles cases where the NONSPINNABLE flag is set or the reader count overflows. Functions such as rwsem_set_nonspinnable are invoked.
Slow path : The reader is placed on wait_list and sleeps until awakened by rwsem_mark_wake .
Write‑lock acquisition paths mirror the read paths:
Fast path : Direct atomic_long_try_cmpxchg_acquire succeeds only when the lock is completely free.
Medium path : Same implementation as the reader’s medium path.
Slow path : Involves the handoff mechanism ( RWSEM_FLAG_HANDOFF ) to prevent starvation of the first waiter in the wait list. If the flag is set, the kernel attempts an optimistic spin before sleeping.
The handoff flag is set in two main scenarios:
During the slow‑path wake‑up of a writer when the first waiter has waited too long.
When a writer, after being woken, finds the lock still held and wants to avoid being pre‑empted by an optimistic spinner.
Finally, the article provides a concise recommendation:
For a high‑level understanding, study the rwsem structure, the concept of osq, and the possible states during reader/writer interactions.
For deep comprehension, read the kernel source code (e.g., kernel/locking/rwsem.c in Linux 5.10) and trace the execution paths.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.