Fundamentals 19 min read

Understanding Linux Futex: Mechanism, APIs, and Priority Inheritance

Linux futexes provide a fast userspace lock backed by a kernel wait queue, using atomic operations when uncontended and system calls such as FUTEX_WAIT, FUTEX_WAKE, and priority‑inheritance variants to manage contention, requeueing, and priority inversion via rt‑mutex structures.

OPPO Kernel Craftsman
OPPO Kernel Craftsman
OPPO Kernel Craftsman
Understanding Linux Futex: Mechanism, APIs, and Priority Inheritance

Futex (Fast Userspace muTEX) is a kernel‑provided synchronization primitive introduced in Linux 2.5.7 by Rusty Russell, Hubertus Franke, and Mathew Kirkwood. It consists of a kernel‑space wait queue and a 32‑bit user‑space futex word (the same size on all architectures). In uncontended cases, lock acquisition and release are performed entirely in userspace via atomic operations, avoiding any kernel entry.

When contention occurs, a thread that cannot acquire the futex is placed on the kernel wait queue (incurring a system‑call). The owner, upon releasing the lock, checks the wait queue and wakes blocked tasks via a system call. If there is no contention, the kernel is never entered.

The futex system‑call interface is defined by several operation codes:

1. FUTEX_WAIT : Sleep if the futex word still holds the expected value.

2. FUTEX_WAKE : Wake up to val tasks waiting on the futex word (either 1 or INT_MAX).

3. FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET : Variants of WAIT and WAKE that include a mask to select specific waiters.

4. FUTEX_LOCK_PI and FUTEX_UNLOCK_PI : Operations for priority‑inheritance (PI) futexes.

5. FUTEX_REQUEUE and FUTEX_CMP_REQUEUE : Wake a number of waiters on one address and move the remaining waiters to another futex’s wait queue, optionally checking a compare value.

The article details the internal data structures: each waiting task has a futex_q object, and futexes are organized into hash buckets ( futex_hash_bucket ) based on a hash key derived from a three‑tuple (process address space, address, flags) for private futexes or (inode sequence, page index, offset) for shared futexes.

Futex wait flow :

1. If a timeout is supplied, a high‑resolution timer (hrtimer) is created.

2. Compute the futex hash key using the appropriate three‑tuple.

3. Locate the corresponding hash bucket and lock it.

4. Verify that the user‑space futex word still holds the expected value; otherwise return EWOULDBLOCK .

5. Insert the task’s futex_q into the bucket’s list, respecting priority (RT tasks are ordered by priority, CFS tasks use FIFO).

6. If a timeout is set, arm the hrtimer.

7. Re‑check the futex state before actually sleeping to avoid missed wake‑ups.

8. After waking (normal wake, timeout, or signal), clean up the timer and remove the task from the queue.

9. Handle spurious wake‑ups or signal interruptions, possibly returning ERESTARTSYS for the latter.

Futex wake flow :

1. Find the hash bucket via the same hash key.

2. Iterate the bucket’s list to locate matching futex_q objects (full three‑tuple match).

3. Apply optional bitset filtering (e.g., FUTEX_BITSET_MATCH_ANY ).

4. Move selected waiters to a temporary wake‑up queue while holding the bucket’s spin lock.

5. After scanning, invoke wake_up_q to wake each task.

Requeue operation moves waiters from one futex’s queue to another, reducing the number of system calls needed for typical wait‑notify patterns (e.g., Java’s monitor + condition variable).

Priority Inheritance (PI) futexes address priority inversion by temporarily boosting the priority of the lock owner to that of the highest‑priority waiter. The kernel tracks ownership and waiting tasks via rt_mutex_waiter and futex_pi_state structures, and maintains a red‑black tree of waiters sorted by priority.

RT mutex (priority‑inheritance aware mutex) underlies PI futexes. Each rt_mutex_waiter links a task to the mutex; the mutex’s waiters tree’s left‑most node is the top waiter. Tasks also keep a pi_waiters tree of all top waiters for locks they hold, enabling fast priority adjustments across the PI chain.

The article concludes that normal futexes use FUTEX_WAIT / FUTEX_WAKE , while PI futexes use FUTEX_LOCK_PI / FUTEX_UNLOCK_PI , delegating the complex PI handling to the underlying rt‑mutex implementation.

KernelLinuxSynchronizationFutexpriority inheritancert mutex
OPPO Kernel Craftsman
Written by

OPPO Kernel Craftsman

Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.