Which CPU Does a Woken Task Run On? Understanding wake_affine and select_idle_sibling
The article explains how the Linux kernel decides the CPU for a newly woken task, detailing the roles of wake_affine and select_idle_sibling, the influence of cache topology, run‑queue load, and various idle‑checking heuristics, with concrete examples and code snippets.
In the Linux kernel, when task A wakes task B, A is the "waker" and B the "wakee". The wakee must be placed on a CPU, and the kernel uses several heuristics to decide which CPU is optimal.
Because wake‑ups often involve communication (e.g., writing to a pipe, socket, or shared memory), the scheduler prefers to run the wakee on a CPU that is close to the waker or shares cache with it, improving hot‑cache hit probability. The decision also considers the idle status of the waker’s current CPU (this_cpu) and the wakee’s previous CPU (prev_cpu), as well as the topology (shared L2/L3 caches) and run‑queue load.
Example topology: three clusters (Wuchang, Hankou, Hanyang), each with four CPUs sharing an L2 cache, while all twelve CPUs share one L3 (the LLC). In kernel terms, two CPUs in the same cluster satisfy cpus_share_resources() (shared L2), and any two CPUs in the system satisfy cpus_share_cache() (shared LLC).
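To make the relationships concrete, here is a toy userspace model of this topology. share_l2() and share_llc() are illustrative stand-ins for what cpus_share_resources() and cpus_share_cache() would report on such a machine; they are not kernel APIs.

```c
#include <stdbool.h>
#include <stdio.h>

#define CPUS_PER_CLUSTER 4   /* Wuchang: 0-3, Hankou: 4-7, Hanyang: 8-11 */

/* Same cluster => same L2; the kernel reports this via cpus_share_resources(). */
static bool share_l2(int a, int b)
{
	return a / CPUS_PER_CLUSTER == b / CPUS_PER_CLUSTER;
}

/* Every CPU in this example shares the single L3, as cpus_share_cache() would report. */
static bool share_llc(int a, int b)
{
	(void)a;
	(void)b;
	return true;
}

int main(void)
{
	printf("CPU 1 & 2 share L2:  %d\n", share_l2(1, 2));   /* 1: both in Wuchang */
	printf("CPU 1 & 5 share L2:  %d\n", share_l2(1, 5));   /* 0: Wuchang vs Hankou */
	printf("CPU 1 & 5 share LLC: %d\n", share_llc(1, 5));  /* 1: same L3 */
	return 0;
}
```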
When A (running on this_cpu) wakes B (which last ran on prev_cpu), the kernel runs try_to_wake_up(), which calls select_task_rq(). For the fair scheduling class this dispatches to select_task_rq_fair(), which in turn calls wake_affine(). wake_affine() picks the rough target (new_cpu) as either the waker's CPU (this_cpu) or the wakee's previous CPU (prev_cpu).
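A condensed userspace toy of that two-step order: coarse_pick() and refine() are hypothetical stand-ins for wake_affine() and select_idle_sibling(), and the idle[] array fakes per-CPU run-queue state. Both steps are examined in detail below.

```c
#include <stdbool.h>
#include <stdio.h>

static bool idle[12] = { [2] = true, [6] = true };  /* toy state: CPUs 2 and 6 idle */

/* wake_affine()-style coarse choice between the waker's and the wakee's CPU. */
static int coarse_pick(int this_cpu, int prev_cpu)
{
	return idle[this_cpu] ? this_cpu : prev_cpu;
}

/* select_idle_sibling()-style refinement: the target first, then its cluster. */
static int refine(int target)
{
	if (idle[target])
		return target;
	int base = (target / 4) * 4;
	for (int cpu = base; cpu < base + 4; cpu++)
		if (idle[cpu])
			return cpu;
	return target;  /* nothing idle nearby: stay on the target */
}

int main(void)
{
	int this_cpu = 0, prev_cpu = 5;                 /* waker on Wuchang, wakee last on Hankou */
	int new_cpu = coarse_pick(this_cpu, prev_cpu);  /* CPU 0 busy -> prev_cpu (5) */
	printf("wakee placed on CPU %d\n", refine(new_cpu));  /* -> 6, an idle Hankou sibling */
	return 0;
}
```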
Inside wake_affine(), the helper wake_affine_idle() makes the first, idle-based decision. If this_cpu is idle (available_idle_cpu(this_cpu)), the wake-up most likely came from interrupt context; the wakee is then pulled toward this_cpu, but only when the two CPUs share cache, and an idle prev_cpu still wins so that no migration happens at all. A synchronous wake-up (WF_SYNC) with cpu_rq(this_cpu)->nr_running == 1 also picks this_cpu, since the waker is about to sleep and its CPU is about to go free. Failing both, an idle prev_cpu keeps the wakee where it was. Only after this coarse choice of new_cpu does the kernel call select_idle_sibling(p, prev_cpu, new_cpu) to refine the target to a nearby idle CPU.
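The whole helper is short. Here it is as a userspace paraphrase of wake_affine_idle() from kernel/sched/fair.c: cpu_is_idle(), caches_shared(), and rq_nr_running() are toy stand-ins for available_idle_cpu(), cpus_share_cache(), and cpu_rq()->nr_running, and -1 stands in for the kernel's "no decision" sentinel (nr_cpumask_bits).

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for kernel state (assumed, not kernel APIs). */
static bool cpu_is_idle(int cpu)        { return cpu == 5; }  /* pretend only CPU 5 is idle */
static bool caches_shared(int a, int b) { (void)a; (void)b; return true; }  /* one LLC */
static int  rq_nr_running(int cpu)      { return cpu_is_idle(cpu) ? 0 : 1; }  /* busy = 1 task */

static int wake_affine_idle(int this_cpu, int prev_cpu, bool sync)
{
	/* An idle this_cpu implies the wake-up came from interrupt context.
	 * Pull the wakee over only if the CPUs share cache, and still prefer
	 * an idle prev_cpu so no migration happens at all. */
	if (cpu_is_idle(this_cpu) && caches_shared(this_cpu, prev_cpu))
		return cpu_is_idle(prev_cpu) ? prev_cpu : this_cpu;

	/* Synchronous wake-up with the waker alone on its run queue: the
	 * waker is about to sleep, so this_cpu is effectively free. */
	if (sync && rq_nr_running(this_cpu) == 1)
		return this_cpu;

	if (cpu_is_idle(prev_cpu))
		return prev_cpu;

	return -1;  /* no idle-based decision; fall through to the load compare */
}

int main(void)
{
	/* Waker on CPU 0 (busy), wakee last on CPU 5 (idle), synchronous wake-up. */
	printf("%d\n", wake_affine_idle(0, 5, true));  /* -> 0: the sync path fires
	                                                  before prev_cpu's idleness */
	return 0;
}
```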
If the idle checks are inconclusive, wake_affine_weight() compares the two candidates by load, with a handicap built into the comparison: this_cpu's effective load is multiplied by 100, while prev_cpu's is multiplied by 100 + (sd->imbalance_pct - 100) / 2. Inflating prev_cpu's side makes this_cpu win whenever the loads are comparable, so the scheduler leans toward the waker's CPU.
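In numbers (a sketch with made-up loads; imbalance_pct = 117 matches a typical cache-sharing domain, giving prev_cpu a factor of 100 + (117 - 100) / 2 = 108, i.e. roughly an 8% handicap; the kernel's sync-load adjustment is ignored here):

```c
#include <stdio.h>

int main(void)
{
	/* Toy effective loads in arbitrary units; values are illustrative. */
	unsigned long this_load = 1024, prev_load = 980;
	unsigned int imbalance_pct = 117;

	unsigned long this_eff = this_load * 100;                               /* 102400 */
	unsigned long prev_eff = prev_load * (100 + (imbalance_pct - 100) / 2); /* 980*108 = 105840 */

	/* this_cpu wins unless its load is clearly (~8%+) higher than prev's:
	 * here this_cpu is picked even though its raw load is larger. */
	printf("pick %s\n", this_eff < prev_eff ? "this_cpu" : "prev_cpu");
	return 0;
}
```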
If none of select_idle_sibling()'s quick checks finds an idle CPU (the target itself, prev_cpu, and a few other candidates described below), it falls back to select_idle_cpu(), which scans for an idle CPU first within the target's own cluster (the CPUs sharing its L2) and then across the rest of the LLC domain. The scan order follows the hardware topology, so a hit stays as cache-close to the target as possible.
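A toy version of that ordering for the three-cluster example (cluster id = cpu / 4). The real select_idle_cpu() walks sched-domain cpumasks and applies a scan-cost cutoff, both of which this sketch omits.

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 12
static bool idle[NR_CPUS];

static int scan_for_idle(int target)
{
	int base = (target / 4) * 4;

	/* 1. CPUs sharing the target's L2 (its own cluster) first. */
	for (int cpu = base; cpu < base + 4; cpu++)
		if (idle[cpu])
			return cpu;

	/* 2. Then the rest of the LLC domain (the other two clusters). */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu / 4 != target / 4 && idle[cpu])
			return cpu;

	return -1;  /* nothing idle in the LLC: the caller keeps the target */
}

int main(void)
{
	idle[9] = true;  /* only CPU 9 (Hanyang) is idle */
	printf("idle pick for target 1: CPU %d\n", scan_for_idle(1));  /* -> 9 */
	return 0;
}
```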
One of those quick checks is recent_used_cpu, the CPU the wakee ran on before landing on prev_cpu: if it is distinct from both prev_cpu and the target, shares cache with the target, is idle, and is allowed by the task's affinity mask, the wakee goes there directly.
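Paraphrased as a predicate (toy helpers again; the real code in select_idle_sibling() additionally tests the task's affinity mask, p->cpus_ptr):

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for kernel state (assumed, not kernel APIs). */
static bool cpu_is_idle(int cpu)        { return cpu == 3; }  /* pretend CPU 3 is idle */
static bool caches_shared(int a, int b) { (void)a; (void)b; return true; }  /* one LLC */

/* The recent_used_cpu shortcut, paraphrased:
 * 'recent' is the CPU the wakee ran on before prev_cpu. */
static bool recent_cpu_usable(int recent, int prev, int target)
{
	return recent != prev && recent != target &&
	       caches_shared(recent, target) &&
	       cpu_is_idle(recent);
}

int main(void)
{
	printf("%d\n", recent_cpu_usable(3, 5, 0));  /* 1: idle, cache-affine, distinct */
	return 0;
}
```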
When many tasks wake each other in a many-to-many pattern, pulling them all into the same small sched_domain causes congestion. The wake_wide() heuristic detects this from the waker's and wakee's wakee_flips counters (how often each switches wake-up partners), and when the pattern looks wide it skips wake_affine so the tasks spread out instead of over-concentrating.
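A sketch of that detector, modeled on wake_wide() in kernel/sched/fair.c: wakee_flips counts (with decay, not shown) how often a task switches which task it wakes, and llc_size is the number of CPUs in the LLC domain.

```c
#include <stdbool.h>
#include <stdio.h>

/* Returns true when the wake-up pattern looks many-to-many, in which case
 * the kernel skips wake_affine and lets the tasks spread out. */
static bool wake_wide(unsigned int waker_flips, unsigned int wakee_flips,
		      unsigned int llc_size)
{
	unsigned int master = waker_flips, slave = wakee_flips;

	if (master < slave) {  /* order the pair so master >= slave */
		unsigned int tmp = master;
		master = slave;
		slave = tmp;
	}

	/* Wide only if both sides flip partners often relative to LLC size. */
	return slave >= llc_size && master >= slave * llc_size;
}

int main(void)
{
	printf("1:1 pipe pair: %d\n", wake_wide(0, 0, 12));     /* 0: stay affine */
	printf("server + many: %d\n", wake_wide(200, 15, 12));  /* 1: go wide */
	return 0;
}
```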
In summary, for a wake-up from a task on Wuchang (A) to a task on Hankou (B), the scheduler tends either to migrate B toward Wuchang or to keep it on Hankou, scanning the local cluster first and then the other two. Normal wake-ups rarely push a task out to distant nodes (e.g., Jingshou or Xiangyang); that happens mainly on the WF_EXEC and WF_FORK paths, which take the slow-path placement routine sched_balance_find_dst_cpu() instead.
