Understanding Multi‑core Hardware Architecture and Linux sched_domain
The article explains how Linux builds sched_domain and sched_group hierarchies based on physical CPU topology—sockets, dies, clusters, and NUMA nodes—illustrating load‑balancing (BALANCE) versus affinity (AFFINE) with concrete examples, kernel code references, and QEMU‑based experiments.
Linux organizes its scheduler using a hierarchical topology called sched_domain, which mirrors the physical layout of a multi‑core system (sockets, dies, clusters, NUMA nodes, and SMT threads). The article first draws an analogy to China's administrative divisions to illustrate how resources can be grouped at different levels.
Two main goals drive the scheduling decisions:
Load balancing (BALANCE): distribute tasks so that all CPUs, caches, memory, and bus bandwidth are fully utilized.
Affinity (AFFINE): keep tasks that communicate frequently on nearby CPUs to reduce cache‑coherency traffic and memory latency.
For a four‑CPU example (CPU0‑CPU3) where CPU0 and CPU1 share an L2 cache and CPU2 and CPU3 share another L2 cache, a pure BALANCE approach would scatter the four tasks (named after characters from "The Return of the Condor Heroes") across the CPUs, causing high inter‑task communication cost. Combining BALANCE with AFFINE leads to a placement that keeps tightly‑coupled tasks on the same cache domain.
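To make the AFFINE half concrete, here is a minimal user-space sketch that restricts a pair of communicating processes to the CPU0/CPU1 L2 domain from the example. The topology and the pairing are assumptions taken from the text, not probed at runtime:

```c
/* Minimal sketch: keep two communicating tasks on the CPU pair that
 * shares an L2 cache (CPU0/CPU1 in the article's example topology).
 * Build with: gcc -o affine affine.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Pin the calling process to the given set of CPUs. */
static void pin_to_cpus(const int *cpus, int n)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < n; i++)
        CPU_SET(cpus[i], &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    /* CPU0 and CPU1 share an L2 cache in the example; this is assumed. */
    int l2_domain[] = { 0, 1 };

    if (fork() == 0) {
        /* Child: one half of the communicating pair. */
        pin_to_cpus(l2_domain, 2);
        /* ... exchange data with the parent via a pipe or shared memory ... */
        _exit(0);
    }
    /* Parent: the other half, kept on the same L2 domain. */
    pin_to_cpus(l2_domain, 2);
    printf("both tasks restricted to CPUs 0-1 (shared L2)\n");
    return 0;
}
```

The scheduler is still free to balance within {CPU0, CPU1}, so BALANCE and AFFINE are combined rather than traded off entirely.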
The hardware hierarchy is described in detail (a sysfs-reading sketch follows this list):
Socket: a physical CPU package on the motherboard (e.g., a Kunpeng 920 system with two sockets).
Die: a silicon chip inside a package; each Kunpeng 920 die contains up to 32 cores.
MC (Multi-Core): the scheduler level grouping cores that share an L2/L3 cache.
Cluster: a small set of cores sharing a mid-level cache, sitting between SMT and MC in the hierarchy.
SMT: hyper-threading, where a physical core presents multiple logical CPUs.
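These layers can be inspected from user space through the kernel's CPU topology sysfs ABI. A minimal sketch for CPU0 (cluster_cpus_list and die_cpus_list only exist on newer kernels, so absent attributes are skipped):

```c
/* Minimal sketch: print the sibling masks that expose the socket/die/
 * cluster/SMT layers for CPU0 via sysfs. */
#include <stdio.h>

static void show(const char *label, const char *path)
{
    char buf[256];
    FILE *f = fopen(path, "r");
    if (!f)
        return; /* attribute absent on this kernel/architecture */
    if (fgets(buf, sizeof(buf), f))
        printf("%-18s %s", label, buf);
    fclose(f);
}

int main(void)
{
    const char *base = "/sys/devices/system/cpu/cpu0";
    char path[256];

    snprintf(path, sizeof(path), "%s/topology/package_cpus_list", base);
    show("socket siblings:", path);
    snprintf(path, sizeof(path), "%s/topology/die_cpus_list", base);
    show("die siblings:", path);
    snprintf(path, sizeof(path), "%s/topology/cluster_cpus_list", base);
    show("cluster siblings:", path);
    snprintf(path, sizeof(path), "%s/topology/thread_siblings_list", base);
    show("SMT siblings:", path);
    /* L2 sharing typically appears as cache index2's shared_cpu_list. */
    snprintf(path, sizeof(path), "%s/cache/index2/shared_cpu_list", base);
    show("L2 shared with:", path);
    return 0;
}
```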
Linux builds the topology in kernel/sched/topology.c via the function build_sched_domains(), which creates sched_domain and sched_group objects and attaches each CPU to the appropriate domain.
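As a rough mental model (a heavily simplified paraphrase, not the kernel's actual definitions, which live in include/linux/sched/topology.h and kernel/sched/sched.h), the linkage looks like this:

```c
/* Simplified paraphrase of the scheduler's hierarchy; most fields are
 * omitted. Each CPU points at its lowest-level domain; walking ->parent
 * climbs from SMT toward NUMA, and each domain's ->groups ring
 * partitions the CPUs that the domain spans. */
struct sched_group {
    struct sched_group *next;    /* circular list of groups in the domain */
    unsigned long cpumask[1];    /* CPUs covered by this group (simplified) */
};

struct sched_domain {
    struct sched_domain *parent; /* wider domain, e.g. MC above SMT */
    struct sched_domain *child;  /* narrower domain below */
    struct sched_group *groups;  /* balancing units inside this domain */
    int flags;                   /* SD_* behavior flags for this level */
    unsigned long span[1];       /* all CPUs this domain covers (simplified) */
};
```

Load balancing then runs level by level: within a domain, the scheduler compares the load of its sched_groups and migrates tasks between them, preferring to resolve imbalance at the lowest (cheapest) level first.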
Enabling detailed scheduler debugging requires setting CONFIG_SCHED_DEBUG=y. With sched_verbose on the kernel command line, the kernel prints the constructed hierarchy. An example output shows CPU0 belonging to three nested domains (SMT, MC, NUMA) and the root domain spanning all CPUs (CPU0‑CPU15).
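Reconstructed for illustration rather than copied from the article, such a boot-time dump has roughly this shape:

```
CPU0 attaching sched-domain(s):
 domain-0: span=0-1 level=SMT
  groups: 0:{ span=0 }, 1:{ span=1 }
  domain-1: span=0-7 level=MC
   groups: 0:{ span=0-1 }, 2:{ span=2-3 }, 4:{ span=4-5 }, 6:{ span=6-7 }
   domain-2: span=0-15 level=NUMA
    groups: 0:{ span=0-7 }, 8:{ span=8-15 }
```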
The article also demonstrates a QEMU‑based NUMA simulation: a virtual machine with two sockets, four cores per socket, and two SMT threads per core (16 logical CPUs in total), each NUMA node backed by 2 GB of DDR memory. The printed debug information confirms the expected domain spans.
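A QEMU invocation matching that description might look like the following sketch (disk, kernel, and console options are elided; exact flags vary across QEMU versions, and -enable-kvm assumes an ARM host):

```
qemu-system-aarch64 -machine virt -cpu host -enable-kvm \
    -m 4G -smp 16,sockets=2,cores=4,threads=2 \
    -object memory-backend-ram,id=mem0,size=2G \
    -object memory-backend-ram,id=mem1,size=2G \
    -numa node,nodeid=0,memdev=mem0,cpus=0-7 \
    -numa node,nodeid=1,memdev=mem1,cpus=8-15 \
    ...
```

Booting the guest with sched_verbose then lets the domain dump be compared directly against the topology declared on the command line.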
Affinity is enforced through cluster_sibling (mid‑level cache sharing) and llc_sibling (last‑level cache sharing), ensuring that tasks with frequent communication are placed on sibling CPUs. NUMA‑aware scheduling further avoids placing a task on a node while its memory resides on a remote node, reducing cross‑node memory latency. The kernel’s NUMA_BALANCING feature dynamically migrates tasks and memory across NUMA nodes to minimize such penalties.
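To observe the NUMA side from user space, a small sketch (assuming a glibc recent enough to expose getcpu()) can check the NUMA_BALANCING sysctl and report which node the calling thread is currently on:

```c
/* Minimal sketch: check the NUMA balancing knob and report where the
 * calling thread currently runs. /proc/sys/kernel/numa_balancing is the
 * standard sysctl; getcpu() returns the CPU and NUMA node the caller is
 * executing on at this instant. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    unsigned int cpu = 0, node = 0;
    char flag = '?';
    FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");

    if (f) {
        flag = fgetc(f);
        fclose(f);
    }
    if (getcpu(&cpu, &node) == 0)
        printf("numa_balancing=%c, running on CPU%u (NUMA node %u)\n",
               flag, cpu, node);
    return 0;
}
```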
Overall, the article provides a step‑by‑step walkthrough of how Linux maps hardware topology to scheduler structures, the rationale behind BALANCE vs. AFFINE decisions, and practical ways to observe and tune the behavior on real or simulated hardware.