Understanding Multi‑Core Hardware Topology and Linux sched_domain
This article explains how the Linux kernel scheduler maps real multi-core hardware structures (sockets, dies, clusters, cores, and NUMA nodes) onto a hierarchy of sched_domain and sched_group structures, how that hierarchy balances load while preserving cache affinity, and how to inspect and tune the resulting layout with CONFIG_SCHED_DEBUG and a QEMU-simulated topology.
Linux scheduling treats the physical layout of a multi‑core system as a hierarchical topology, similar to how geographic regions form a map. Two key goals are load balancing (BALANCE) – spreading tasks across all resources – and affinity (AFFINE) – keeping tightly communicating tasks on nearby cores to reduce cache‑coherency latency.
Using a four-CPU example, the article first shows a naïve BALANCE-only placement in which tasks are scattered across all cores, then an optimized layout that combines BALANCE and AFFINE so that heavily communicating tasks stay within the same cache domain.
The hardware hierarchy is described in detail:
Socket: a physical CPU package; a board may have multiple sockets.
Die: a silicon chip inside a package; a Kunpeng 920 package contains two dies, each with up to 32 cores.
MC (Multi-Core): a group of cores that share a common last-level cache (L2 or L3, depending on the part); on Kunpeng 920 each die forms an MC domain.
Cluster: a smaller subset of cores inside an MC domain that share an intermediate cache resource closer to the cores, such as an L2 cache.
Core: a physical execution unit; with SMT (hyper-threading), one core presents multiple logical CPUs (hardware threads) to the scheduler.
Different CPUs expose different hierarchies; on Intel x86 Jacobsville processors, for example, every four cores form a module (cluster) that shares an L2 cache.
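On a running system this hierarchy can be read back from sysfs. The following is a minimal sketch (not from the original article) that walks /sys/devices/system/cpu/cpuN/topology/ and prints each CPU's socket, die, cluster, and core IDs plus its SMT siblings; die_id and cluster_id only exist on newer kernels, so missing attributes are reported as "n/a".

```c
/* topo_dump.c - print the CPU topology exported by sysfs.
 * Illustrative sketch: the paths are standard sysfs topology files, but
 * die_id/cluster_id are absent on older kernels, so failures are tolerated.
 */
#include <stdio.h>
#include <string.h>

/* Read one sysfs topology attribute for a CPU into buf; 0 on success. */
static int read_attr(int cpu, const char *attr, char *buf, size_t len)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, attr);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f))
        buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

int main(void)
{
    char pkg[64], die[64], cls[64], core[64], smt[256];

    for (int cpu = 0; ; cpu++) {
        /* package_id exists for every online CPU; stop when it disappears. */
        if (read_attr(cpu, "physical_package_id", pkg, sizeof(pkg)))
            break;
        if (read_attr(cpu, "die_id", die, sizeof(die)))
            strcpy(die, "n/a");
        if (read_attr(cpu, "cluster_id", cls, sizeof(cls)))
            strcpy(cls, "n/a");
        if (read_attr(cpu, "core_id", core, sizeof(core)))
            strcpy(core, "n/a");
        if (read_attr(cpu, "thread_siblings_list", smt, sizeof(smt)))
            strcpy(smt, "n/a");

        printf("cpu%-3d socket=%s die=%s cluster=%s core=%s smt_siblings=%s\n",
               cpu, pkg, die, cls, core, smt);
    }
    return 0;
}
```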
In the kernel, build_sched_domains() (found in kernel/sched/topology.c) builds sched_domain and sched_group structures according to sched_domain_topology_level. A sched_domain represents a level of the hierarchy (e.g., NUMA node, MC, cluster, core, SMT), while a sched_group is the load‑balancing unit inside a domain.
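To make the relationship concrete, here is a toy userspace model (my own simplification, not the kernel's code) of how per-level span functions, in the spirit of sched_domain_topology_level, would describe the article's 2-socket, 4-core, 2-thread example. It assumes SMT siblings are numbered consecutively and collapses the hierarchy to SMT, MC, and package levels.

```c
/* sd_model.c - toy model of per-level CPU spans for a
 * 2-socket x 4-core x 2-SMT machine (16 logical CPUs).
 * Purely illustrative; the real structures live in kernel/sched/topology.c.
 */
#include <stdio.h>

#define NR_CPUS          16
#define THREADS_PER_CORE  2
#define CPUS_PER_SOCKET   8

typedef unsigned int cpumask_t;               /* one bit per CPU */

/* Span functions: which CPUs does `cpu` share this level with? */
static cpumask_t smt_mask(int cpu)            /* same physical core */
{
    return (cpumask_t)0x3 << (cpu - cpu % THREADS_PER_CORE);
}

static cpumask_t mc_mask(int cpu)             /* same socket / shared LLC */
{
    return (cpumask_t)0xff << (cpu - cpu % CPUS_PER_SOCKET);
}

static cpumask_t pkg_mask(int cpu)            /* whole machine */
{
    (void)cpu;
    return (cpumask_t)0xffff;
}

struct topology_level {                       /* cf. sched_domain_topology_level */
    const char *name;
    cpumask_t (*mask)(int cpu);
};

static const struct topology_level levels[] = {
    { "SMT", smt_mask },
    { "MC",  mc_mask  },
    { "PKG", pkg_mask },
};

static void print_mask(cpumask_t m)
{
    int first = 1;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (m & (1u << cpu)) {
            printf(first ? "%d" : ",%d", cpu);
            first = 0;
        }
}

int main(void)
{
    /* For each CPU, print its span at every level, bottom-up: the same
     * shape the kernel reports for its domains with sched_verbose. */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        printf("CPU%-2d:", cpu);
        for (unsigned i = 0; i < sizeof(levels) / sizeof(levels[0]); i++) {
            printf("  %s={", levels[i].name);
            print_mask(levels[i].mask(cpu));
            printf("}");
        }
        printf("\n");
    }
    return 0;
}
```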
Building the kernel with CONFIG_SCHED_DEBUG=y and passing sched_verbose on the kernel command line makes the kernel print the constructed topology at boot. The article demonstrates this with a QEMU-emulated NUMA system (2 sockets, 4 cores per socket, 2 SMT threads per core, 2 GB of DDR per node) and shows the printed spans for each level, up to the root domain covering CPUs 0-15.
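Besides the boot-time dump triggered by sched_verbose, the domains the kernel actually built can be read back at runtime. Below is a hedged sketch, assuming CONFIG_SCHED_DEBUG=y and debugfs mounted at /sys/kernel/debug (older kernels exposed the same data under /proc/sys/kernel/sched_domain instead).

```c
/* sd_inspect.c - print the name and flags of each scheduling domain
 * above CPU 0, as exposed by CONFIG_SCHED_DEBUG via debugfs.
 * Typically requires root; the directory layout may differ on
 * older kernels.
 */
#include <stdio.h>
#include <string.h>

static int read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f))
        buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

int main(void)
{
    char path[128], name[64], flags[512];

    /* domain0 is the lowest level (e.g. SMT); walk upwards until a
     * directory is missing. */
    for (int level = 0; ; level++) {
        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/sched/domains/cpu0/domain%d/name", level);
        if (read_line(path, name, sizeof(name)))
            break;

        snprintf(path, sizeof(path),
                 "/sys/kernel/debug/sched/domains/cpu0/domain%d/flags", level);
        if (read_line(path, flags, sizeof(flags)))
            flags[0] = '\0';

        printf("domain%d: %s\n  flags: %s\n", level, name, flags);
    }
    return 0;
}
```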
Cache affinity is described by the cluster_sibling (shared mid-level cache) and llc_sibling (shared last-level cache) CPU masks, which let the scheduler place frequently communicating tasks close together. NUMA-aware scheduling is handled by NUMA_BALANCING, which migrates tasks or memory across NUMA nodes to minimize cross-node memory latency.
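Whether automatic NUMA balancing is currently active can be checked through the kernel.numa_balancing sysctl; here is a small sketch, assuming a kernel built with CONFIG_NUMA_BALANCING.

```c
/* numa_balancing_check.c - report whether automatic NUMA balancing
 * (CONFIG_NUMA_BALANCING) is enabled.  The sysctl can be toggled as
 * root by writing 0 or 1 to /proc/sys/kernel/numa_balancing.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
    int mode;

    if (!f) {
        /* File missing: kernel built without CONFIG_NUMA_BALANCING. */
        puts("numa_balancing: not supported by this kernel");
        return 1;
    }
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);

    printf("numa_balancing: %s (%d)\n",
           mode > 0 ? "enabled" : mode == 0 ? "disabled" : "unknown", mode);
    return 0;
}
```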
Images illustrate the geographic analogy, hardware hierarchy, QEMU topology, and kernel debug output.