Inside Linux 5.4 Scheduler: How the Kernel Initializes CFS and Multi‑Core Scheduling
This article explains the Linux kernel scheduler's core concepts, walks through the initialization code of the 5.4 kernel (including run queues, scheduling classes, domains, and groups), and details how multi‑core and NUMA topologies are handled to achieve balanced CPU usage.
Basic Concepts
Run queue (rq)
Scheduling class (sched_class)
Scheduling domain (sched_domain)
Scheduling group (sched_group)
Root domain (root_domain)
Group scheduling (group_sched)
Run queue (rq)
Each CPU has its own run queue. All runnable tasks are placed on the per‑CPU rq, and the scheduler selects tasks from the rq according to the active scheduling class.
Scheduling class (sched_class)
The kernel abstracts scheduling policies into sched_class structures, decoupling common scheduler mechanisms from class‑specific algorithms. Adding a new class requires only a few function implementations.
stop_sched_class : highest‑priority class used for emergency tasks such as active balancing; only the migration thread may run with this class.
dl_sched_class : real‑time deadline scheduler based on the Earliest Deadline First (EDF) algorithm, ranked just below stop_sched_class.
rt_sched_class : priority‑based real‑time scheduler, lower in priority than the deadline class.
fair_sched_class : default class for the Completely Fair Scheduler (CFS), providing fair time‑sharing among tasks.
idle_sched_class : used by the idle (swapper) thread to put the CPU into low‑power states via cpuidle/nohz.
Scheduling domain (sched_domain)
Introduced in kernel 2.6, multi‑level domains map the hardware topology (SMT, core, socket, NUMA) to enable load‑balancing that respects cache hierarchies and NUMA latency.
Scheduling group (sched_group)
Works with sched_domain to perform load‑balancing across groups of CPUs, e.g., a NUMA node or a socket.
Root domain (root_domain)
Handles load‑balancing for real‑time classes (dl and rt). By default all CPUs belong to a single root domain unless isolation or cpuset changes are applied.
Group scheduling (group_sched)
Implemented via cgroup, allowing administrators to control CPU bandwidth and shares for a set of tasks, similar to container resource management.
Scheduler Initialization (sched_init)
The initialization sequence starts after memory setup, within start_kernel. sched_init creates per‑CPU run queues, configures global rt/dl bandwidth, registers the CFS soft‑irq, and sets up group scheduling structures.
void __init sched_init(void)
{
unsigned long ptr = 0;
int i;
/* Initialize global rt and dl bandwidth control */
init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth, global_rt_period(), global_rt_runtime());
#ifdef CONFIG_SMP
/* Initialize default root domain for real‑time load‑balancing */
init_defrootdomain();
#endif
#ifdef CONFIG_RT_GROUP_SCHED
init_rt_bandwidth(&root_task_group.rt_bandwidth, global_rt_period(), global_rt_runtime());
#endif
/* Initialize each CPU's run queue */
for_each_possible_cpu(i) {
struct rq *rq;
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
#ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
#endif
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
#ifdef CONFIG_RT_GROUP_SCHED
init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL);
#endif
#ifdef CONFIG_SMP
rq_attach_root(rq, &def_root_domain);
#endif
}
init_sched_fair_class();
scheduler_running = 1;
}Multi‑core Scheduler Initialization (sched_init_smp)
Before initializing SMP domains, all secondary CPUs are brought up. The scheduler then builds hierarchical domains that mirror the physical topology (SMT, MC, NUMA).
Scheduling Domain Implementation
Each level of the hardware topology becomes a sched_domain. Load‑balancing selects the busiest sched_group within a domain and migrates tasks to a less loaded group, respecting cache affinity and NUMA costs.
Physical Topology Example
A 4‑core, 8‑thread NUMA system (2 sockets, each with 2 cores, each core with 2 hyper‑threads) illustrates three levels:
SMT domain – the two hyper‑threads of a core.
MC domain – the two cores of a socket sharing the last‑level cache.
NUMA domain – the two sockets.
Domain Construction Functions
static int build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
enum s_alloc alloc_state = sa_none;
struct sched_domain *sd;
struct s_data d;
struct rq *rq = NULL;
int i, ret = -ENOMEM;
struct sched_domain_topology_level *tl_asym;
bool has_asym = false;
if (WARN_ON(cpumask_empty(cpu_map)))
goto error;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
goto error;
tl_asym = asym_cpu_capacity_level(cpu_map);
for_each_cpu(i, cpu_map) {
struct sched_domain_topology_level *tl;
sd = NULL;
for_each_sd_topology(tl) {
int dflags = 0;
if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
has_asym = true;
}
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
if (tl == sched_domain_topology)
*per_cpu_ptr(d.sd, i) = sd;
if (tl->flags & SDTL_OVERLAP)
sd->flags |= SD_OVERLAP;
if (cpumask_equal(cpu_map, sched_domain_span(sd)))
break;
}
}
/* Build groups for each domain */
for_each_cpu(i, cpu_map) {
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
sd->span_weight = cpumask_weight(sched_domain_span(sd));
if (sd->flags & SD_OVERLAP) {
if (build_overlap_sched_groups(sd, i))
goto error;
} else {
if (build_sched_groups(sd, i))
goto error;
}
}
}
/* Calculate CPU capacity for packages and nodes */
for (i = nr_cpumask_bits - 1; i >= 0; i--) {
if (!cpumask_test_cpu(i, cpu_map))
continue;
for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
claim_allocations(i, sd);
init_sched_groups_capacity(i, sd);
}
}
rcu_read_lock();
for_each_cpu(i, cpu_map) {
rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
cpu_attach_domain(sd, d.rd, i);
}
rcu_read_unlock();
if (has_asym)
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
ret = 0;
error:
__free_domain_allocs(&d, alloc_state, cpu_map);
return ret;
}Conclusion
The article covered the fundamental concepts of the Linux kernel scheduler and demonstrated, through the 5.4 source code, how run queues, scheduling classes, domains, and groups are instantiated and linked. The overall design has remained stable across kernel versions, reflecting a mature and elegant scheduling architecture.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
