
Load Balancing Mechanisms in the Linux CFS Scheduler: Periodic, NOHZ Idle, and New Idle Load Balancing

The article explains how the Linux CFS scheduler balances load using three mechanisms—periodic balancer on busy CPUs, NOHZ idle balancer that wakes idle CPUs in tickless mode, and the new idle balancer that checks overload and cache state—detailing their triggers, IPI interactions, timing intervals, and key data structures.

OPPO Kernel Craftsman

This article is the third part of a three‑article series on CFS task load balancing in the Linux kernel. The first part covered the framework, the second described two typical scenarios (task placement and active up‑migration), and this part presents the implementation details of load balancing in a Q&A style.

The Linux scheduler contains several kinds of load balancers:

Periodic balancer – runs on busy CPUs and periodically moves runnable tasks from overloaded domains, groups or CPUs to the current CPU.

Idle balancer – has two variants: the NOHZ idle balancer (used when the kernel is built with NOHZ/tickless mode) and the new idle balancer.

Periodic load balance is triggered on each scheduler tick. When the tick arrives, scheduler_tick() calls trigger_load_balance() , which eventually raises the SCHED_SOFTIRQ soft‑interrupt to execute run_rebalance_domains() . This mechanism only balances load among busy CPUs.
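The tick-side check can be pictured with a small user-space model. This is only a sketch: struct model_rq, its fields and the model_* functions are hypothetical names invented here, and the softirq is reduced to a boolean. It assumes the tick handler compares the current time against a per-runqueue next_balance deadline before raising the softirq.

```c
#include <stdbool.h>

/* Hypothetical user-space model of a runqueue; not the kernel's struct rq. */
struct model_rq {
    unsigned long next_balance;   /* tick count at which balancing is next due */
    bool softirq_pending;         /* stands in for a raised SCHED_SOFTIRQ */
};

/* Wrap-safe "a is at or after b", in the spirit of the kernel's time_after_eq(). */
static bool model_time_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Called once per simulated tick: mark the softirq pending only when balancing is due. */
static void model_trigger_load_balance(struct model_rq *rq, unsigned long now)
{
    if (model_time_after_eq(now, rq->next_balance))
        rq->softirq_pending = true;
}
```

The point of the split is that the tick itself stays cheap: it only decides whether balancing is due, and the expensive work is left to the deferred handler.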

NOHZ load balance is activated when a busy CPU (the “kicker”) detects that its runqueue is heavily loaded while other CPUs are idle in tickless mode. The kicker sets the nohz_flags of the chosen idle CPU’s runqueue (the “kickee”) and then sends it an IPI to wake it. The kickee handles the IPI in scheduler_ipi() , sees the flag, and raises SCHED_SOFTIRQ to run run_rebalance_domains() . The kickee then balances load on behalf of all idle CPUs as a group.

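New idle load balance is performed when a CPU has run out of tasks and is about to enter idle. Before idling, the CPU checks whether the system is overloaded (e.g., some CPU has multiple runnable tasks or a misfit task) and, if so, tries to pull tasks from busy CPUs onto itself. Unlike NOHZ balancing, where a busy CPU (the “kicker”) wakes an idle one, no other CPU is woken here. The work runs directly in the scheduling path ( newidle_balance() in 5.x kernels, called while picking the next task) rather than through SCHED_SOFTIRQ .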

The interaction between kicker and kickee consists of three steps:

Kicker marks the selected kickee’s runqueue nohz_flags .

An IPI is sent to wake the kickee.

Kickee processes the IPI, detects the flag, and triggers the soft‑interrupt that runs the load‑balancing code.
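The three steps above can be sketched as a tiny user-space model using C11 atomics. Everything here is hypothetical: struct model_cpu, MODEL_NOHZ_KICK and the model_* functions are illustration-only names, and the IPI and softirq are reduced to booleans. The sketch also assumes that a kick is skipped when the flag is already set, so that one pending request does not generate repeated IPIs.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NOHZ_KICK 0x1u   /* hypothetical flag bit, not the kernel's value */

struct model_cpu {
    atomic_uint nohz_flags;    /* written by the kicker, consumed by the kickee */
    bool ipi_pending;          /* stands in for the inter-processor interrupt */
    bool softirq_pending;      /* stands in for a raised SCHED_SOFTIRQ */
};

/* Steps 1 and 2: mark the kickee, then send the IPI unless a kick is already in flight. */
static bool model_kick(struct model_cpu *kickee)
{
    unsigned int old = atomic_fetch_or(&kickee->nohz_flags, MODEL_NOHZ_KICK);

    if (old & MODEL_NOHZ_KICK)
        return false;          /* flag was already set: no second IPI */
    kickee->ipi_pending = true;
    return true;
}

/* Step 3a: the kickee handles the IPI and raises the softirq if it was kicked. */
static void model_handle_ipi(struct model_cpu *self)
{
    self->ipi_pending = false;
    if (atomic_load(&self->nohz_flags) & MODEL_NOHZ_KICK)
        self->softirq_pending = true;
}

/* Step 3b: the softirq consumes the flag; returns true if idle balancing should run. */
static bool model_run_softirq(struct model_cpu *self)
{
    self->softirq_pending = false;
    unsigned int old = atomic_fetch_and(&self->nohz_flags, ~MODEL_NOHZ_KICK);

    return (old & MODEL_NOHZ_KICK) != 0;
}
```

Setting the flag with an atomic read-modify-write lets the kicker learn, in the same operation, whether another CPU has already requested the kick.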

Conditions that cause an active CPU to wake an idle CPU for NOHZ load balance include:

The active CPU’s runqueue has at least two runnable tasks.

There exist idle CPUs in tickless mode.

The rate limit has expired: kicks are throttled so that idle CPUs are not woken excessively.

The active CPU is heavily loaded by RT tasks or IRQ handling.

On heterogeneous systems, a misfit task may prompt waking a higher‑performance idle CPU.
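Read together, the conditions above amount to two preconditions plus a set of alternative load signals. The following predicate is a simplified sketch of that reading, not the kernel's actual decision function. struct model_kick_input and all of its fields are hypothetical, and each signal is assumed to be precomputed into a boolean or counter.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what the busy CPU knows at tick time. */
struct model_kick_input {
    unsigned int nr_running;      /* runnable tasks on this CPU's runqueue */
    unsigned int nr_idle_cpus;    /* CPUs currently idle in tickless mode */
    bool rate_limit_expired;      /* enough time has passed since the last kick */
    bool capacity_reduced;        /* heavy RT or IRQ pressure on this CPU */
    bool has_misfit_task;         /* a task too large for this CPU's capacity */
};

static bool model_should_kick(const struct model_kick_input *in)
{
    /* Preconditions: there is someone to wake, and it is not too soon to kick. */
    if (in->nr_idle_cpus == 0 || !in->rate_limit_expired)
        return false;

    /* Any single load signal is enough to justify the wake-up. */
    return in->nr_running >= 2 || in->capacity_reduced || in->has_misfit_task;
}
```

Checking the preconditions first keeps the common case cheap: when no CPU is tickless-idle, the remaining signals are never examined.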

To control the frequency of NOHZ idle balance, the scheduler maintains a global nohz structure containing nr_cpus , idle_cpus_mask , and next_balance . The next_balance timestamp is chosen as the smallest rq->next_balance among all idle CPUs, ensuring that the balancer runs only when needed.
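Choosing the smallest deadline among idle CPUs can be sketched as a simple scan. The names below (struct model_idle_rq, model_nohz_next_balance) are hypothetical, the idle mask is modelled as a per-entry boolean, and the fallback parameter is an assumption added so the function has a defined result when no CPU is idle.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU record; not the kernel's struct rq. */
struct model_idle_rq {
    bool tickless_idle;           /* would be one bit in idle_cpus_mask */
    unsigned long next_balance;   /* tick at which this CPU's balance is due */
};

/* Wrap-safe "a is strictly before b", in the spirit of the kernel's time_before(). */
static bool model_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Earliest next_balance among tickless-idle CPUs. If no CPU is idle,
 * the caller-supplied fallback is returned unchanged.
 */
static unsigned long model_nohz_next_balance(const struct model_idle_rq *rqs,
                                             size_t nr_cpus,
                                             unsigned long fallback)
{
    unsigned long earliest = fallback;
    bool found = false;

    for (size_t i = 0; i < nr_cpus; i++) {
        if (!rqs[i].tickless_idle)
            continue;             /* busy CPUs balance on their own tick */
        if (!found || model_time_before(rqs[i].next_balance, earliest)) {
            earliest = rqs[i].next_balance;
            found = true;
        }
    }
    return earliest;
}
```

For example, with idle CPUs due at ticks 140, 120 and 200 and one busy CPU due at 90, the result is 120: the busy CPU's deadline is ignored because it still receives its own ticks.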

New idle load balance also checks two main factors before executing:

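CPU cache state – if the CPU is expected to stay idle only briefly, its cache is probably still hot for the tasks that just ran on it, so pulling other tasks over may degrade performance and increase power consumption.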

System‑wide overload – determined by the overload flag in the root domain, which is set when a CPU has more than one runnable task or a single misfit task.

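If the CPU’s average idle time ( avg_idle ) is shorter than the cost already spent in the current balancing pass plus the domain’s recorded maximum new idle balance cost ( max_newidle_lb_cost ), balancing is skipped for that domain and the levels above it.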
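The cost comparison above can be written as a single predicate. This is a sketch under stated assumptions: model_newidle_worthwhile and its parameters are hypothetical names, all values are taken to be in nanoseconds, and the overload flag is passed in as a plain boolean.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a new idle balance attempt looks worthwhile:
 * the system is overloaded, and the CPU is expected to stay idle at
 * least as long as the balancing work is expected to take.
 */
static bool model_newidle_worthwhile(uint64_t avg_idle_ns,
                                     uint64_t cost_so_far_ns,
                                     uint64_t domain_cost_ns,
                                     bool overloaded)
{
    if (!overloaded)
        return false;             /* nothing to pull anywhere in the system */

    return avg_idle_ns >= cost_so_far_ns + domain_cost_ns;
}
```

With an average idle time of 500000 ns, 100000 ns already spent and an expected 300000 ns for the next domain, the attempt proceeds. With only 300000 ns of expected idle time it is skipped.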

The interval at which a sched domain performs load balancing depends on the domain level and current imbalance. On a typical 4+4 big.LITTLE mobile platform, the MC domain (the CPUs within one cluster) uses a min_interval of 4 ms and a max_interval of 8 ms, while the DIE domain (spanning both clusters) uses 8 ms to 16 ms. The actual balance_interval starts at the minimum and grows toward the maximum while the domain is found to be balanced.
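The 4 ms/8 ms and 8 ms/16 ms figures are what one gets if the minimum interval equals the number of CPUs in the domain and the maximum is twice that. The sketch below encodes that assumption, along with two further ones: the interval doubles each time the domain is found balanced, and it resets to the minimum once tasks are actually moved. struct model_domain and the model_* functions are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the interval fields of struct sched_domain. */
struct model_domain {
    unsigned int min_interval_ms;
    unsigned int max_interval_ms;
    unsigned int balance_interval_ms;
};

/* Assumed rule: min = number of CPUs in the domain, max = twice that. */
static struct model_domain model_domain_init(unsigned int nr_cpus)
{
    struct model_domain sd = {
        .min_interval_ms = nr_cpus,
        .max_interval_ms = 2 * nr_cpus,
        .balance_interval_ms = nr_cpus,   /* start at the minimum */
    };
    return sd;
}

/* Back off while balanced; return to the minimum once tasks were moved. */
static void model_after_balance(struct model_domain *sd, bool moved_tasks)
{
    if (moved_tasks) {
        sd->balance_interval_ms = sd->min_interval_ms;
        return;
    }
    if (sd->balance_interval_ms < sd->max_interval_ms) {
        sd->balance_interval_ms *= 2;
        if (sd->balance_interval_ms > sd->max_interval_ms)
            sd->balance_interval_ms = sd->max_interval_ms;
    }
}
```

Calling model_domain_init(4) gives the 4 ms to 8 ms range of a four-CPU domain, and model_domain_init(8) gives the 8 ms to 16 ms range of an eight-CPU domain.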

Key data structures involved in load balancing include struct sched_domain (which holds interval parameters and other balancing data) and the per‑runqueue fields used to track idle time, overload status, and the next balance point.

In summary, periodic and NOHZ idle load balancing are both triggered through the SCHED_SOFTIRQ soft‑interrupt, whose handler run_rebalance_domains() calls rebalance_domains() , while new idle load balancing runs directly on the CPU that is about to go idle. All three ultimately invoke load_balance() for each relevant sched domain. Detailed code analysis is deferred to future articles.
