Fundamentals 26 min read

In‑Depth Analysis of Linux CFS Task Placement and Energy Model (Part 2)

The article delves into Linux 5.10.61’s CFS task placement code, explaining the Energy‑Model framework, key data structures, and energy calculations, then walks through select_task_rq_fair’s wake‑up, fork, and exec paths, detailing EAS CPU selection, wake‑affine logic, fast and slow CPU‑selection algorithms, and how they balance performance with power on heterogeneous mobile platforms.

OPPO Kernel Craftsman
OPPO Kernel Craftsman
OPPO Kernel Craftsman
In‑Depth Analysis of Linux CFS Task Placement and Energy Model (Part 2)

This article is the second part of a three‑article series on CFS task load‑balancing. It focuses on the code flow of task placement, extending the first article’s framework description and the second article’s logical process.

The kernel code examined comes from Linux 5.10.61; non‑essential parts (e.g., NUMA handling) have been omitted for brevity. Readers interested in the full source can refer to the original kernel tree.

Energy‑Model Framework and Data Structures

On embedded platforms, hardware modules operate at multiple performance levels (frequency/voltage pairs). The EnergyModel (EM) framework provides a generic interface that lets drivers expose per‑level energy consumption to other subsystems, such as the CPU scheduler. Each CPU cluster forms a performance domain; the scheduler queries EM to obtain energy costs for each performance level.

The article introduces several key structures:

struct root_domain : groups CPUs for RT scheduling; on phones there is typically a single root domain.

struct perf_domain : represents a performance domain (e.g., a cluster of CPUs sharing a frequency policy).

struct em_perf_domain and struct em_perf_state : abstract performance domains and per‑level energy information used by the EM framework.

Energy Calculation Overview

Basic energy formulae are presented, followed by kernel‑specific refinements that ignore idle power. CPU runtime at a given frequency is derived from cpu utility / cpu current_capacity . The per‑state energy cost is stored in em_perf_state->cost , and the total domain energy is computed by aggregating per‑CPU contributions.

Task Placement Overview

Task placement occurs in three scenarios: wake‑up, fork, and exec. For wake‑up, the scheduler may use the Energy‑Aware Scheduler (EAS) if the system is not over‑utilized; otherwise, it falls back to the traditional path. Wake‑affine placement prefers CPUs close to the waker’s LLC domain. Fork and exec always use the traditional slow path (find_idlest_cpu).

select_task_rq_fair() Code Walkthrough

The function is the final entry point for all placement types. Its logic is divided into three sections:

EAS bypass handling for wake‑up paths.

Selection of the appropriate scheduling domain (fast path uses the LLC domain).

CPU selection, which may follow a fast or slow path.

The article details each sub‑step (A–E) and explains how the scheduler updates wake‑e information, invokes find_energy_efficient_cpu , and decides between the chosen CPU and the previous CPU.

EAS CPU Selection

find_energy_efficient_cpu first picks, for each performance domain, the CPU with the most idle capacity, then chooses the one with the lowest estimated energy consumption. The algorithm balances energy savings against potential performance loss (e.g., cache affinity). Over‑utilized states disable EAS.

Wake‑Affine Logic

The kernel records three fields in task_struct to track wake‑e information. record_wakee updates these fields, decaying wakee_flips each second. Wake‑affine is enabled when both the waker and wake‑e have low flip counts; otherwise, it is disabled to avoid excessive clustering.

Fast and Slow CPU Selection Paths

The fast path ( select_idle_sibling ) quickly picks an idle sibling within the LLC domain, applying simple checks for idle state, capacity, and cache locality. If no suitable idle CPU is found, the scheduler expands the search to the DIE domain.

The slow path ( find_idlest_cpu ) starts from a specified scheduling domain and recursively searches for the idlest group and then the idlest CPU within that group, considering affinity, load, and group‑level metrics.

Finding the Idlest Group and CPU

Functions find_idlest_group and find_idlest_group_cpu update group load, distinguish local vs. non‑local groups, and select the CPU based on idle status, depth of idle, and overall load.

Overall, the article provides a comprehensive, code‑level walkthrough of how Linux’s CFS scheduler performs task placement, integrates energy‑aware decisions, and balances performance versus power consumption on heterogeneous mobile platforms.

schedulerLinux kernelenergy modelEASperformance domaintask placement
OPPO Kernel Craftsman
Written by

OPPO Kernel Craftsman

Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.