
Understanding the PELT (Per‑Entity Load Tracking) Algorithm in the Linux Kernel Scheduler

PELT (Per‑Entity Load Tracking) replaces the coarse per‑run‑queue load model in Linux’s CFS scheduler with fine‑grained, decay‑based load and utility measurements for each scheduling entity, enabling more accurate load‑balancing, CPU‑frequency decisions, and hierarchical group scheduling across heterogeneous cores.

OPPO Kernel Craftsman

Linux is a general-purpose operating-system kernel that runs on everything from network servers to embedded devices. Designing a good process scheduler is challenging because it must balance competing goals: fairness, fast response, high throughput, and low power consumption.

The pre‑3.8 CFS scheduler used per‑run‑queue (per‑rq) load tracking, which is too coarse for modern scheduling needs. Starting with kernel 3.8, the kernel introduced the PELT (Per‑Entity Load Tracking) algorithm to provide finer‑grained load information.

Why per‑entity load tracking is needed

Accurate scheduling requires predicting each task’s future CPU demand. This demand has two aspects: task utilization (how much CPU capacity the task needs) and task load (how much load the task contributes to the system). Utilization helps decide which CPU (big or little core) a task should run on, while load drives load balancing across CPUs.

Before 3.8, the kernel only tracked the total load of each run‑queue, which did not reveal the contribution of individual tasks. PELT tracks load at the level of scheduling entities (tasks or groups of tasks), enabling more precise load‑balancing and CPU‑frequency decisions.

Basic method of the PELT algorithm

PELT divides time into 1024 µs windows. Within each window, an entity’s contribution to system load is proportional to the time it spends runnable (running or ready). The instantaneous load Li for an entity is calculated as:

Li = load_weight × (t / 1024)

where t is the runnable time within the window. The instantaneous utilization Ui is computed as:

Ui = Max_CPU_capacity × (t / 1024)

In mobile devices, the maximum CPU capacity of a big core is defined as 1024.
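The two formulas above can be sketched as a pair of helpers. This is an illustrative simplification, not kernel code: the names `inst_load` and `inst_util` are invented here, and the real kernel works in fixed-point arithmetic with sub-window accounting.

```c
#include <stdint.h>

/* PELT window length in microseconds. */
#define PELT_WINDOW_US 1024
/* Conventional capacity of the biggest core. */
#define MAX_CPU_CAPACITY 1024

/* Instantaneous load: the entity's load weight scaled by the
 * fraction of the 1024 us window it was runnable. */
static inline uint64_t inst_load(uint64_t load_weight, uint64_t runnable_us)
{
	return load_weight * runnable_us / PELT_WINDOW_US;
}

/* Instantaneous utilization: maximum CPU capacity scaled by the
 * same runnable fraction. */
static inline uint64_t inst_util(uint64_t runnable_us)
{
	return (uint64_t)MAX_CPU_CAPACITY * runnable_us / PELT_WINDOW_US;
}
```

With a load weight of 1024 (the default for nice 0), an entity runnable for half a window contributes 512 to both load and utilization.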

Because instantaneous values fluctuate rapidly, PELT applies exponential decay to past windows (a geometric series) to obtain a stable average load L:

L = L0 + L1·y + L2·y² + L3·y³ + …

Here Li is the instantaneous load in window i, and y is the decay factor, chosen so that y³² = 0.5: a window’s contribution halves after 32 windows (about 32 ms).

The decay‑based series allows simple computation: each update multiplies the previous total by y and adds the new L0 value.
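The incremental update described above can be sketched in a few lines. This is a floating-point illustration of the recurrence, with names invented here; the kernel implements the same series in integer arithmetic with a lookup table for powers of y.

```c
/* Decay factor y, chosen so that y^32 = 0.5: a window's
 * contribution halves after 32 windows. Value is 0.5^(1/32). */
static const double PELT_Y = 0.97857206208770013;

/* One PELT step: decay the running sum by y, then add the newest
 * window's instantaneous load L0. Equivalent to the full series
 * L0 + L1*y + L2*y^2 + ... without re-summing it. */
static double pelt_update(double sum, double l0)
{
	return sum * PELT_Y + l0;
}
```

Starting from a sum of 1024 and updating through 32 idle windows (L0 = 0 each time) leaves roughly 512, matching the 32-window half-life.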

Kernel scheduler fundamentals

The kernel represents a scheduling entity with struct sched_entity . A CFS run‑queue is represented by struct cfs_rq , which aggregates the load weights of all entities on that CPU. The time slice allocated to an entity is:

sched_slice = sched_period × (se->load.weight / cfs_rq->load.weight)
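The time-slice formula is simple enough to sketch directly. The function name and the fixed period below are illustrative; in the real kernel, sched_period is derived from sysctl_sched_latency and grows with the number of runnable tasks.

```c
#include <stdint.h>

/* Share of the scheduling period an entity receives, proportional
 * to its weight within the run-queue's total weight. */
static uint64_t sched_slice_ns(uint64_t period_ns,
			       uint64_t se_weight,
			       uint64_t cfs_rq_weight)
{
	return period_ns * se_weight / cfs_rq_weight;
}
```

For example, an entity with weight 1024 on a run-queue with total weight 3072 gets one third of a 6 ms period, i.e. 2 ms.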

Task groups (cgroups) are represented by struct task_group . A task group may contain multiple entities (task_se or group_se) and forms a hierarchical scheduling structure. When group scheduling is enabled, each group has its own per‑CPU cfs_rq , represented one level up by a group entity (group_se), and a shares value that determines the group’s total load weight.

Load weight ( struct load_weight ) stores the raw weight and a precomputed inverse weight for fast division. Average load is stored in struct sched_avg , which tracks load, runnable load, and utilization averages separately.
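Simplified sketches of the two structures look roughly like this. These are illustrative reductions; the real definitions in include/linux/sched.h carry more fields (timestamps, period contributions, and per-component sums).

```c
#include <stdint.h>

/* Sketch of struct load_weight: raw weight plus a precomputed
 * inverse (2^32 / weight) so the kernel can replace division by
 * multiplication and shift. */
struct load_weight {
	unsigned long weight;
	uint32_t inv_weight;
};

/* Sketch of struct sched_avg: decayed averages tracked per entity
 * and per cfs_rq. */
struct sched_avg {
	uint64_t last_update_time; /* when the averages were last decayed */
	unsigned long load_avg;    /* decayed load average */
	unsigned long runnable_avg;/* contribution from runnable time only */
	unsigned long util_avg;    /* decayed utilization average */
};
```

The inverse weight for the default nice-0 weight of 1024 is 2³² / 1024 = 4194304.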

How load weight and average load are determined

For a task entity, load weight is derived from its nice value. For a group entity, the weight is distributed among its CPUs based on the group’s share and the relative weight of each underlying cfs_rq . The kernel function calc_group_shares computes this distribution.
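The distribution rule can be sketched conceptually: a group's weight on one CPU is its total shares scaled by that CPU's portion of the group's load. The function name below is invented; the in-kernel calc_group_shares follows this shape but adds clamping and uses decayed load averages to avoid instability.

```c
#include <stdint.h>

/* Conceptual share distribution: tg_shares is the group's total
 * shares, cpu_cfs_rq_load is this CPU's portion of the group's
 * load, tg_total_load is the load summed over all CPUs. */
static uint64_t group_weight_on_cpu(uint64_t tg_shares,
				    uint64_t cpu_cfs_rq_load,
				    uint64_t tg_total_load)
{
	if (tg_total_load == 0)
		return tg_shares; /* no load yet: fall back to full shares */
	return tg_shares * cpu_cfs_rq_load / tg_total_load;
}
```

A group with 1024 shares whose load is split 512/1536 across two CPUs would get weight 256 on the first CPU and 768 on the second.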

Average load ( load_avg ) for an entity is updated by __update_load_avg_se . The kernel also tracks runnable load separately, because blocked tasks still contribute to total load but not to runnable load.

Propagation of load updates

When a new task entity is attached to the hierarchy, its load is propagated upward:

1. Determine the entity’s load and, if necessary, update it.

2. Update the load of the leaf cfs_rq and record the amount to propagate.

3. Update the load of the owning task group.

4. Move up one level and update the parent group’s load.

5. Continue propagating until the top‑level cfs_rq is reached.
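The upward walk can be sketched with a toy hierarchy. The structure and names here are illustrative only; the kernel's propagation goes through per-level group entities and propagates deltas for load, runnable load, and utilization separately.

```c
#include <stddef.h>

/* Toy run-queue: a load sum plus a pointer to the parent level
 * (NULL at the top-level cfs_rq). */
struct toy_cfs_rq {
	long load;
	struct toy_cfs_rq *parent;
};

/* Attach propagation: add the new entity's load at the leaf, then
 * at every ancestor up to the root, mirroring the steps above. */
static void propagate_attach(struct toy_cfs_rq *leaf, long delta)
{
	struct toy_cfs_rq *rq;
	for (rq = leaf; rq != NULL; rq = rq->parent)
		rq->load += delta;
}
```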

During each scheduler tick, entity_tick updates the average load of the entity and its run‑queue, then calls update_load_avg . If the entity is a group, additional propagation may be required.

Load updates on wake‑up and sleep

On wake‑up, if the entity was not on a run‑queue, enqueue_entity is called to add its load; otherwise only update_load_avg and update_cfs_group are invoked. The code path is in enqueue_task_fair .

On sleep, dequeue_task_fair (the inverse of enqueue) removes the runnable load weight from the run‑queue while keeping the blocked load, which continues to decay over time.
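The fate of the blocked load can be sketched numerically: after dequeue, the entity's contribution leaves the runnable load immediately but remains as blocked load, shrinking by the factor y every 1024 µs window until the task wakes. The function below is illustrative, not a kernel API.

```c
/* Same decay factor as the average-load series: 0.5^(1/32). */
static const double DECAY_Y = 0.97857206208770013;

/* Blocked load remaining after a number of 1024 us windows of
 * sleep: each window multiplies the contribution by y. */
static double blocked_load_after(double load, int windows)
{
	int i;
	for (i = 0; i < windows; i++)
		load *= DECAY_Y;
	return load;
}
```

A task that contributed 1024 and then sleeps for 32 ms (32 windows) still leaves about 512 of blocked load behind, which keeps load balancing from overreacting to short sleeps.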

References

1. Linux kernel source code, v5.4.28.

2. linux-5.4.28/Documentation/scheduler/*

Tags: kernel, scheduler, Linux, CFS, load tracking, PELT
Written by OPPO Kernel Craftsman

Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials