
Linux Scheduler Realities: Tick Granularity, Event‑Driven Design, and Big‑Little Trade‑offs

The article examines Linux’s CFS scheduler, explaining why its 4 ms tick (250 Hz) defines task switching intervals, how event‑driven mechanisms like task wake and IRQ updates shape scheduling, and compares PELT and WALT load‑tracking methods, especially under heterogeneous big‑little CPU architectures.


Ideal vs. reality – Earlier we noted that a thread with priority 100 competing with a thread of priority 120 does not achieve the expected fair resource split. A quick look at a systrace capture shows an approximate share of 24 : 1856, a rough figure used only for illustration; the article does not delve into the exact conversion from priority to weight or from weight to resource share.

When both threads have priority 120, the observed resource ratio is 1 : 1, with the scheduler rotating the threads every 4 ms.

Why 4 ms? On a kernel built with the common CONFIG_HZ=250 setting, the timer tick fires every 4 ms, and that tick bounds how finely the scheduler can act. Proportional‑share scheduling has a full mathematical model dating back to a 1996 paper, and the EEVDF scheduler introduced in kernel 6.6 builds directly on it. However, engineering constraints prevent a direct translation of the theory into practice.
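As a minimal sketch of the tick arithmetic (HZ here stands in for the kernel's CONFIG_HZ build option, and 250 is only one common setting):

#include <stdio.h>

#define HZ 250  /* ticks per second; stands in for CONFIG_HZ */

int main(void)
{
    /* tick period in milliseconds = 1000 / HZ */
    printf("tick period at %d Hz: %d ms\n", HZ, 1000 / HZ); /* 4 ms */
    printf("tick period at 1000 Hz: %d ms\n", 1000 / 1000); /* 1 ms: finer, but more overhead */
    return 0;
}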

The CFS scheduler selects the next task via pick_next_task_fair(), which takes the leftmost node of a red‑black tree ordered by virtual runtime. It knows the virtual runtime of the currently running thread (V1) and that of the leftmost candidate (V2). As the running thread executes, V1 continuously increases. The scheduler does not preempt the moment V1 exceeds V2, because such immediate preemption would incur prohibitive overhead.
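For illustration, a minimal userspace sketch of that selection, using a plain array as a stand‑in for the kernel's red‑black tree (the names here are illustrative, not the kernel's):

#include <stdio.h>

/* Illustrative stand-in for a schedulable entity. */
struct entity {
    const char *name;
    unsigned long long vruntime; /* weighted virtual runtime, in ns */
};

/* Pick the runnable entity with the smallest vruntime. CFS keeps
 * entities in a red-black tree ordered by vruntime, so this is the
 * leftmost node; a linear scan is used here purely for clarity. */
static struct entity *pick_next(struct entity *rq, int n)
{
    struct entity *best = &rq[0];
    for (int i = 1; i < n; i++)
        if (rq[i].vruntime < best->vruntime)
            best = &rq[i];
    return best;
}

int main(void)
{
    struct entity rq[] = { { "A", 1200 }, { "B", 950 }, { "C", 1100 } };
    printf("next: %s\n", pick_next(rq, 3)->name); /* prints "B" */
    return 0;
}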

From a throughput perspective, the scheduler itself consumes CPU cycles; therefore, higher throughput is achieved when the scheduler intervenes as little as possible.

Event‑driven scheduler – The scheduler lacks a “god‑view” and instead relies on discrete events to drive its decisions. The event types are enumerated below; the two most common are discussed afterward:

/* Event names as used by the scheduler's load-tracking code; each
 * event gives the scheduler a chance to update state and decide. */
const char *task_event_names[] = {
    "PUT_PREV_TASK",
    "PICK_NEXT_TASK",
    "TASK_WAKE",
    "TASK_MIGRATE",
    "TASK_UPDATE",
    "IRQ_UPDATE"
};

Task wake : When thread A wakes thread B, the scheduler checks whether B should preempt the currently running task. If B’s virtual runtime is only marginally smaller than the running thread’s, the scheduler may decide not to preempt, allowing the current thread to continue.
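A hedged sketch of that decision: preempt only when the waking task's vruntime lead exceeds a granularity threshold, so a marginal difference does not force a context switch. The 1 ms constant is an assumed value; the kernel's wakeup‑granularity logic additionally scales the threshold by task weight:

#include <stdio.h>

/* Assumed threshold; not the kernel's exact default. */
#define WAKEUP_GRANULARITY_NS 1000000ULL /* 1 ms */

/* Return 1 if waking task B should preempt the running task A:
 * B's vruntime must trail A's by more than the granularity. */
static int should_preempt(unsigned long long curr_vruntime,
                          unsigned long long waking_vruntime)
{
    return curr_vruntime > waking_vruntime &&
           curr_vruntime - waking_vruntime > WAKEUP_GRANULARITY_NS;
}

int main(void)
{
    printf("%d\n", should_preempt(5000000, 4900000)); /* 0: lead too small */
    printf("%d\n", should_preempt(5000000, 3000000)); /* 1: preempt */
    return 0;
}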

IRQ update : The kernel tick, an architectural timer interrupt that fires every 4 ms (at 250 Hz), prompts the scheduler to update thread states and make scheduling decisions.

Because scheduling is event‑driven, the virtual runtime curve exhibits discrete jumps rather than a smooth line. The article includes a diagram (ideal continuous line vs. actual stepped line) illustrating this effect.

Increasing the tick frequency (e.g., to 1000 Hz) would improve precision but also increase overhead, reducing overall system throughput.

Metrics – Two key metrics are discussed:

Throughput : the amount of useful work completed; higher is better, and the scheduler should intervene minimally.

Latency : the time from a thread being woken to it acquiring CPU resources; lower is better for perceived responsiveness.

These metrics are often at odds: maximizing throughput favors fewer scheduler interventions, while minimizing latency requires more frequent monitoring.

Big‑Little (heterogeneous) CPUs – The article shifts to heterogeneous multi‑processor (HMP) designs, where performance cores and efficiency cores have different clock sources and performance‑power curves. Determining whether a “big” task should run on a big core and a “small” task on a little core is non‑trivial. Two load‑tracking models are presented:

PELT (per‑entity load tracking)

PELT accumulates load over time with exponential decay: the contribution from i periods ago is weighted by y^i, where y^32 = 0.5 (so y ≈ 0.9786 and the load halves every 32 periods). Recent activity therefore carries the most weight. Even when a CPU is idle, its PELT load stays non‑zero for a while, because the decay takes time to approach zero, and this residual load still influences cluster‑wide frequency scaling.
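A rough numeric sketch of that decay, with y derived from y^32 = 0.5 (the kernel uses fixed‑point arithmetic over 1024 µs periods; both details are simplified away here):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* y^32 = 0.5: the tracked load halves every 32 idle periods. */
    double y = pow(0.5, 1.0 / 32.0); /* ~0.97857 */
    double load = 0.0;

    /* Run flat out: each period contributes 1 and older contributions
     * decay by y per step, so load approaches 1 / (1 - y) ~= 46.7. */
    for (int i = 0; i < 256; i++)
        load = y * load + 1.0;
    printf("while running: %.1f\n", load);          /* ~46.5 */

    /* Go idle: the load decays but only halves every 32 periods,
     * which is why an idle CPU still reports non-zero PELT load. */
    for (int i = 0; i < 32; i++)
        load *= y;
    printf("after 32 idle periods: %.1f\n", load);  /* ~23.2 */
    return 0;
}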

WALT (Window Assisted Load Tracking)

WALT divides time into aligned windows and measures the proportion of time a thread runs within each window (see the sketch at the end of this section). This yields faster load accumulation, which is beneficial for interactive workloads (e.g., the 16.7 ms frame budget of a 60 Hz screen).

However, window boundaries can split a single logical task across two windows, leading to an inaccurate load picture for non‑periodic threads.

Both methods require a statistical observation period and struggle with sudden load spikes, as illustrated by a diagram showing a load‑tracking delay, caused by an APK implementation issue, that resulted in noticeable UI stutter.

Finally, the article notes that while PELT and WALT can help approximate task size, the introduction of big‑little architectures adds further complexity to scheduling decisions, setting the stage for a future discussion of the Energy‑Aware Scheduler (EAS).
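As referenced above, a simplified sketch of window accounting (the 20 ms window size and all names are assumptions for illustration; the real implementation also handles runs that straddle windows, task migration, and frequency scaling):

#include <stdio.h>

#define WINDOW_NS 20000000ULL /* assumed 20 ms window */

/* Per-task accounting for the current window (illustrative). */
struct window_stats {
    unsigned long long window_start; /* ns */
    unsigned long long busy_ns;      /* runtime observed this window */
};

/* Account a run interval [start, end) that fits inside one window. */
static void account_run(struct window_stats *ws,
                        unsigned long long start, unsigned long long end)
{
    while (start - ws->window_start >= WINDOW_NS) {
        /* Window rolled over: restart the busy counter. A run that
         * crosses a boundary would have to be split across windows,
         * the inaccuracy noted above for non-periodic threads. */
        ws->window_start += WINDOW_NS;
        ws->busy_ns = 0;
    }
    ws->busy_ns += end - start;
}

int main(void)
{
    struct window_stats ws = { 0, 0 };
    account_run(&ws, 2000000, 10000000); /* 8 ms of work in window 0 */
    printf("demand: %.0f%%\n", 100.0 * ws.busy_ns / WINDOW_NS); /* 40% */
    return 0;
}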


Tags: scheduler, CFS, PELT, big-LITTLE, WALT
Written by Linux Code Review Hub, a professional Linux technology community and learning platform covering the kernel, memory management, process management, file system and I/O, performance tuning, device drivers, virtualization, and cloud computing.