Understanding Fair CFS Scheduling Classes: SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE
The article explains how Linux's Completely Fair Scheduler (CFS) uses nice values and weight tables to compute virtual runtime, organizes tasks in a red‑black tree, and demonstrates through experiments how SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE differ in priority, preemption rules, and CPU share.
The fair class of the Linux scheduler aims to give each runnable task an equal share of CPU time: with five runnable tasks, each would receive 20% in the ideal case. In practice the nice value (range –20 to 19) determines a task's weight, where a lower nice gives a higher weight and thus more CPU.
Early O(1) schedulers adjusted the time‑slice size directly based on nice. Modern CFS instead scales the rate at which a task's virtual runtime (vruntime) advances, using a weight derived from the nice value: the higher the weight, the more slowly vruntime accumulates for the same wall-clock time. The baseline weight for nice = 0 is 1024, and the kernel's weight table maps each of the 40 nice levels to a specific weight.
CFS stores all scheduling entities in a single red‑black tree ordered by vruntime, ensuring the task with the smallest virtual runtime runs next. This data structure provides O(log n) insertion and lookup, which is far more efficient than a linked‑list implementation that would incur O(n) cost.
An experiment creates two additional threads (three in total, counting the main thread), each spinning in a while(1) busy loop. The process is bound to CPU 0 with taskset to eliminate load‑balancing effects. Under top -H, the initial CPU distribution is roughly equal at about 33% per thread.
Next, the nice values are changed: thread 4606 is reniced to –1 (higher priority, weight 1277) and thread 4608 to +1 (lower priority, weight 820). The results show the –1 thread consuming about 1.25× the CPU of the unchanged thread (1277/1024), while the +1 thread receives about 0.8× (820/1024), confirming that weight directly determines CPU share. Raising priority requires sudo, whereas lowering it does not.
The article notes that if the red‑black tree were replaced by a simple list, insertion would degrade from O(log n) to O(n), dramatically increasing scheduling overhead.
CFS defines three scheduling policies:
SCHED_NORMAL: regular tasks.
SCHED_BATCH: non‑interactive batch jobs.
SCHED_IDLE: very low‑priority tasks (weight defined by WEIGHT_IDLEPRIO = 3).
All three policies share the same red‑black tree rather than using separate trees. The differences appear in the fair sched_class's wakeup_preempt callback, check_preempt_wakeup_fair(): a running idle task can be pre‑empted by any non‑idle task that wakes, while batch and idle wakers are prevented from pre‑empting a running normal task, and pre‑emption among batch or idle tasks themselves is likewise blocked.
A final code example assigns the three threads different policies and nice values: one SCHED_NORMAL thread and one SCHED_BATCH thread at nice 19 (weight 15), plus one SCHED_IDLE thread (weight 3). After compiling and running on a single CPU, top -H shows that SCHED_NORMAL does not starve SCHED_BATCH or SCHED_IDLE: the two weight‑15 threads obtain identical CPU usage, each about five times that of the idle thread with weight 3. The distinction among the three policies lies mainly in which tasks may pre‑empt others when they wake, not in the raw CPU share, which follows the usual weight‑based CFS calculation. The same behavior can be verified with the chrt command.