Mastering Linux Process Scheduling: From FCFS to CFS Explained
This comprehensive guide walks through the fundamentals of process scheduling in operating systems, covering scheduling concepts, OS classifications, process types, priority mechanisms, preemptive vs cooperative models, design goals, classic algorithms such as FCFS, SJF, RR, MLFQ, and the evolution of Linux schedulers from O(n) and O(1) to the modern CFS, complete with code excerpts and practical examples.
Prerequisite Knowledge
What scheduling actually is
Types of operating systems and the most common one
Process classification and priority concepts
Difference between pre‑emptive and non‑pre‑emptive scheduling
How to design a usable scheduler
Scheduling Concept
Scheduling is the solution to the mismatch between limited resources and abundant demand, a problem that appears both in daily life (e.g., supermarket stock allocation, railway capacity planning) and in computers where CPU time is scarce compared to the number of runnable processes.
Scheduling solves the resource‑demand mismatch; typically resources are scarce while demand is abundant.
Operating‑System Classification
Operating systems can be grouped into three major families:
Batch Processing Systems – jobs are submitted in bulk and run without user interaction.
Real‑Time Systems – correctness depends on both logical results and timing constraints.
Time‑Sharing Systems – CPU time is divided into short slices so that many users feel they have exclusive access. Linux is a time‑sharing system.
Process Classification
Processes are categorized by their dominant behavior:
I/O‑bound : spend most of their time waiting for I/O, e.g., a socket‑listening daemon.
CPU‑bound : spend most of their time performing calculations.
I/O‑bound processes are usually given higher scheduling priority, while CPU‑bound processes receive lower priority.
Process Priority
Linux distinguishes two priority ranges:
Real‑Time processes: PR 0‑99 (higher number = higher priority).
Normal processes: PR 100‑139 (lower number = higher priority).
Nice value (‑20 to 19) influences PR for normal processes: PR_new = PR_old + Nice.
Positive nice values make a process more “polite” (lower priority), negative values make it less polite (higher priority).
Cooperative vs. Pre‑emptive Scheduling
Cooperative (or collaborative) scheduling – a process runs until it voluntarily yields or finishes.
Pre‑emptive scheduling – the kernel can interrupt a running process and switch to another one.
Linux uses pre‑emptive scheduling to improve CPU utilization and response time, at the cost of additional context‑switch overhead.
Scheduler Design Goals
Turnaround time : total time from a process entering the run queue to its completion.
Response time : time from a process entering the run queue to its first execution on the CPU.
Real‑time processes should always be scheduled before normal processes; I/O‑bound processes receive frequent, short slices, while CPU‑bound processes receive longer slices.
Design Implementation
Implementing a scheduler involves two core parts:
Algorithm design – the theoretical scheduling policy.
Engineering implementation – turning the algorithm into kernel code.
Classic Scheduling Algorithms
First‑Come‑First‑Served (FCFS)
Processes are executed in the order they arrive. Example calculations show average turnaround time of 20 ms and average response time of 6.67 ms for a short‑task scenario, but the algorithm suffers from the “convoy effect” when a long task blocks later processes.
Shortest Job First (SJF)
Always run the shortest ready task first, reducing both turnaround and response times compared with FCFS. However, it requires knowledge of task lengths, which is rarely available.
Pre‑emptive SJF (PSJF)
Combines SJF with pre‑emptive switching, improving response time but introducing possible starvation for long tasks.
Round‑Robin (RR)
Each process receives a fixed time slice; if the slice expires, the scheduler picks the next process. Short slices increase context‑switch overhead, long slices increase response time.
Multilevel Feedback Queue (MLFQ)
Processes start in a high‑priority queue with short slices; based on runtime behavior they may be demoted to lower‑priority queues with longer slices. The design mitigates starvation by periodically boosting all tasks.
Linux Scheduler Evolution
O(n) scheduler (kernel 2.4‑2.6)
O(1) scheduler (kernel 2.6.0‑2.6.22)
CFS scheduler (kernel 2.6.23‑present)
O(n) Scheduler
Uses a single global runqueue; the kernel scans the list to find the next process. The core data structure is struct task_struct:
struct task_struct {
long counter;
long nice;
unsigned long policy;
int processor;
unsigned long cpus_runnable, cpus_allowed;
}The goodness() function computes a weight based on remaining time slice, nice value, and whether the process is real‑time.
static inline int goodness(struct task_struct *p, int this_cpu, struct mm_struct *this_mm) {
int weight = -1;
if (p->policy & SCHED_YIELD)
goto out;
if (p->policy == SCHED_OTHER) {
weight = p->counter;
if (!weight)
goto out;
#ifdef CONFIG_SMP
if (p->processor == this_cpu)
weight += PROC_CHANGE_PENALTY;
#endif
if (p->mm == this_mm || !p->mm)
weight += 1;
weight += 20 - p->nice;
goto out;
}
weight = 1000 + p->rt_priority;
out:
return weight;
}Key observations:
If counter reaches zero, the process is not selected.
Real‑time priority = 1000 + static priority, far higher than normal processes.
O(1) Scheduler
Introduces per‑CPU runqueues and two arrays (active and expired) to achieve O(1) pick‑next complexity. The main structure is:
struct runqueue {
spinlock_t lock;
unsigned long nr_running;
unsigned long long nr_switches;
unsigned long expired_timestamp, nr_uninterruptible;
unsigned long long timestamp_last_tick;
task_t *curr, *idle;
struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
int best_expired_prio;
atomic_t nr_iowait;
...
};Each priority level has a bitmap and a linked list of tasks. Picking the next task involves finding the first set bit in the bitmap:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);When the active array is empty, active and expired pointers are swapped instantly.
CFS Scheduler
CFS (Completely Fair Scheduler) abandons fixed time slices and allocates CPU proportionally to a process’s weight, derived from its nice value. The mapping array:
const int sched_prio_to_weight[40] = {
88761, 71755, 56483, 46273, 36291,
29154, 23254, 18705, 14949, 11916,
9548, 7620, 6100, 4904, 3906,
3121, 2501, 1991, 1586, 1277,
1024, 820, 655, 526, 423,
335, 272, 215, 172, 137,
110, 87, 70, 56, 45,
36, 29, 23, 18, 15
};Virtual runtime (vruntime) equalizes the weighted CPU share. For a process with weight w, its vruntime increments as wall_time * (1024 / w). The kernel stores processes in a red‑black tree keyed by vruntime; the left‑most node is the next to run.
const u32 sched_prio_to_wmult[40] = {
48388, 59856, 76040, 92818, 118348,
147320, 184698, 229616, 287308, 360437,
449829, 563644, 704093, 875809, 1099582,
1376151, 1717300, 2157191, 2708050, 3363326,
4194304, 5237765, 6557202, 8165337, 10153587,
12820798, 15790321, 19976592, 24970740, 31350126,
39045157, 49367440, 61356676, 76695844, 95443717,
119304647, 148102320, 186737708, 238609294, 286331153
};CFS also supports multiple scheduling classes (deadline, real‑time, normal, idle) allowing the kernel to run different policies side‑by‑side.
Key Takeaways
O(n) scheduler suffers from global lock contention and poor cache locality.
O(1) scheduler introduces per‑CPU runqueues and a multilevel feedback queue with 140 priority levels.
CFS uses a red‑black tree and virtual runtime to achieve “complete fairness” while adapting slice length dynamically.
All three designs try to identify interactive (I/O‑bound) tasks, but perfect classification is impossible.
CFS’s weight‑based virtual runtime provides a more adaptable solution, though it is not a universal silver bullet.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
