What Optimizations Did Zang Chunxin Introduce to the Linux EEVDF Scheduler?
The article traces the evolution of Linux CPU scheduling from CFS to the EEVDF algorithm, explains EEVDF’s deadline‑driven allocation with concrete task examples, and details Zang Chunxin’s patch that enables shorter slices and wake‑up preemption to reduce latency.
Timeline of Linux Scheduler Development
1991 – Linus Torvalds releases the Linux kernel.
1995 – Hussein Abdel‑Wahab publishes the EEVDF algorithm (Earliest Eligible Virtual Deadline First) for proportional‑share resource allocation.
2004 – Ingo Molnar rejects a proposal for a pluggable CPU scheduler framework, arguing it would fragment scheduler development.
2023 – Peter Zijlstra merges EEVDF into the mainline kernel, replacing the long‑standing Completely Fair Scheduler (CFS).
2024 – Tejun Heo and others integrate an eBPF‑based sched‑ext framework, enabling scheduler plug‑ins.
2025 – At the CLK2025 scheduling forum, Tencent engineer Zang Chunxin presents optimizations to EEVDF.
CFS Fairness Example
CFS distributes CPU time proportionally to task weights derived from nice values. For three tasks with weights A=2, B=1, C=1, a scheduling period is divided 2:1:1.
EEVDF Allocation
EEVDF also respects weight ratios but schedules tasks by earliest virtual deadline. The task with the smallest deadline runs first; a later‑arriving task with an even earlier deadline can pre‑empt the running one.
EEVDF introduces two factors: the original CFS nice weight and the requested size r (CPU time a task wants per activation). Assuming r = 1 for all three tasks, their target shares become 50 %, 25 %, and 25 % respectively.
At time 0, task A must finish its first unit before physical time 2 to keep a 50 % share, giving it a physical deadline of 2. Tasks B and C have deadlines of 4. These deadlines define each task’s scheduling period (A: period 2, B/C: period 4).
When all tasks are eligible at time 0, task A has the smallest virtual time and runs first for one unit of work (size 1). After running one physical unit, A’s virtual time advances to 0.5, making it ineligible until the global virtual time reaches 0.5 (physical time 2). It then becomes eligible again, preserving its 50 % quota.
Eligibility Condition
The eligibility test can be expressed as:
task_virtual_time * task_weight + ... >= task_virtual_time * total_weightIn the kernel the check is implemented as:
avg_vruntime() >= se->vruntimeSlice Protection
To limit excessive pre‑emptions, EEVDF adds slice protection: a task must run for a minimum slice before it can be pre‑empted. The default slice is 0.7 ms unless the RUN_TO_PARITY feature is enabled.
unsigned int normalized_sysctl_sched_base_slice = 700000ULL;When RUN_TO_PARITY is enabled, the slice becomes:
0.70 msec * (1 + ilog(ncpus)) // units: nanosecondsZang Chunxin’s Patch
The initial patch, titled "sched/fair: Reschedule the cfs_rq when current is ineligible" , forces a reschedule when a newly woken task finds the current task ineligible, reducing wake‑up latency.
The final patch merged by Peter Zijlstra is named "sched/eevdf: Allow shorter slices to wakeup‑preempt" and introduces the PREEMPT_SHORT feature. When enabled and the newly woken task’s slice is smaller than the current task’s slice ( pse->slice < se->slice), slice protection is disabled, allowing immediate pre‑emption.
The scheduler calls __pick_eevdf() with !do_preempt_short as the second argument; when PREEMPT_SHORT is active, the protect flag is false and the slice‑protection code is skipped, permitting the new task to be selected immediately.
If a task’s request size is not set by userspace, it defaults to sysctl_sched_base_slice. Users can specify a custom slice via sched_attr.sched_runtime, expressing their size demand.
Open Discussion
Two discussion points were raised:
The granularity of PREEMPT_SHORT is coarse; eBPF could be used to implement finer‑grained task classification for pre‑emption.
Multi‑core load balancing for deadline‑driven tasks: whether urgent tasks can be distributed across CPUs without being blocked by real‑time tasks. An example scenario considered an 8‑CPU system with eight small‑request tasks, suggesting possible feasibility.
References
EEVDF paper: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
Ingo Molnar’s 2004 comment: https://lwn.net/Articles/109460/
Patch discussion (CLK2025): https://lore.kernel.org/lkml/[email protected]/
Final patch commit: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-unstable&id=85e511df3cec46021024176672a748008ed135bf
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
