How BPF Powers the Linux sched_ext Scheduler: In‑Depth Implementation and Workflow
This article provides a comprehensive technical walkthrough of Linux's sched_ext scheduler extension, explaining how BPF enables custom scheduling policies, detailing the underlying CFS and EEVDF concepts, the new SCHED_EXT class, dispatch queues, kernel configuration, and practical code examples for building and testing BPF‑based schedulers.
Linux Process Scheduler Overview
The Linux scheduler decides which task runs, on which CPU, and for how long, balancing fairness and efficiency across all tasks and hardware platforms.
CFS Scheduler
Since kernel 2.6.23, the Completely Fair Scheduler (CFS) has kept runnable tasks in a red-black tree ordered by virtual runtime (vruntime) and picks the task with the smallest value. Time slices are adjusted dynamically according to task priority and the CPU time a task has already consumed.
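The vruntime idea can be illustrated with a small userspace sketch. It is not the kernel's implementation: a linear scan stands in for the red-black tree, and the names toy_task, toy_account, and toy_pick_next are hypothetical. The only value borrowed from the kernel is the nice-0 load weight of 1024.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u  /* load weight of a nice-0 task */

struct toy_task {
    const char *name;
    uint32_t weight;    /* higher weight means higher priority */
    uint64_t vruntime;  /* weighted CPU time consumed so far, in ns */
};

/* Charge delta_ns of real CPU time; heavier tasks accrue vruntime more slowly. */
static void toy_account(struct toy_task *t, uint64_t delta_ns)
{
    assert(t != NULL && t->weight > 0);
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* Pick the task with the smallest vruntime (linear scan instead of an rbtree). */
static struct toy_task *toy_pick_next(struct toy_task *tasks, size_t n)
{
    struct toy_task *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (best == NULL || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}
```

After both tasks run for the same wall-clock time, the task with twice the weight has half the vruntime and is therefore picked first.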
EEVDF Scheduler
EEVDF (Earliest Eligible Virtual Deadline First) was introduced in kernel 6.6. It orders tasks by a virtual deadline computed from priority and received CPU time, giving latency‑sensitive workloads lower scheduling latency without sacrificing overall throughput.
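A simplified sketch of the selection rule may help. It assumes that a task is eligible when its vruntime does not exceed the weighted average vruntime, and that its virtual deadline is its vruntime plus its requested slice scaled by weight. The kernel tracks these quantities incrementally in an augmented tree; the ev_* names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ev_task {
    uint64_t vruntime;  /* virtual time already received */
    uint64_t slice_ns;  /* requested slice; shorter means more latency-sensitive */
    uint32_t weight;    /* 1024 corresponds to nice 0 */
};

/* Virtual deadline: vruntime plus the slice scaled by the task's weight. */
static uint64_t ev_deadline(const struct ev_task *t)
{
    assert(t->weight > 0);
    return t->vruntime + t->slice_ns * 1024 / t->weight;
}

/* Weighted average vruntime of all runnable tasks (0 if there are none). */
static uint64_t ev_avg_vruntime(const struct ev_task *t, size_t n)
{
    uint64_t wsum = 0, acc = 0;

    for (size_t i = 0; i < n; i++) {
        wsum += t[i].weight;
        acc += t[i].vruntime * t[i].weight;
    }
    return wsum ? acc / wsum : 0;
}

/* Index of the eligible task with the earliest virtual deadline, or -1. */
static int ev_pick(const struct ev_task *t, size_t n)
{
    uint64_t avg = ev_avg_vruntime(t, n);
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (t[i].vruntime > avg)
            continue;  /* has received more than its share: not eligible */
        if (best < 0 || ev_deadline(&t[i]) < ev_deadline(&t[best]))
            best = (int)i;
    }
    return best;
}
```

In this model a task that is ahead of its fair share is skipped even if its deadline is the earliest, which is the "eligible" part of the name.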
Motivation for an Extensible Scheduler
General‑purpose schedulers must serve a wide range of workloads and hardware, but custom schedulers can achieve better performance for specific scenarios at the cost of kernel maintenance. The SCHED_EXT class provides a non‑privileged scheduling class that can be programmed with eBPF, allowing users to implement custom scheduling logic without modifying the kernel source.
Implementation of SCHED_EXT
Scheduling class
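Tasks are assigned to the SCHED_EXT class via sched_setscheduler(). In the scheduler's class priority order, the ext class sits between the idle class and the fair class that serves SCHED_NORMAL, so it ranks below the real-time and deadline classes, and any process can select it.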
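A minimal userspace sketch of that call follows. The policy number 7 is taken from the kernel's uapi linux/sched.h; older libc headers may not define SCHED_EXT, hence the guard. The helper name switch_to_sched_ext is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

#ifndef SCHED_EXT
#define SCHED_EXT 7  /* from the kernel's uapi linux/sched.h */
#endif

/* Move a task to SCHED_EXT (pid 0 means the caller).
 * Returns 0 on success or a negative errno value. */
static int switch_to_sched_ext(pid_t pid)
{
    struct sched_param param = { .sched_priority = 0 };

    assert(pid >= 0);
    if (sched_setscheduler(pid, SCHED_EXT, &param) == -1)
        return -errno;
    return 0;
}
```

On a kernel built without sched_ext the call is expected to fail, most likely with EINVAL, so callers should check the return value rather than assume the switch happened.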
Earlier versions exposed a helper scx_bpf_switch_all() that automatically moved newly created tasks to the ext class; it has been removed (see [7]).
eBPF hook functions
Within the SCHED_EXT class, the loaded BPF program may implement a set of operations (e.g., select_cpu, enqueue, dispatch, running), which kernel functions such as enqueue_task_scx() and set_next_task_scx() invoke at the matching points. If an operation is not provided, the kernel falls back to its default behavior.
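The optional-callback pattern can be modelled in plain C with a table of function pointers. This is only an illustration of the idea; toy_ops, TOY_HAS_OP, and the default that keeps the previous CPU are hypothetical and do not mirror the kernel's actual defaults.

```c
#include <assert.h>
#include <stddef.h>

/* A table of optional callbacks; unset entries stay NULL. */
struct toy_ops {
    int (*select_cpu)(int prev_cpu);
};

#define TOY_HAS_OP(ops, op) ((ops)->op != NULL)

/* Default behavior used when the callback is missing. */
static int toy_default_select_cpu(int prev_cpu)
{
    return prev_cpu;
}

/* Call the user-supplied op if present, otherwise fall back to the default. */
static int toy_select_cpu(const struct toy_ops *ops, int prev_cpu)
{
    assert(ops != NULL);
    if (TOY_HAS_OP(ops, select_cpu))
        return ops->select_cpu(prev_cpu);
    return toy_default_select_cpu(prev_cpu);
}

/* Example custom policy: always suggest CPU 0. */
static int toy_always_cpu0(int prev_cpu)
{
    (void)prev_cpu;
    return 0;
}
```

An empty table yields the default result, while a table with select_cpu set to toy_always_cpu0 overrides it.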
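In the kernel, set_next_task_scx() invokes ops.running() only if the loaded scheduler provides it:
static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
{
    /* … */
    /* Call ops.running() only if the scheduler implements it and the task is queued. */
    if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
        SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
    clr_task_runnable(p, true);
    /* … */
}
Sample registration in the reference scheduler scx_simple: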
SCX_OPS_DEFINE(simple_ops,
    .select_cpu = (void *)simple_select_cpu,
    .enqueue    = (void *)simple_enqueue,
    .dispatch   = (void *)simple_dispatch,
    .running    = (void *)simple_running,
    .name       = "simple"
);
Dispatch Queues (DSQ)
sched_ext introduces Dispatch Queues (DSQ) that can act as FIFO or priority queues. The kernel provides a global FIFO (SCX_DSQ_GLOBAL) and per-CPU local DSQs (SCX_DSQ_LOCAL). Users can create additional DSQs with scx_bpf_create_dsq() and destroy them with scx_bpf_destroy_dsq().
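The two queueing modes can be modelled in userspace as follows. This is a toy array-backed queue, not the kernel's data structure; all toy_dsq names are hypothetical, and each queue is used in only one mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_DSQ_CAP 16

struct toy_entry { int pid; uint64_t vtime; };
struct toy_dsq   { struct toy_entry e[TOY_DSQ_CAP]; size_t len; };

/* FIFO mode: append at the tail. Returns 0, or -1 if the queue is full. */
static int toy_dsq_push_fifo(struct toy_dsq *q, int pid)
{
    assert(q != NULL);
    if (q->len == TOY_DSQ_CAP)
        return -1;
    q->e[q->len++] = (struct toy_entry){ .pid = pid, .vtime = 0 };
    return 0;
}

/* Priority mode: keep entries sorted by ascending vtime; ties keep insertion order. */
static int toy_dsq_push_vtime(struct toy_dsq *q, int pid, uint64_t vtime)
{
    size_t i;

    assert(q != NULL);
    if (q->len == TOY_DSQ_CAP)
        return -1;
    for (i = q->len; i > 0 && q->e[i - 1].vtime > vtime; i--)
        q->e[i] = q->e[i - 1];  /* shift later entries one slot back */
    q->e[i] = (struct toy_entry){ .pid = pid, .vtime = vtime };
    q->len++;
    return 0;
}

/* Consume the head entry. Returns its pid, or -1 if the queue is empty. */
static int toy_dsq_pop(struct toy_dsq *q)
{
    int pid;

    assert(q != NULL);
    if (q->len == 0)
        return -1;
    pid = q->e[0].pid;
    memmove(&q->e[0], &q->e[1], (q->len - 1) * sizeof(q->e[0]));
    q->len--;
    return pid;
}
```

Consumers always take the head, so the only difference between the modes is where a new entry is inserted.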
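scx_simple creates its shared DSQ in its init callback:
#define SHARED_DSQ 0
s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
    /* The second argument is the NUMA node; -1 means no preference. */
    return scx_bpf_create_dsq(SHARED_DSQ, -1);
}
Scheduling cycle workflow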
When a task wakes, ops.select_cpu() is called first to suggest a CPU. If the selected CPU is idle, it is also woken up.
If ops.select_cpu() did not dispatch the task directly, ops.enqueue() decides whether to place the task in the global DSQ, the local DSQ, or a custom DSQ.
When a CPU is ready to schedule, it first checks its local DSQ, then the global DSQ. If both are empty, it calls ops.dispatch(), which can fill the local DSQ with scx_bpf_dispatch() or scx_bpf_consume().
The kernel can abort a misbehaving BPF scheduler (e.g., when a runnable task is not scheduled within 30 seconds) and fall back to the default fair class (CFS or EEVDF).
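The order in which a CPU looks for work can be sketched as a small userspace model. The queues are plain ring buffers, and cycle_ops_dispatch() is a hypothetical stand-in for a BPF scheduler's ops.dispatch() that moves one task from a custom DSQ to the local DSQ.

```c
#include <assert.h>
#include <stddef.h>

#define Q_CAP 8

struct toy_q { int pid[Q_CAP]; int head, len; };

static int q_push(struct toy_q *q, int pid)
{
    assert(q != NULL);
    if (q->len == Q_CAP)
        return -1;
    q->pid[(q->head + q->len) % Q_CAP] = pid;
    q->len++;
    return 0;
}

static int q_pop(struct toy_q *q)
{
    int pid;

    assert(q != NULL);
    if (q->len == 0)
        return -1;
    pid = q->pid[q->head];
    q->head = (q->head + 1) % Q_CAP;
    q->len--;
    return pid;
}

struct toy_cpu   { struct toy_q local; };
struct toy_sched { struct toy_q global; struct toy_q custom; };

/* Stand-in for ops.dispatch(): move one task from the custom DSQ to the local DSQ. */
static void cycle_ops_dispatch(struct toy_sched *s, struct toy_cpu *cpu)
{
    int pid = q_pop(&s->custom);

    if (pid >= 0)
        q_push(&cpu->local, pid);
}

/* Returns the pid to run next, or -1 if the CPU should go idle. */
static int cycle_pick_next(struct toy_sched *s, struct toy_cpu *cpu)
{
    int pid = q_pop(&cpu->local);   /* 1. local DSQ */

    if (pid >= 0)
        return pid;
    pid = q_pop(&s->global);        /* 2. global DSQ */
    if (pid >= 0)
        return pid;
    cycle_ops_dispatch(s, cpu);     /* 3. let the scheduler refill the local DSQ */
    return q_pop(&cpu->local);
}
```

With one task in each queue, successive calls return the local task, then the global one, then the task the dispatch callback pulled from the custom DSQ.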
Enabling and Using SCHED_EXT
Kernel configuration options required for SCHED_EXT support:
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
The latest patch series, v7 (see [9]), is available in the tree at https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/. After building the kernel, the tools in tools/sched_ext can be used to load, test, and inspect a BPF scheduler.
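Before building, it can be handy to check a kernel config file against the option list above. The following sketch scans config text held in memory for each required line; the helper names are hypothetical, and how the text is obtained (for example from a .config file in the build tree) is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The options listed in this article. */
static const char *const scx_required[] = {
    "CONFIG_BPF=y",
    "CONFIG_SCHED_CLASS_EXT=y",
    "CONFIG_BPF_SYSCALL=y",
    "CONFIG_BPF_JIT=y",
    "CONFIG_DEBUG_INFO_BTF=y",
    "CONFIG_BPF_JIT_ALWAYS_ON=y",
    "CONFIG_BPF_JIT_DEFAULT_ON=y",
    "CONFIG_PAHOLE_HAS_SPLIT_BTF=y",
    "CONFIG_PAHOLE_HAS_BTF_TAG=y",
};
#define SCX_REQUIRED_COUNT (sizeof(scx_required) / sizeof(scx_required[0]))

/* Returns 1 if `line` appears in `text` as a complete line, else 0. */
static int config_has_line(const char *text, const char *line)
{
    size_t n = strlen(line);
    const char *p = text;

    assert(n > 0);
    while ((p = strstr(p, line)) != NULL) {
        int at_start = (p == text) || (p[-1] == '\n');
        int at_end = (p[n] == '\n') || (p[n] == '\0');

        if (at_start && at_end)
            return 1;
        p++;
    }
    return 0;
}

/* Stores up to `cap` missing options in `missing`; returns how many are missing. */
static size_t scx_missing_options(const char *text, const char **missing, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i < SCX_REQUIRED_COUNT; i++) {
        if (config_has_line(text, scx_required[i]))
            continue;
        if (count < cap)
            missing[count] = scx_required[i];
        count++;
    }
    return count;
}
```

Matching whole lines avoids false positives such as a commented-out "# CONFIG_SCHED_CLASS_EXT is not set" entry.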
Build and run the sample scheduler:
# make -j16 -C tools/sched_ext
# tools/sched_ext/scx_simple
local=0 global=3
local=5 global=24
...
^CEXIT: BPF scheduler unregistered
Query runtime state:
# cat /sys/kernel/sched_ext/state
enabled
# cat /sys/kernel/sched_ext/root/ops
simple
If CONFIG_SCHED_DEBUG is enabled, the ext.enabled field in /proc/self/sched shows whether the current task is running under sched_ext.
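The same state check can be done from a program. This sketch reads the first line of the sysfs file named above; read_first_line and scx_is_enabled are hypothetical helper names, and a missing file is treated as "not enabled".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of `path` into `buf` without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(buf != NULL && len > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if a BPF scheduler is currently enabled, 0 otherwise. */
static int scx_is_enabled(void)
{
    char state[32];

    if (read_first_line("/sys/kernel/sched_ext/state", state, sizeof(state)) != 0)
        return 0;
    return strcmp(state, "enabled") == 0;
}
```

The name of the loaded scheduler can be read the same way from /sys/kernel/sched_ext/root/ops.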
Core Example: simple_select_cpu
s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
    bool is_idle = false;
    s32 cpu;

    cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
    if (is_idle) {
        /* branch 1: an idle CPU was found, dispatch straight to its local DSQ */
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
    }
    return cpu; /* branch 2: no direct dispatch, ops.enqueue() decides the queue */
}
The CPU returned by ops.select_cpu() is only a hint. If it matches the CPU on which the task finally runs, cache locality improves; otherwise the final decision is made later in the dispatch phase.
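The two branches can be modelled in userspace with a much simplified selection rule: prefer the previous CPU if it is idle, otherwise take any idle CPU, otherwise stay on the previous CPU. This rule is an assumption made for illustration; the real scx_bpf_select_cpu_dfl() helper applies more heuristics. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Simplified stand-in for the default CPU selection. */
static int toy_select_cpu_dfl(const bool idle[TOY_NR_CPUS], int prev_cpu, bool *is_idle)
{
    assert(prev_cpu >= 0 && prev_cpu < TOY_NR_CPUS);
    if (idle[prev_cpu]) {           /* keep the task where its cache is warm */
        *is_idle = true;
        return prev_cpu;
    }
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (idle[cpu]) {            /* otherwise take any idle CPU */
            *is_idle = true;
            return cpu;
        }
    }
    *is_idle = false;               /* nothing idle: stay put, enqueue later */
    return prev_cpu;
}

struct toy_result {
    int cpu;
    bool direct_dispatch;  /* true: task went straight to the CPU's local queue */
};

/* Mirrors the control flow of simple_select_cpu. */
static struct toy_result toy_simple_select_cpu(const bool idle[TOY_NR_CPUS], int prev_cpu)
{
    bool is_idle = false;
    struct toy_result r;

    r.cpu = toy_select_cpu_dfl(idle, prev_cpu, &is_idle);
    r.direct_dispatch = is_idle;
    return r;
}
```

When no CPU is idle, direct_dispatch stays false, which corresponds to the case where the enqueue hook has to pick a queue.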
Dispatch and Enqueue Hooks
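The priority path of the enqueue hook in this section relies on two small pieces of arithmetic: a comparison of virtual times that tolerates counter wraparound, and a clamp that keeps a task at most one slice behind the current virtual time. They can be tried in isolation as below. The 20 ms slice value is an assumption standing in for SCX_SLICE_DFL, and clamp_vtime is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SLICE_DFL 20000000ull  /* assumed 20 ms default slice, in ns */

/* True if a is earlier than b, even across u64 wraparound.
 * Relies on the usual two's-complement conversion to a signed type. */
static bool vtime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Keep vtime no more than one slice behind vtime_now, so a task that
 * slept for a long time cannot bank an unbounded amount of budget. */
static uint64_t clamp_vtime(uint64_t vtime, uint64_t vtime_now, uint64_t slice)
{
    if (vtime_before(vtime, vtime_now - slice))
        vtime = vtime_now - slice;
    return vtime;
}
```

A task whose vtime is far in the past is pulled forward to vtime_now minus one slice, while a task already within one slice of vtime_now is left unchanged.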
FIFO vs. priority dispatch in simple_enqueue:
void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
    if (fifo_sched) {
        /* FIFO mode: append to the shared DSQ. */
        scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
    } else {
        u64 vtime = p->scx.dsq_vtime;

        /* Limit the budget an idling task can accumulate to one slice. */
        if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
            vtime = vtime_now - SCX_SLICE_DFL;
        /* Priority mode: insert ordered by vtime. */
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
                               enq_flags);
    }
}
Consuming tasks from the shared DSQ in simple_dispatch:
void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(SHARED_DSQ);
}
Termination and Fallback
The kernel unregisters the BPF scheduler and returns all tasks to the default fair class (CFS or EEVDF) when SysRq-S is pressed, when it detects an internal error, or when a runnable task has not been scheduled for 30 s.
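The timeout condition can be pictured as a watchdog that scans runnable tasks and flags the first one that has waited too long. This is a userspace sketch of the idea only; the wd_* names and the bookkeeping fields are hypothetical and do not reflect how the kernel stores this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WD_TIMEOUT_NS (30ull * 1000000000ull)  /* the 30 s limit from the text */

struct wd_task {
    int pid;
    bool runnable;         /* sleeping tasks are never considered stalled */
    uint64_t last_run_ns;  /* timestamp of the last time the task ran */
};

/* Returns the pid of the first runnable task that has waited longer than
 * timeout_ns, or -1 if every runnable task ran recently enough. */
static int wd_find_stalled(const struct wd_task *t, size_t n,
                           uint64_t now_ns, uint64_t timeout_ns)
{
    for (size_t i = 0; i < n; i++) {
        if (!t[i].runnable)
            continue;
        assert(now_ns >= t[i].last_run_ns);
        if (now_ns - t[i].last_run_ns > timeout_ns)
            return t[i].pid;
    }
    return -1;
}
```

In this model, a non-negative result is the point at which the scheduler would be aborted and tasks handed back to the default class.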
References
[1] Linus makes the call to merge: the BPF-powered scheduler finally lands – https://mp.weixin.qq.com/s/dWPWuDtxQBM9Z_GXwKe0KQ
[2] Kernel git address – https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/
[3] Linux process management – https://www.ebpf.top/post/linux_process_mgr/
[4] Linux CFS scheduler: principles, design, and implementation (2023) – https://arthurchiao.art/blog/linux-cfs-design-and-implementation-zh/
[5] EEVDF – https://en.wikipedia.org/wiki/Earliest_eligible_virtual_deadline_first_scheduling
[6] EEVDF submission – https://lwn.net/ml/linux-kernel/[email protected]/
[7] Removal of scx_bpf_switch_all – https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/diff/Documentation/scheduler/sched-ext.rst?h=for-6.11&id=18b2bd03371b64fdb21b31eb48095099d95b56ef
[8] Core flow diagram – https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/tree/Documentation/scheduler/sched-ext.rst?h=for-6.11
[9] Patch V7 – https://lore.kernel.org/all/[email protected]/
