
How BPF Powers the Linux sched_ext Scheduler: In‑Depth Implementation and Workflow

This article provides a comprehensive technical walkthrough of Linux's sched_ext scheduler extension, explaining how BPF enables custom scheduling policies, detailing the underlying CFS and EEVDF concepts, the new SCHED_EXT class, dispatch queues, kernel configuration, and practical code examples for building and testing BPF‑based schedulers.

Linux Kernel Journey

Linux Process Scheduler Overview

The Linux scheduler decides which task runs, on which CPU, and for how long, balancing fairness and efficiency across all tasks and hardware platforms.

CFS Scheduler

Introduced in kernel 2.6.23, the Completely Fair Scheduler (CFS) keeps runnable tasks in a red‑black tree ordered by virtual runtime (vruntime) and always selects the task with the smallest vruntime. Time slices are adjusted dynamically based on task priority and accumulated CPU time.
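
The selection rule can be illustrated with a minimal userspace sketch. This is not kernel code: the cfs_task struct and both function names are hypothetical, a linear scan stands in for the red‑black tree, and overflow handling is omitted. The weight of 1024 for a nice-0 task follows the kernel's convention.

```c
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u  /* weight of a nice-0 task */

/* Hypothetical, simplified stand-in for a schedulable entity. */
struct cfs_task {
    uint64_t vruntime;  /* weighted CPU time consumed so far, in ns */
    uint32_t weight;    /* larger weight means higher priority */
};

/* Return the index of the task with the smallest vruntime, or -1 if n == 0.
 * The kernel uses a red-black tree; a linear scan keeps the sketch short. */
static int pick_min_vruntime(const struct cfs_task *tasks, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++)
        if (best < 0 || tasks[i].vruntime < tasks[best].vruntime)
            best = (int)i;
    return best;
}

/* Charge delta_ns of real runtime. Heavier tasks accrue vruntime more
 * slowly, so they are picked more often. */
static void charge_runtime(struct cfs_task *t, uint64_t delta_ns)
{
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}
```

With two tasks of weight 1024 and 2048 that each run for the same wall-clock time, the heavier task ends up with half the vruntime and is therefore selected next.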

EEVDF Scheduler

EEVDF (Earliest Eligible Virtual Deadline First) was introduced in kernel 6.6. It orders tasks by a virtual deadline computed from priority and received CPU time, giving latency‑sensitive workloads lower scheduling latency without sacrificing overall throughput.
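
A rough userspace sketch of the selection rule follows. It assumes the textbook formulation of EEVDF (a task is eligible when its vruntime does not exceed the weighted average, and its virtual deadline is its vruntime plus its weight-scaled slice). The eevdf_entity struct and function names are hypothetical, and the kernel tracks these quantities incrementally rather than recomputing them as done here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified scheduling entity. */
struct eevdf_entity {
    uint64_t vruntime;  /* virtual time already received */
    uint64_t slice;     /* requested slice, in ns */
    uint32_t weight;    /* 1024 for a nice-0 task */
};

/* Virtual deadline: vruntime plus the slice scaled by the task's weight. */
static uint64_t virtual_deadline(const struct eevdf_entity *e)
{
    return e->vruntime + e->slice * 1024u / e->weight;
}

/* Pick the eligible entity with the earliest virtual deadline.
 * Returns its index, or -1 if n == 0. Overflow is not handled. */
static int pick_eevdf(const struct eevdf_entity *es, size_t n)
{
    uint64_t wsum = 0, wvsum = 0;
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        wsum += es[i].weight;
        wvsum += (uint64_t)es[i].weight * es[i].vruntime;
    }
    for (size_t i = 0; i < n; i++) {
        /* Eligible when vruntime <= weighted average vruntime
         * (compared without division: v * sum(w) <= sum(w * v)). */
        if (es[i].vruntime * wsum > wvsum)
            continue;
        if (best < 0 ||
            virtual_deadline(&es[i]) < virtual_deadline(&es[best]))
            best = (int)i;
    }
    return best;
}
```

A task that requests a shorter slice gets an earlier deadline and so runs sooner, but a task that has already received more than its share is skipped until it becomes eligible again.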

Motivation for an Extensible Scheduler

General‑purpose schedulers must serve a wide range of workloads and hardware, but custom schedulers can achieve better performance for specific scenarios at the cost of kernel maintenance. The SCHED_EXT class provides a non‑privileged scheduling class that can be programmed with eBPF, allowing users to implement custom scheduling logic without modifying the kernel source.

Implementation of SCHED_EXT

Scheduling class

Tasks are assigned to the SCHED_EXT class via sched_setscheduler(). In class priority order, SCHED_EXT sits between SCHED_IDLE and SCHED_NORMAL, and because the class is non‑privileged, any process can select it.
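
A minimal sketch of such a call from userspace is shown below. The helper name switch_to_sched_ext is hypothetical, and the fallback definition of SCHED_EXT as 7 is an assumption based on the kernel's uapi header, used only when the installed libc headers do not define it.

```c
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

#ifndef SCHED_EXT
#define SCHED_EXT 7  /* assumed policy number if libc headers lack it */
#endif

/* Ask the kernel to move a task (pid 0 means the caller) to SCHED_EXT.
 * Returns 0 on success or a negative errno value on failure. */
static int switch_to_sched_ext(pid_t pid)
{
    struct sched_param param = { .sched_priority = 0 };

    if (sched_setscheduler(pid, SCHED_EXT, &param) == -1)
        return -errno;
    return 0;
}
```

On a kernel built without sched_ext the call is expected to fail with a negative errno value, so callers should check the result rather than assume the switch happened.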

Earlier versions exposed a helper scx_bpf_switch_all() that automatically moved newly created tasks to the ext class; it has been removed (see [7]).

eBPF hook functions

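Within the SCHED_EXT class, the loaded BPF program may implement a set of operations (e.g., select_cpu, enqueue, dispatch, running). Kernel-side functions such as enqueue_task_scx() and set_next_task_scx() invoke these operations when they are present; if an operation is not provided, the kernel falls back to the default flow. For example, set_next_task_scx() calls ops.running() only if the BPF scheduler defines it: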

static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) {
    /* … */
    if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED))
        SCX_CALL_OP_TASK(SCX_KF_REST, running, p);
    clr_task_runnable(p, true);
    /* … */
}
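
The "call the operation if present, otherwise use the default" pattern can be sketched in plain userspace C. Everything here (hook_ops, HOOK_HAS_OP, the function names) is hypothetical and only mimics the shape of the SCX_HAS_OP check above; it is not the kernel's implementation.

```c
#include <stddef.h>

struct hook_task { int id; int prev_cpu; };

/* Hypothetical ops table: a NULL pointer means "not implemented". */
struct hook_ops {
    int (*select_cpu)(const struct hook_task *t);
};

#define HOOK_HAS_OP(ops, op) ((ops)->op != NULL)

/* Default behaviour: keep the task on its previous CPU. */
static int default_select_cpu(const struct hook_task *t)
{
    return t->prev_cpu;
}

/* Dispatch to the user-provided hook when present, else fall back. */
static int hook_select_cpu(const struct hook_ops *ops,
                           const struct hook_task *t)
{
    if (HOOK_HAS_OP(ops, select_cpu))
        return ops->select_cpu(t);
    return default_select_cpu(t);
}

/* Example custom policy: always choose CPU 0. */
static int always_cpu0(const struct hook_task *t)
{
    (void)t;
    return 0;
}
```

An ops table with select_cpu left as NULL yields the task's previous CPU, while one that sets it to always_cpu0 yields CPU 0.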

Sample registration in the reference scheduler scx_simple:

SCX_OPS_DEFINE(simple_ops,
    .select_cpu = (void *)simple_select_cpu,
    .enqueue    = (void *)simple_enqueue,
    .dispatch   = (void *)simple_dispatch,
    .running    = (void *)simple_running,
    .name       = "simple"
);

Dispatch Queues (DSQ)

sched_ext introduces Dispatch Queues (DSQ) that can act as FIFO or priority queues. The kernel provides a global FIFO (SCX_DSQ_GLOBAL) and per‑CPU local DSQs (SCX_DSQ_LOCAL). Users can create additional DSQs with scx_bpf_create_dsq() and destroy them with scx_bpf_destroy_dsq().

#define SHARED_DSQ 0
s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init) {
    return scx_bpf_create_dsq(SHARED_DSQ, -1);
}
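
To make the difference between FIFO and priority ordering concrete, here is a small userspace model of a queue that supports both insertion modes. The model_dsq type and its functions are hypothetical; real DSQs live in the kernel and are manipulated only through the scx_bpf_* helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define DSQ_CAP 16

struct dsq_entry { int pid; uint64_t vtime; };

/* Hypothetical fixed-capacity model of a dispatch queue. */
struct model_dsq {
    struct dsq_entry items[DSQ_CAP];
    size_t len;
};

/* FIFO mode: append at the tail. Returns 0, or -1 if the queue is full. */
static int dsq_push_fifo(struct model_dsq *q, int pid)
{
    if (q->len == DSQ_CAP)
        return -1;
    q->items[q->len++] = (struct dsq_entry){ pid, 0 };
    return 0;
}

/* Priority mode: keep entries sorted by vtime; ties keep arrival order. */
static int dsq_push_vtime(struct model_dsq *q, int pid, uint64_t vtime)
{
    size_t i;

    if (q->len == DSQ_CAP)
        return -1;
    for (i = q->len; i > 0 && q->items[i - 1].vtime > vtime; i--)
        q->items[i] = q->items[i - 1];
    q->items[i] = (struct dsq_entry){ pid, vtime };
    q->len++;
    return 0;
}

/* Consume the head entry. Returns its pid, or -1 if the queue is empty. */
static int dsq_pop(struct model_dsq *q)
{
    int pid;

    if (q->len == 0)
        return -1;
    pid = q->items[0].pid;
    for (size_t i = 1; i < q->len; i++)
        q->items[i - 1] = q->items[i];
    q->len--;
    return pid;
}
```

In FIFO mode tasks leave in arrival order; in priority mode the task with the smallest vtime leaves first regardless of when it arrived.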

Scheduling cycle workflow

When a task wakes, ops.select_cpu() is called first. It returns a CPU as a placement hint, and if the selected CPU is idle, that CPU is woken up.

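Unless the task was already dispatched directly from ops.select_cpu(), ops.enqueue() is called next. It decides whether to place the task in the global DSQ, the local DSQ, or a custom DSQ, or to hold it in the BPF scheduler's own data structures for later dispatch.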

When a CPU is ready to schedule, it first checks its local DSQ and then the global DSQ. If both are empty, it calls ops.dispatch(), which can dispatch tasks with scx_bpf_dispatch() or move a task from a custom DSQ into the local DSQ with scx_bpf_consume().
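
The lookup order a CPU follows can be modelled in a few lines of userspace C. All names here (pick_queue, pick_cpu, refill_from_shared, pick_next) are hypothetical; the refill callback plays the role that a consume from a shared custom DSQ plays in ops.dispatch().

```c
#include <stddef.h>

#define Q_CAP 8

/* Tiny ring buffer of pids used to model each DSQ. */
struct pick_queue { int pids[Q_CAP]; size_t head, len; };

static int q_push(struct pick_queue *q, int pid)
{
    if (q->len == Q_CAP)
        return -1;
    q->pids[(q->head + q->len) % Q_CAP] = pid;
    q->len++;
    return 0;
}

static int q_pop(struct pick_queue *q)
{
    int pid;

    if (q->len == 0)
        return -1;
    pid = q->pids[q->head];
    q->head = (q->head + 1) % Q_CAP;
    q->len--;
    return pid;
}

/* Hypothetical per-CPU view: local DSQ, global DSQ, dispatch callback. */
struct pick_cpu {
    struct pick_queue *local;
    struct pick_queue *global;
    void (*dispatch)(struct pick_cpu *cpu);
};

/* Custom shared queue, standing in for a user-created DSQ. */
static struct pick_queue shared;

/* Example dispatch callback: move one task from shared to the local DSQ. */
static void refill_from_shared(struct pick_cpu *cpu)
{
    int pid = q_pop(&shared);

    if (pid >= 0)
        q_push(cpu->local, pid);
}

/* Returns the pid to run next, or -1 if the CPU should go idle. */
static int pick_next(struct pick_cpu *cpu)
{
    int pid = q_pop(cpu->local);    /* 1. local DSQ */

    if (pid >= 0)
        return pid;
    pid = q_pop(cpu->global);       /* 2. global DSQ */
    if (pid >= 0)
        return pid;
    if (cpu->dispatch)              /* 3. ask the scheduler to refill */
        cpu->dispatch(cpu);
    return q_pop(cpu->local);
}
```

With one task in each of the three queues, pick_next() returns the local task first, then the global one, and only then the task held in the shared queue.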

The kernel can terminate a misbehaving BPF scheduler (e.g., one that leaves a runnable task unscheduled for 30 seconds) and fall back to the default CFS or EEVDF class.

Enabling and Using SCHED_EXT

Kernel configuration options required for SCHED_EXT support:

CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y

The latest patch set, v7 (see [9]), is hosted at https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/. After building the kernel, the tools in tools/sched_ext can be used to load, test, and inspect a BPF scheduler.

Build and run the sample scheduler:

# make -j16 -C tools/sched_ext
# tools/sched_ext/build/bin/scx_simple
local=0 global=3
local=5 global=24
...
^CEXIT: BPF scheduler unregistered

Query runtime state:

# cat /sys/kernel/sched_ext/state
enabled
# cat /sys/kernel/sched_ext/root/ops
simple

If CONFIG_SCHED_DEBUG is enabled, the ext.enabled field in /proc/<pid>/sched shows whether a task is running under sched_ext. For the current process, read /proc/self/sched; a value of 1 means the task is on sched_ext.
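
This check can also be done programmatically. The sketch below assumes the field appears on its own line as "ext.enabled", followed by a colon and an integer; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/sched. Returns 1 and stores the value
 * if the line is the "ext.enabled" field, otherwise returns 0. */
static int parse_ext_enabled_line(const char *line, int *enabled)
{
    const char *colon;

    if (strncmp(line, "ext.enabled", 11) != 0)
        return 0;
    colon = strchr(line, ':');
    if (!colon || sscanf(colon + 1, "%d", enabled) != 1)
        return 0;
    return 1;
}

/* Returns 1 if the task is on sched_ext, 0 if it is not, and -1 if the
 * file cannot be read or the field is absent. */
static int task_on_sched_ext(const char *path)
{
    char line[256];
    int enabled = 0, found = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f))
        found = parse_ext_enabled_line(line, &enabled);
    fclose(f);
    return found ? (enabled != 0) : -1;
}
```

Calling task_on_sched_ext("/proc/self/sched") reports on the current process; a result of -1 usually means the kernel was built without the debug option or without sched_ext.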

Core Example: simple_select_cpu

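s32 BPF_STRUCT_OPS(simple_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
    bool is_idle = false;
    s32 cpu;

    /* Default CPU selection; is_idle reports whether the chosen CPU is idle. */
    cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
    if (is_idle) {
        /* Branch 1: the CPU is idle, so dispatch directly to its local DSQ. */
        scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
    }
    /* Branch 2: return the CPU as a hint; if the task was not dispatched
     * above, ops.enqueue() handles it next. */
    return cpu;
}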

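The CPU returned by ops.select_cpu() is an optimization hint, not a binding decision: the final placement is made later, in the dispatch phase. If the hint matches the CPU the task eventually runs on, cache locality improves and there is a small performance gain.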

Dispatch and Enqueue Hooks

FIFO vs. priority dispatch in simple_enqueue:

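void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
    if (fifo_sched) {
        /* FIFO mode: append to the shared DSQ in arrival order. */
        scx_bpf_dispatch(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
    } else {
        u64 vtime = p->scx.dsq_vtime;

        /* Limit the budget an idling task can accumulate to one slice. */
        if (vtime_before(vtime, vtime_now - SCX_SLICE_DFL))
            vtime = vtime_now - SCX_SLICE_DFL;

        /* Priority mode: insert into the shared DSQ ordered by vtime. */
        scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL, vtime,
                               enq_flags);
    }
}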
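
The vtime comparison and clamp used in the priority branch can be tried out in userspace. In this sketch the SLICE_NS value is only an illustrative assumption, and clamp_vtime is a hypothetical helper that isolates the two lines from simple_enqueue; the signed-difference comparison is the usual way to compare counters that may wrap around.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLICE_NS 20000000ull  /* illustrative slice length (assumption) */

/* Wraparound-safe "a is earlier than b" on a 64-bit virtual clock. */
static bool vtime_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Let a task's vtime lag the global clock by at most one slice, so a
 * long-sleeping task cannot bank an unbounded amount of credit. */
static uint64_t clamp_vtime(uint64_t vtime, uint64_t vtime_now)
{
    if (vtime_before(vtime, vtime_now - SLICE_NS))
        return vtime_now - SLICE_NS;
    return vtime;
}
```

A task whose vtime is far behind the global clock is pulled forward to exactly one slice behind it, while a task that is already within one slice keeps its own value.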

Consuming tasks from the shared DSQ in simple_dispatch:

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
    scx_bpf_consume(SHARED_DSQ);
}

Termination and Fallback

Triggering SysRq‑S, detecting an internal error, or leaving a runnable task unscheduled beyond the 30 s timeout causes the kernel to unregister the BPF scheduler and return all tasks to the default CFS/EEVDF class.
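
The timeout condition amounts to a simple scan over runnable tasks. The following userspace sketch uses hypothetical names (wd_task, watchdog_stalled) and millisecond timestamps for readability; it models only the check itself, not how the kernel tracks tasks or performs the fallback.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WATCHDOG_TIMEOUT_MS 30000u  /* the 30 s limit described above */

/* Hypothetical per-task bookkeeping for the stall check. */
struct wd_task {
    bool runnable;               /* waiting for a CPU right now? */
    uint64_t runnable_since_ms;  /* when it last became runnable */
};

/* Return true if any runnable task has waited longer than the timeout,
 * which is the condition that would trigger the fallback. */
static bool watchdog_stalled(const struct wd_task *tasks, size_t n,
                             uint64_t now_ms)
{
    for (size_t i = 0; i < n; i++)
        if (tasks[i].runnable &&
            now_ms - tasks[i].runnable_since_ms > WATCHDOG_TIMEOUT_MS)
            return true;
    return false;
}
```

Tasks that are sleeping are ignored: only a task that is runnable and still has not been given a CPU counts as stalled.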

References

[1] Linus makes the call to merge: the BPF‑powered scheduler finally lands (in Chinese) – https://mp.weixin.qq.com/s/dWPWuDtxQBM9Z_GXwKe0KQ

[2] Kernel git address – https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/

[3] Linux process management (in Chinese) – https://www.ebpf.top/post/linux_process_mgr/

[4] Linux CFS scheduler: principles, design, and implementation (2023, in Chinese) – https://arthurchiao.art/blog/linux-cfs-design-and-implementation-zh/

[5] EEVDF – https://en.wikipedia.org/wiki/Earliest_eligible_virtual_deadline_first_scheduling

[6] EEVDF submission – https://lwn.net/ml/linux-kernel/[email protected]/

[7] Removal of scx_bpf_switch_all – https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/diff/Documentation/scheduler/sched-ext.rst?h=for-6.11&id=18b2bd03371b64fdb21b31eb48095099d95b56ef

[8] Core flow diagram – https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/tree/Documentation/scheduler/sched-ext.rst?h=for-6.11

[9] Patch V7 – https://lore.kernel.org/all/[email protected]/
