SCHED_EXT: The Extensible Kernel Scheduler Class for Custom Scheduling

The article provides a detailed technical overview of Linux's new extensible scheduler class SCHED_EXT, explaining its architecture, eBPF‑based customization interfaces, core data structures, queue management, scheduling points, and a central‑scheduler example, while comparing it with traditional CFS and per‑CPU run‑queues.

Overview of the extensible scheduler class (sched_ext)

The class name Ext stands for “Extensible”. Developer Tejun Heo contributed a 34‑patch set that adds a new kernel scheduling class, ext_sched_class, which lets eBPF programs loaded from user space implement scheduling policies without rebuilding the kernel. The class is enabled with CONFIG_SCHED_CLASS_EXT; the group‑scheduling logic additionally requires CONFIG_EXT_GROUP_SCHED.

Core framework

The implementation consists of two parts:

BPF scheduler – loads and registers BPF programs and uses the BPF STRUCT_OPS feature (introduced in kernel 5.6) so that a program loaded from user space can supply the implementations behind kernel function pointers.

core scheduler – adds ext_sched_class that mirrors the basic operations of existing classes (enqueue, dequeue, preempt, task selection) while exposing many callbacks for customization.

Key source files for the core part:

include/linux/sched/ext.h – ops interface declarations and core data‑structure definitions.

kernel/sched/ext.h – ordinary function‑interface and flag declarations.

kernel/sched/ext.c – core implementation of the extensible scheduler.

Key source files for the BPF side (illustrated with the central scheduler example):

tools/sched_ext/scx_central.c – loading, registration, and monitoring logic.

tools/sched_ext/scx_central.bpf.c – the actual BPF policy implementation.

Extension interfaces

The header include/linux/sched/ext.h defines struct sched_ext_ops, and each operation is documented there. Taking select_cpu as an example: the core scheduler uses the macro SCX_HAS_OP to check whether the BPF scheduler registered the callback; if not, it falls back to the default scx_select_cpu_dfl(). Registration status is tracked in a static‑key array indexed by the offset of the function‑pointer member inside struct sched_ext_ops.
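The check-then-fall-back pattern can be sketched in ordinary user-space C. This is only a model under stated assumptions: struct demo_ops, DEMO_HAS_OP, demo_register and the other demo_ names are hypothetical stand-ins, not kernel identifiers, and a plain bool array replaces the kernel's static keys.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct sched_ext_ops. */
struct demo_ops {
    int  (*select_cpu)(int pid, int prev_cpu);
    void (*enqueue)(int pid);
    void (*dispatch)(int cpu);
};

/* Index of a callback = byte offset of its member / size of one function pointer. */
#define DEMO_OP_IDX(op) (offsetof(struct demo_ops, op) / sizeof(void (*)(void)))
#define DEMO_OP_CNT     (sizeof(struct demo_ops) / sizeof(void (*)(void)))

static struct demo_ops demo_ops;
static bool demo_has_op[DEMO_OP_CNT];   /* plain bools stand in for static keys */

#define DEMO_HAS_OP(op) (demo_has_op[DEMO_OP_IDX(op)])

/* Record which callbacks the (simulated) BPF scheduler provided. */
static void demo_register(const struct demo_ops *ops)
{
    demo_ops = *ops;
    demo_has_op[DEMO_OP_IDX(select_cpu)] = ops->select_cpu != NULL;
    demo_has_op[DEMO_OP_IDX(enqueue)]    = ops->enqueue != NULL;
    demo_has_op[DEMO_OP_IDX(dispatch)]   = ops->dispatch != NULL;
}

/* Default policy used when no callback is registered: stay on the previous CPU. */
static int demo_select_cpu_dfl(int pid, int prev_cpu)
{
    (void)pid;
    return prev_cpu;
}

/* Core-side wrapper: call the registered callback if present, else the default. */
static int demo_select_cpu(int pid, int prev_cpu)
{
    if (DEMO_HAS_OP(select_cpu))
        return demo_ops.select_cpu(pid, prev_cpu);
    return demo_select_cpu_dfl(pid, prev_cpu);
}

/* Example custom callback: always choose CPU 0. */
static int demo_pick_cpu0(int pid, int prev_cpu)
{
    (void)pid;
    (void)prev_cpu;
    return 0;
}
```

Registering an empty struct demo_ops makes demo_select_cpu() return the previous CPU; registering one whose select_cpu member points at demo_pick_cpu0 makes it return 0. Indexing by member offset means adding a callback to the struct needs no separate enumeration of operations.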

Core data structures

Each task belonging to the extensible class has a struct sched_ext_entity that stores scheduling state. Dispatch queues are represented by struct scx_dispatch_q. User‑created dispatch queues are stored in an rhashtable.

Scheduling logic

The ext_sched_class provides the standard enqueue, dequeue, preempt, and task‑selection hooks. Its priority is lower than the real‑time and fair classes. When the BPF scheduler is loaded with the switch_all option, all non‑real‑time, non‑deadline tasks are moved to ext_sched_class, and newly created tasks follow the same rule.
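The effect of switch_all on class assignment can be illustrated with a small user-space model. It assumes a BPF scheduler is already loaded; the enum and function names are hypothetical, and treating batch and idle policies like NORMAL is an assumption drawn from the "non‑real‑time, non‑deadline" wording above.

```c
#include <stdbool.h>

/* Hypothetical policy and class labels for illustration only. */
enum demo_policy {
    DEMO_POLICY_NORMAL,
    DEMO_POLICY_BATCH,
    DEMO_POLICY_IDLE,
    DEMO_POLICY_FIFO,
    DEMO_POLICY_RR,
    DEMO_POLICY_DEADLINE,
    DEMO_POLICY_EXT,
};

enum demo_class {
    DEMO_CLASS_DEADLINE,
    DEMO_CLASS_RT,
    DEMO_CLASS_FAIR,
    DEMO_CLASS_EXT,
};

/*
 * Map a task's policy to a scheduling class, assuming a BPF scheduler
 * is loaded. Deadline and real-time tasks are never moved.
 */
static enum demo_class demo_class_for(enum demo_policy policy, bool switch_all)
{
    switch (policy) {
    case DEMO_POLICY_DEADLINE:
        return DEMO_CLASS_DEADLINE;
    case DEMO_POLICY_FIFO:
    case DEMO_POLICY_RR:
        return DEMO_CLASS_RT;
    case DEMO_POLICY_EXT:
        return DEMO_CLASS_EXT;      /* explicitly opted in */
    default:
        /* NORMAL, BATCH, IDLE: moved only when switch_all is set. */
        return switch_all ? DEMO_CLASS_EXT : DEMO_CLASS_FAIR;
    }
}
```

Because the same function would be consulted when a task is created, newly forked tasks follow the same rule as tasks that existed when the scheduler was loaded.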

Queue maintenance

Instead of per‑CPU run‑queues, sched_ext uses dispatch queues (DSQs). Two categories exist:

Built‑in DSQs – one local DSQ per CPU and a single global DSQ.

User‑created DSQs – generated by BPF programs and kept in an rhashtable.

Each task carries a 64‑bit dsq_id field that encodes the queue type and, for local queues, the target CPU. Bit 63 indicates built‑in (1) or user‑created (0); bit 62 indicates local‑on with the lower bits holding the CPU number; the remaining bits are reserved or hold queue‑specific identifiers.
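The bit layout can be made concrete with a few helpers. This is a sketch based on the description above: the DEMO_ names are hypothetical, and using the low 32 bits for the CPU number is an assumption, since the text only says "the lower bits".

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DSQ_FLAG_BUILTIN   (1ULL << 63)   /* 1 = built-in, 0 = user-created */
#define DEMO_DSQ_FLAG_LOCAL_ON  (1ULL << 62)   /* local DSQ of a specific CPU */
#define DEMO_DSQ_CPU_MASK       0xffffffffULL  /* assumption: CPU in the low 32 bits */

/* Build the ID of the local DSQ that belongs to a given CPU. */
static uint64_t demo_dsq_local_on(uint32_t cpu)
{
    return DEMO_DSQ_FLAG_BUILTIN | DEMO_DSQ_FLAG_LOCAL_ON | cpu;
}

/* True for built-in DSQs, false for user-created ones. */
static bool demo_dsq_is_builtin(uint64_t dsq_id)
{
    return (dsq_id & DEMO_DSQ_FLAG_BUILTIN) != 0;
}

/*
 * If dsq_id names a specific CPU's local DSQ, store the CPU number in *cpu
 * and return true; otherwise leave *cpu untouched and return false.
 */
static bool demo_dsq_local_cpu(uint64_t dsq_id, uint32_t *cpu)
{
    const uint64_t both = DEMO_DSQ_FLAG_BUILTIN | DEMO_DSQ_FLAG_LOCAL_ON;

    if ((dsq_id & both) != both)
        return false;
    *cpu = (uint32_t)(dsq_id & DEMO_DSQ_CPU_MASK);
    return true;
}
```

For example, demo_dsq_local_on(5) has bits 63 and 62 set and decodes back to CPU 5, while a user-created ID such as 42 has bit 63 clear and would be looked up in the hash table instead.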

Scheduling timing

For the traditional fair class, scheduling decisions are triggered on wake‑up and on periodic tick interrupts. A task in the extensible class can be preempted by higher‑priority classes on wake‑up, but the class does not implement check_preempt_curr, so ext tasks cannot preempt one another on wake‑up. Between ext tasks, an involuntary context switch occurs only when the running task’s time slice (scx.slice) expires during a tick.
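The tick-driven behaviour can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel's tick handler: struct demo_task and the function names are hypothetical, and the slice is tracked in nanoseconds as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task scheduling state. */
struct demo_task {
    uint64_t slice_ns;      /* remaining time slice */
    bool     need_resched;  /* set when the task must give up the CPU */
};

/* Tick handler: charge the elapsed time and request a reschedule at zero. */
static void demo_task_tick(struct demo_task *t, uint64_t delta_ns)
{
    t->slice_ns -= (delta_ns < t->slice_ns) ? delta_ns : t->slice_ns;
    if (t->slice_ns == 0)
        t->need_resched = true;
}

/* Wake-up preemption check between two ext tasks: deliberately a no-op. */
static bool demo_wakeup_preempts(const struct demo_task *curr,
                                 const struct demo_task *woken)
{
    (void)curr;
    (void)woken;
    return false;
}
```

With a 20 ms slice and a 4 ms tick, the first four ticks leave need_resched clear and the fifth sets it, while a wake-up in between never does.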

Central scheduler example

The tools/sched_ext directory provides a sample “central scheduler”. Its design goals are:

Use a dedicated central CPU for dispatch decisions; other CPUs request work from it.

Operate tick‑less with unlimited time slices.

Allow a kernel thread to unconditionally pre‑empt; other tasks are pre‑empted via a kick to the target CPU.

Key operations (implemented in scx_central.c and scx_central.bpf.c) are:

ops.init – converts all NORMAL tasks to EXT, creates a fallback DSQ via scx_bpf_create_dsq, and starts a 1 ms periodic timer that monitors CPU execution. When a CPU’s slice expires, the timer issues a SCX_KICK_PREEMPT kick to clear the slice and force a reschedule.

ops.select_cpu – always returns the user‑specified central CPU.

ops.enqueue – for kernel‑thread tasks, sets dsq_id to SCX_DSQ_LOCAL and uses SCX_ENQ_PREEMPT to allow pre‑emption. For other tasks, pushes them into a BPF map central_q (capacity 4096). If the map is full, tasks fall back to the fallback DSQ and may trigger a kick to the central CPU.

ops.dispatch – runs when both the built‑in local and global DSQs are empty. On a non‑central CPU, it first tries to consume from the fallback DSQ; if that is empty, it kicks the central CPU. On the central CPU, it first serves pending requests from other CPUs (pulling tasks from central_q and dispatching them to the requesting CPU’s local DSQ), then consumes tasks for itself from either the fallback DSQ or central_q.
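The enqueue decision and the bounded central_q can be sketched as a user-space model. This is not the BPF code from scx_central.bpf.c: the demo_ names are hypothetical, a ring buffer stands in for the BPF map, and only the 4096-entry capacity and the three destinations come from the description above.

```c
#include <stdbool.h>

#define DEMO_CENTRAL_Q_CAP 4096u   /* capacity of central_q per the text */

enum demo_dest {
    DEMO_DEST_LOCAL_PREEMPT,   /* kernel thread: local DSQ, may preempt */
    DEMO_DEST_CENTRAL_Q,       /* normal case: queued for the central CPU */
    DEMO_DEST_FALLBACK,        /* central_q full: fallback DSQ */
};

struct demo_central {
    int      q[DEMO_CENTRAL_Q_CAP];  /* ring buffer standing in for the BPF map */
    unsigned head, tail;             /* free-running indices, tail - head = length */
    unsigned nr_fallback;            /* how often the fallback path was taken */
};

/* Decide where an enqueued task goes. */
static enum demo_dest demo_central_enqueue(struct demo_central *c, int pid,
                                           bool is_kthread)
{
    if (is_kthread)
        return DEMO_DEST_LOCAL_PREEMPT;

    if (c->tail - c->head == DEMO_CENTRAL_Q_CAP) {
        c->nr_fallback++;
        return DEMO_DEST_FALLBACK;
    }

    c->q[c->tail % DEMO_CENTRAL_Q_CAP] = pid;
    c->tail++;
    return DEMO_DEST_CENTRAL_Q;
}

/* Central CPU side: pull the oldest queued task, if any. */
static bool demo_central_pop(struct demo_central *c, int *pid)
{
    if (c->head == c->tail)
        return false;
    *pid = c->q[c->head % DEMO_CENTRAL_Q_CAP];
    c->head++;
    return true;
}
```

Tasks come out of the queue in the order they were enqueued, and the 4097th pending task is diverted to the fallback path. A bounded queue with an explicit overflow destination keeps the enqueue path from ever blocking, at the cost of bypassing the central policy under overload.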

The design demonstrates high flexibility but also drawbacks: no weight‑based fairness, heavy reliance on the central CPU, and limited community acceptance.

Conclusion

The extensible scheduler class provides a framework for building custom schedulers via eBPF. Its strength lies in the ability to experiment with and deploy workload‑specific policies without kernel recompilation. However, the same extensibility introduces complexity, synchronization overhead, and challenges for broad adoption.

References

Sched Ext source code: https://github.com/sched-ext/sched_ext/tree/sched_ext

The extensible scheduler class: https://lwn.net/Articles/922405/

Kernel operations structures in BPF: https://lwn.net/Articles/811631/

Introduce BPF STRUCT_OPS: https://lwn.net/Articles/809092/

eBPF Documentation: https://ebpf.io/what-is-ebpf/

BPF Kernel Functions (kfuncs): https://docs.kernel.org/bpf/kfuncs.html
