Understanding kfunc in Linux sched_ext: Mechanism and Usage
This article explains the kfunc mechanism that Linux's sched_ext builds on, covering instruction encoding, verification, and address fix-up. It then describes common sched_ext kfuncs such as scx_bpf_dispatch, scx_bpf_consume, and the dispatch-queue management APIs.
1. kfunc mechanism
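A kfunc is a kernel function that BPF programs are allowed to call directly. kfuncs complement the older BPF helper-function mechanism and, for new functionality, largely replace it. Each kfunc is identified by its unique btf_id. When a program is loaded, libbpf looks up the target kernel function's btf_id in the kernel's BTF and writes it into the call instruction. The in-kernel verifier then checks the call, and the final call target is computed as an offset from __bpf_call_base.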
1. Instruction modification
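insn->code    == (BPF_JMP | BPF_CALL)
insn->src_reg == BPF_PSEUDO_KFUNC_CALL  /* new flag */
insn->imm     == func_btf_id            /* kernel function's btf_id */
The kfunc sets (lists of BTF IDs declared with the BTF_ID macros) are emitted into the kernel's .BTF_ids ELF section. At build time, the resolve_btfids tool, run from the kernel's link script, patches each entry with the real BTF ID taken from the kernel's BTF. This lets the verifier match a call's btf_id against these sets.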
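A minimal userspace sketch of the encoding described above. It mirrors struct bpf_insn and the BPF_JMP, BPF_CALL and BPF_PSEUDO_KFUNC_CALL values from include/uapi/linux/bpf.h locally so it compiles without kernel headers. make_kfunc_call() is a hypothetical helper. is_kfunc_call() performs the same opcode-plus-src_reg test the verifier uses to recognize a kfunc call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of struct bpf_insn (include/uapi/linux/bpf.h). */
struct bpf_insn {
	uint8_t code;        /* opcode */
	uint8_t dst_reg : 4; /* destination register */
	uint8_t src_reg : 4; /* source register */
	int16_t off;         /* signed offset */
	int32_t imm;         /* signed immediate constant */
};

/* Opcode pieces and the pseudo src_reg value from the uapi header. */
#define BPF_JMP               0x05
#define BPF_CALL              0x80
#define BPF_PSEUDO_KFUNC_CALL 2

/* Hypothetical helper: build the call instruction for a kfunc call. */
static struct bpf_insn make_kfunc_call(int32_t func_btf_id)
{
	assert(func_btf_id > 0); /* BTF type ID 0 is reserved for void */

	struct bpf_insn insn = {
		.code    = BPF_JMP | BPF_CALL,
		.dst_reg = 0,
		.src_reg = BPF_PSEUDO_KFUNC_CALL,
		.off     = 0,           /* 0 selects vmlinux BTF */
		.imm     = func_btf_id, /* kernel function's BTF ID */
	};
	return insn;
}

/* A kfunc call is a BPF_JMP|BPF_CALL whose src_reg is the kfunc flag. */
static bool is_kfunc_call(const struct bpf_insn *insn)
{
	return insn->code == (BPF_JMP | BPF_CALL) &&
	       insn->src_reg == BPF_PSEUDO_KFUNC_CALL;
}
```

Plain helper calls share the same opcode but use src_reg 0, which is why src_reg alone distinguishes the two call kinds.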
During the verifier's first pass, information about each called kernel function is collected into a struct bpf_kfunc_desc entry and saved in prog->aux->kfunc_tab for use by the later fix-up stage and the JIT.
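The following is a simplified model of prog->aux->kfunc_tab. It assumes only what the text states: one descriptor per distinct kernel function, looked up by BTF ID. The names struct kfunc_desc, kfunc_tab_add(), kfunc_tab_find() and MAX_DESCS are hypothetical, and the kernel's struct bpf_kfunc_desc carries more fields. Keeping the array sorted lets lookups use bsearch(), which is also why the kernel keeps its table sorted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for struct bpf_kfunc_desc. */
struct kfunc_desc {
	uint32_t func_id; /* BTF ID of the kernel function */
	uint64_t addr;    /* resolved kernel address */
};

#define MAX_DESCS 8 /* hypothetical capacity for this sketch */

struct kfunc_tab {
	struct kfunc_desc descs[MAX_DESCS];
	unsigned int nr;
};

static int cmp_desc(const void *a, const void *b)
{
	const struct kfunc_desc *x = a, *y = b;

	return (x->func_id > y->func_id) - (x->func_id < y->func_id);
}

/* Binary search by BTF ID; valid because descs[] is kept sorted. */
static struct kfunc_desc *kfunc_tab_find(struct kfunc_tab *tab, uint32_t func_id)
{
	struct kfunc_desc key = { .func_id = func_id };

	return bsearch(&key, tab->descs, tab->nr, sizeof(key), cmp_desc);
}

/* Record a call target; repeated calls to one function share an entry.
 * Returns 0 on success, -1 if the table is full. */
static int kfunc_tab_add(struct kfunc_tab *tab, uint32_t func_id, uint64_t addr)
{
	if (kfunc_tab_find(tab, func_id))
		return 0;
	if (tab->nr == MAX_DESCS)
		return -1;
	tab->descs[tab->nr++] = (struct kfunc_desc){ .func_id = func_id, .addr = addr };
	qsort(tab->descs, tab->nr, sizeof(tab->descs[0]), cmp_desc);
	assert(kfunc_tab_find(tab, func_id) != NULL);
	return 0;
}
```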
2. Verification
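check_kfunc_call() validates the kernel-function call instruction.
It ensures the kernel function is allowed for the specific BPF program type. btf_check_kfunc_arg_match() then checks, against the function's BTF prototype, that the registers holding the arguments have compatible types. In newer kernels this argument checking lives in check_kfunc_args().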
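The two checks can be sketched as a userspace model. It assumes a per-program-type allowlist of BTF IDs and a drastically reduced notion of register types (only scalars and task pointers). Every name here (enum arg_kind, struct kfunc_proto, struct kfunc_set, check_kfunc_call_model()) is hypothetical. The real verifier tracks far richer register state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified register/argument kinds. */
enum arg_kind { ARG_SCALAR, ARG_PTR_TO_TASK };

struct kfunc_proto {
	uint32_t btf_id;
	size_t nargs;
	enum arg_kind args[5]; /* BPF passes up to five arguments in R1-R5 */
};

/* Allowlist for one program type, standing in for a kfunc ID set. */
struct kfunc_set {
	const struct kfunc_proto *protos;
	size_t nr;
};

static const struct kfunc_proto *set_lookup(const struct kfunc_set *set, uint32_t btf_id)
{
	for (size_t i = 0; i < set->nr; i++)
		if (set->protos[i].btf_id == btf_id)
			return &set->protos[i];
	return NULL;
}

/* True if the kfunc is allowed for this program type and the kinds in
 * R1..Rn match the declared argument kinds. */
static bool check_kfunc_call_model(const struct kfunc_set *allowed, uint32_t btf_id,
				   const enum arg_kind regs[5])
{
	const struct kfunc_proto *proto = set_lookup(allowed, btf_id);

	if (!proto)
		return false; /* not allowed for this program type */
	assert(proto->nargs <= 5);
	for (size_t i = 0; i < proto->nargs; i++)
		if (regs[i] != proto->args[i])
			return false; /* register unusable as argument i */
	return true;
}
```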
3. Address fix‑up
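In the do_misc_fixups() stage, fixup_kfunc_call() rewrites insn->imm from the BTF ID to the kernel function's address. The address is encoded as an offset from __bpf_call_base, because the 32-bit imm field cannot hold a full 64-bit address. A JIT that needs the callee's argument layout can look up its function model (struct btf_func_model) via bpf_jit_find_kfunc_model().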
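The offset arithmetic behind this fix-up can be sketched as follows. The sketch assumes a 64-bit kernel where kfuncs sit within ±2 GiB of __bpf_call_base (on x86-64, kernel text lives in the top 2 GiB of the address space). encode_call_imm() and decode_call_imm() are hypothetical names. The encoding follows the kernel's BPF_CALL_IMM() idea of storing the target minus the base.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Recover the absolute target, as the interpreter or JIT would. */
static uint64_t decode_call_imm(int32_t imm, uint64_t call_base)
{
	return call_base + (uint64_t)(int64_t)imm;
}

/* imm = target - call_base. Returns false if the distance does not
 * fit in the 32-bit imm field. */
static bool encode_call_imm(uint64_t target, uint64_t call_base, int32_t *imm)
{
	int64_t delta = (int64_t)(target - call_base);

	if (delta < INT32_MIN || delta > INT32_MAX)
		return false;
	*imm = (int32_t)delta;
	assert(decode_call_imm(*imm, call_base) == target); /* round trip */
	return true;
}
```

For example, with a base of 0xffffffff81000000, a target 0x1234 bytes above it encodes as imm = 0x1234, and a target 4 GiB below it does not fit.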
2. kfuncs defined in sched_ext
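All sched_ext kfuncs are defined in kernel/sched/ext.c. They are registered through BTF_ID_FLAGS() entries, grouped into sets (between BTF_KFUNCS_START and BTF_KFUNCS_END) according to the callbacks they may be called from. At the time of writing there were 24 of them. Later kernels (from around v6.13) renamed several: scx_bpf_dispatch became scx_bpf_dsq_insert, and scx_bpf_consume became scx_bpf_dsq_move_to_local.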
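/* KF_RCU: pointer arguments must be trusted or RCU-protected */
BTF_KFUNCS_START(scx_kfunc_ids_enqueue_dispatch)
BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_dispatch_vtime, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_enqueue_dispatch)
1. scx_bpf_dispatch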
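Inserts a task into the FIFO of the specified dispatch queue (dsq) and sets its time slice: scx_bpf_dispatch(p, dsq_id, slice, enq_flags). It is typically called from the ops.enqueue callback, as in simple_enqueue() of the scx_simple example scheduler.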
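The following toy model shows the FIFO insertion that scx_bpf_dispatch performs, written as plain userspace C. struct toy_task, struct toy_dsq, toy_dispatch() and toy_pop() are hypothetical. The kernel's dsq has its own list handling and locking, and it also supports vtime ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy task and dispatch queue. */
struct toy_task {
	int pid;
	uint64_t slice; /* time slice granted at dispatch time */
	struct toy_task *next;
};

struct toy_dsq {
	struct toy_task *head, *tail;
	int nr;
};

/* FIFO insertion: record the slice and append at the tail. */
static void toy_dispatch(struct toy_dsq *dsq, struct toy_task *p, uint64_t slice)
{
	p->slice = slice;
	p->next = NULL;
	if (dsq->tail)
		dsq->tail->next = p;
	else
		dsq->head = p;
	dsq->tail = p;
	dsq->nr++;
}

/* Remove and return the head task, or NULL if the queue is empty. */
static struct toy_task *toy_pop(struct toy_dsq *dsq)
{
	struct toy_task *p = dsq->head;

	if (!p)
		return NULL;
	dsq->head = p->next;
	if (!dsq->head)
		dsq->tail = NULL;
	dsq->nr--;
	assert(dsq->nr >= 0);
	return p;
}
```

In a real BPF scheduler, the corresponding step in ops.enqueue would be a call such as scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags).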
2. scx_bpf_dispatch_vtime
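Same as scx_bpf_dispatch, but takes an extra virtual-time argument, scx_bpf_dispatch_vtime(p, dsq_id, slice, vtime, enq_flags), which is stored in p->scx.dsq_vtime. The task is inserted into the dsq's priority queue ordered by that vtime instead of being appended to its FIFO. Schedulers use this to build weighted-fair, CFS-like ordering. vtime ordering is only allowed on user-created dsqs, not on the built-in global and local ones.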
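vtime ordering can be modeled as a sorted insert, plus the weighted charge that decides how fast a task's vtime advances. This is a hypothetical userspace sketch. struct vt_task, vt_dispatch() and vt_charge() are invented names. Plain integer comparison stands in for the wraparound-safe comparison a real implementation needs. The charge follows the style of the scx_simple example: used time scaled by 100 / weight, where 100 is the default weight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vt_task {
	int pid;
	uint64_t vtime; /* stands in for p->scx.dsq_vtime */
	struct vt_task *next;
};

/* Insert keeping ascending vtime order; equal vtimes keep FIFO order. */
static void vt_dispatch(struct vt_task **head, struct vt_task *p, uint64_t vtime)
{
	struct vt_task **pos = head;

	p->vtime = vtime;
	while (*pos && (*pos)->vtime <= vtime)
		pos = &(*pos)->next;
	p->next = *pos;
	*pos = p;
}

/* Heavier tasks (larger weight) accumulate vtime more slowly. */
static uint64_t vt_charge(uint64_t vtime, uint64_t ran_ns, uint32_t weight)
{
	assert(weight > 0);
	return vtime + ran_ns * 100 / weight;
}
```

Because the queue is consumed from the lowest vtime, a task that ran less than its share moves toward the front, which is what gives the CFS-like fairness.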
3. scx_bpf_consume
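Called from ops.dispatch(), scx_bpf_consume(dsq_id) moves the next task (the FIFO head, or the lowest vtime) from the given user-created dsq to the current CPU's local dsq, and returns true if a task was moved. Together with the core dispatch path, it plays the role of pick_next_task. When a CPU's local dsq is empty, sched_ext first tries the global dsq and then invokes ops.dispatch(), where the BPF scheduler refills the local dsq by consuming from its own dsqs or dispatching tasks directly. The local dsq is the per-CPU queue a CPU actually runs tasks from, roughly like a per-CPU CFS run-queue. Shared dsqs such as the global one let tasks migrate between CPUs and so provide load balancing.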
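This pick path can be sketched as a simplified userspace model: the local dsq is tried first, then the global dsq, then ops.dispatch(). In this sketch, ops.dispatch() simply consumes one task from a single user dsq. struct toy_q, toy_consume() and toy_pick_next() are hypothetical. The kernel additionally handles slices, CPU affinity and locking.

```c
#include <assert.h>
#include <stdbool.h>

#define QCAP 16 /* hypothetical capacity */

/* Hypothetical array-backed FIFO of task ids standing in for a dsq. */
struct toy_q {
	int ids[QCAP];
	int head, tail;
};

static bool q_empty(const struct toy_q *q)
{
	return q->head == q->tail;
}

static void q_push(struct toy_q *q, int id)
{
	assert(q->tail < QCAP);
	q->ids[q->tail++] = id;
}

static int q_pop(struct toy_q *q)
{
	assert(!q_empty(q));
	return q->ids[q->head++];
}

/* Model of scx_bpf_consume(): move one task from a user dsq to the
 * local dsq; false if the source dsq was empty. */
static bool toy_consume(struct toy_q *user_dsq, struct toy_q *local)
{
	if (q_empty(user_dsq))
		return false;
	q_push(local, q_pop(user_dsq));
	return true;
}

/* Local dsq first, then the global dsq, then ops.dispatch().
 * Returns the picked task id, or -1 if the CPU should go idle. */
static int toy_pick_next(struct toy_q *local, struct toy_q *global, struct toy_q *user_dsq)
{
	if (q_empty(local) && !q_empty(global))
		q_push(local, q_pop(global));
	if (q_empty(local))
		toy_consume(user_dsq, local); /* stands in for ops.dispatch() */
	return q_empty(local) ? -1 : q_pop(local);
}
```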
4. scx_bpf_create_dsq
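Creates a new dispatch queue with a scheduler-chosen ID, scx_bpf_create_dsq(dsq_id, node), usually from ops.init(). On its own, sched_ext provides only built-in dsqs: one global dsq and one local dsq per CPU. Custom dsqs let a scheduler keep its own shared queues, filled in FIFO order or in vtime order for CFS-style behavior. The counterpart scx_bpf_destroy_dsq deletes a dsq by its ID.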
5. scx_bpf_select_cpu_dfl
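Applies sched_ext's default CPU selection policy. scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle) is called from ops.select_cpu and looks for an idle CPU for the task, taking the task's previous CPU and idle SMT cores into account. It returns the chosen CPU and sets is_idle to report whether that CPU was idle.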
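The following is a deliberately simplified version of that policy. It assumes a fixed set of CPUs with a known idle state. The policy keeps the task on prev_cpu when it is idle, otherwise takes any idle CPU, and otherwise stays on prev_cpu. toy_select_cpu_dfl() and NR_CPUS are hypothetical. The real implementation also considers SMT siblings and cache and NUMA locality.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4 /* hypothetical machine size */

/* Prefer cache-warm prev_cpu if idle, then any idle CPU, else prev_cpu.
 * *is_idle reports whether the returned CPU is idle. */
static int toy_select_cpu_dfl(const bool idle[NR_CPUS], int prev_cpu, bool *is_idle)
{
	assert(prev_cpu >= 0 && prev_cpu < NR_CPUS);
	if (idle[prev_cpu]) {
		*is_idle = true;
		return prev_cpu;
	}
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu]) {
			*is_idle = true;
			return cpu;
		}
	*is_idle = false;
	return prev_cpu;
}
```

In scx_simple, ops.select_cpu calls scx_bpf_select_cpu_dfl() and, when is_idle is true, dispatches the task straight to SCX_DSQ_LOCAL so the idle CPU can run it without going through a shared queue.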
6. scx_bpf_kick_cpu
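scx_bpf_kick_cpu(cpu, flags) kicks the target CPU, sending an inter-processor interrupt (IPI) when the CPU is remote, so that it goes through the scheduler again. With SCX_KICK_IDLE, the CPU is kicked only if it is idle, which wakes it to pick up new work. With SCX_KICK_PREEMPT, the task currently running on it is preempted.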
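The effect of these flags can be modeled as a small decision function. TOY_KICK_IDLE, TOY_KICK_PREEMPT, enum kick_action and toy_kick_cpu() are hypothetical stand-ins, and their values are not taken from the kernel. The sketch captures three behaviors: an idle-only kick does nothing to a busy CPU, a preempting kick ends the running task's turn, and a plain kick just asks the CPU to reschedule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local flag values for this model. */
#define TOY_KICK_IDLE    (1u << 0)
#define TOY_KICK_PREEMPT (1u << 1)

enum kick_action { KICK_NONE, KICK_RESCHED, KICK_PREEMPT_CURR };

/* Decide what a kick does to the target CPU in this simplified model. */
static enum kick_action toy_kick_cpu(bool target_idle, uint32_t flags)
{
	assert((flags & ~(TOY_KICK_IDLE | TOY_KICK_PREEMPT)) == 0);
	if (flags & TOY_KICK_IDLE)
		return target_idle ? KICK_RESCHED : KICK_NONE; /* wake idle CPUs only */
	if (flags & TOY_KICK_PREEMPT)
		return target_idle ? KICK_RESCHED : KICK_PREEMPT_CURR;
	return KICK_RESCHED; /* plain kick: run the scheduler on that CPU */
}
```

A common pattern is to queue a task on a shared dsq and then kick an idle CPU with SCX_KICK_IDLE so the task is picked up promptly.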
7. scx_bpf_dsq_nr_queued
scx_bpf_dsq_nr_queued(dsq_id) returns the number of tasks currently queued on the dsq with the given ID, or a negative error code if no such dsq exists.
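One way a scheduler might use the count is to shrink time slices as a queue grows. This policy, toy_slice_for() and both slice bounds are hypothetical illustrations, not taken from any existing scheduler. The sketch treats a negative value (an error such as an unknown dsq) as an empty queue.

```c
#include <assert.h>
#include <stdint.h>

#define SLICE_MAX_NS 20000000ULL /* 20 ms, hypothetical upper bound */
#define SLICE_MIN_NS   500000ULL /* 0.5 ms, hypothetical lower bound */

/* Split a fixed latency budget among the queued tasks plus the new one,
 * clamped to [SLICE_MIN_NS, SLICE_MAX_NS]. nr_queued is what a count
 * query on the dsq would return. */
static uint64_t toy_slice_for(int32_t nr_queued)
{
	uint64_t slice;

	if (nr_queued < 0) /* error, e.g. unknown dsq: use the max slice */
		nr_queued = 0;
	slice = SLICE_MAX_NS / ((uint64_t)nr_queued + 1);
	if (slice < SLICE_MIN_NS)
		slice = SLICE_MIN_NS;
	assert(slice >= SLICE_MIN_NS && slice <= SLICE_MAX_NS);
	return slice;
}
```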
3. Summary
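sched_ext builds on two foundational BPF mechanisms. Through struct_ops, the kernel invokes the BPF scheduler's callbacks (ops.enqueue, ops.dispatch and so on). Through kfuncs, those callbacks call back into the kernel to queue, move and inspect tasks. Together they make it possible to write high-performance schedulers in BPF, such as scx_lavd, scx_rusty, and scx_bpfland.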
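The two directions of this split can be modeled in plain C. A table of function pointers plays the role of struct_ops, because the "kernel" calls into it. An ordinary function exported by the "kernel" side plays the role of a kfunc, because the callbacks call back into it. All names here are hypothetical. Real struct_ops maps and kfuncs go through the BPF loader and verifier.

```c
#include <assert.h>
#include <stddef.h>

/* State owned by the hypothetical "kernel" side. */
struct toy_kernel {
	int queue[8];
	int nr;
};

/* "kfunc": a kernel-side function the callbacks may call. */
static void kfunc_dispatch(struct toy_kernel *k, int pid)
{
	assert(k->nr < 8);
	k->queue[k->nr++] = pid;
}

/* "struct_ops": the table of callbacks a scheduler supplies. */
struct toy_sched_ops {
	void (*enqueue)(struct toy_kernel *k, int pid);
};

/* Scheduler-side callback: applies its policy, then uses the kfunc. */
static void fifo_enqueue(struct toy_kernel *k, int pid)
{
	kfunc_dispatch(k, pid);
}

static const struct toy_sched_ops fifo_ops = {
	.enqueue = fifo_enqueue,
};

/* Kernel side: a task became runnable, so call the registered callback. */
static void toy_wakeup(struct toy_kernel *k, const struct toy_sched_ops *ops, int pid)
{
	ops->enqueue(k, pid);
}
```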
