How Merging cfs_rq and sched_entity Cuts Linux Scheduler Cache Misses by Up to 31%
Google engineer Zecheng Li's patch series merges the CFS scheduler's cfs_rq and sched_entity structures and switches to a per‑CPU allocator, cutting LLC miss rates by up to 31% (Intel) and raising kernel IPC by up to 25% (AMD), with negligible regression elsewhere.
Abstract
On the Linux 6.15 test kernel, Google engineer Zecheng Li submitted a patch series that co‑locates the cfs_rq (Completely Fair Scheduler run‑queue) and sched_entity structures and replaces the original allocator with a per‑CPU allocator. The changes dramatically reduce pointer indirection and cache misses. In large‑scale cgroup hierarchies, LLC miss rates drop by as much as 31% on Intel hardware and kernel IPC improves by 25% on AMD hardware, while other benchmarks show almost no regression. The patches are slated for inclusion in Linux 6.18.
Background: CFS Scheduler and Data Locality
The Linux Completely Fair Scheduler (CFS) is the default CPU scheduler. Each task_group maintains, per CPU, a struct cfs_rq (run‑queue) that stores scheduling data, and a corresponding struct sched_entity that represents that run‑queue in its parent run‑queue. The original implementation uses two pointer arrays, tg->cfs_rq and tg->se, to manage these structures. Every access requires dereferencing a pointer and then jumping to the target memory location. Although the memory is allocated locally to a NUMA node via kzalloc_node, frequent accesses across CPUs or task groups in deep cgroup hierarchies still cause many LLC (last‑level cache) misses.
Core Improvements: Structure Co‑location + Per‑CPU Allocation
Key optimizations in the patch series:
1. Embed sched_entity inside cfs_rq
Previously, accessing the parent run‑queue's sched_entity required multiple pointer hops, e.g. cfs_rq->tg->se[cpu]. After the change, the sched_entity fields are embedded directly in cfs_rq, so reaching them takes only an offset calculation, greatly reducing indirect pointer accesses. The trade‑off is a modest extra memory cost: the root task_group now also allocates a sched_entity per CPU, which is negligible (CPU count × sizeof(struct sched_entity)).
2. Use a per‑CPU allocator for cfs_rq
In hot paths, the scheduler repeatedly iterates over multiple task_group instances on the same CPU. Allocating each group's cfs_rq with the per‑CPU allocator improves cache affinity and hit rate, and eliminates the memory previously consumed by the pointer arrays.
Memory Layout Before and After
Original layout:

    tg -> cfs_rq pointers -> cfs_rq[]
    tg -> se pointers     -> sched_entity[]

Optimized layout:

    tg percpu offset ->
        [CPU0] cfs_rq + sched_entity
        [CPU1] cfs_rq + sched_entity
        ...

This "structure co‑location" plus per‑CPU layout significantly improves data locality.
Performance Evaluation: Cache Miss Reduction
The author built tree‑shaped cgroup hierarchies (varying width and depth) and ran the schbench workload with an 80% CPU quota and a 10 ms bandwidth period on both Intel and AMD machines. The results (excerpted from the patch description) are:
Kernel LLC misses (M):

                   depth=3 width=10     depth=5 width=4
    AMD original   [2218.98, 2241.89]   [2599.80, 2645.16]
    AMD optimized  [1957.62, 1981.55]   [2380.47, 2431.86]
    Change         -11.69%              -8.24%
    Intel original [1580.53, 1604.90]   [2125.37, 2208.68]
    Intel optimized[1066.94, 1100.19]   [1543.77, 1570.83]
    Change         -31.96%              -28.13%

In addition, kernel IPC on the AMD platform increased by 25%, and the Intel platform saw a 3% gain. Other workloads without CPU‑share limits (e.g., sysbench, hackbench, ebizzy) exhibited virtually no performance regression.
Target Scenarios: Large‑Scale cgroup Hierarchies
The optimization is most beneficial for:
Scheduling and throttling in cgroup hierarchies with thousands of instances.
Hot paths that frequently access scheduler queues and entities (e.g., pick_next_task).
For small‑scale or simple setups the gains may be modest, but in cloud platforms, multi‑tenant scheduling, and container‑quota environments the improved cache hit rate directly translates to lower latency and higher throughput.
Conclusion and Outlook
Zecheng Li’s patch series demonstrates that "structure co‑location + per‑CPU allocation" can cut cache misses in the CFS scheduler’s critical path, delivering noticeable performance improvements with almost no side effects. Future work may explore:
Further layout optimizations of other scheduler data structures.
Adjustments for NUMA‑cross‑node scenarios.
Regression testing to ensure stability across edge cases.
The patch series (v4) is actively discussed on LKML and is expected to merge in Linux 6.18.
Patch and discussion links:
https://patchew.org/linux/[email protected]/
https://www.spinics.net/lists/kernel/msg5835594.html
https://lkml.rescloud.iu.edu/2507.0/01653.html
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
