Optimizing this_cpu_ops on ARM64: Insights from Yang Shi’s LSF/MM/BPF Series
The article explains how per‑CPU data is accessed on x86 versus ARM64, details Yang Shi’s proposal to have this_cpu_*() operate on a pointer that has the same virtual address on every CPU, and presents benchmark results showing up to an 18% reduction in kernel sys time and up to an 8.5% improvement in stress‑ng.
Per‑CPU data access in the Linux kernel
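Each CPU has a dedicated per‑CPU memory region. On x86_64 the gs segment base points to that region, so a this_cpu_*() operation compiles to a single segment‑relative instruction, e.g. mov ax, gs:[x] or inc gs:[x]. Because it is one instruction, it cannot be preempted or interrupted partway through on the local CPU, and it needs no lock prefix.
On architectures without a segment register (e.g., ARM64) the same operation requires disabling preemption, obtaining the current CPU id, adding that CPU's offset from __per_cpu_offset[] to the variable's address, operating on the result, and then re‑enabling preemption. Preemption has to stay off so that the task cannot migrate to another CPU between computing the address and using it. Typical definitions of the helpers involved are: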
#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()
#define per_cpu_ptr(ptr, cpu) ({ __verify_pcpu_ptr(ptr); SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); })
Consequently, accessing a per‑CPU variable x involves code such as:
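int *y;                    /* will point at this CPU's copy of x */
int cpu;
cpu = get_cpu();           /* disables preemption, returns the current CPU id */
y = per_cpu_ptr(&x, cpu);  /* &x shifted by this CPU's offset */
(*y)++;
put_cpu();                 /* re-enables preemption */
Proposed redesign (Yang Shi)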
The proposal changes the this_cpu_*() API so that the per‑CPU allocator returns a pointer that has the same virtual address on every CPU, while the underlying physical page differs per CPU. This removes the need for preempt_disable()/preempt_enable() and the __per_cpu_offset[] lookup.
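The __per_cpu_offset[] lookup that the proposal aims to eliminate can be modeled in user space. The sketch below is only an illustration: every name prefixed with model_ or MODEL_ is hypothetical, the "CPU" is just an index passed in by the caller, and nothing here disables preemption or touches real per‑CPU memory.

```c
#include <stddef.h>

#define MODEL_NR_CPUS   4   /* hypothetical CPU count for this model */
#define MODEL_UNIT_INTS 16  /* hypothetical size of one CPU's area, in ints */

/* Backing store: one contiguous unit per modeled CPU, zero-initialized. */
static int model_area[MODEL_NR_CPUS * MODEL_UNIT_INTS];

/* Plays the role of __per_cpu_offset[]: where each CPU's unit starts. */
static size_t model_per_cpu_offset[MODEL_NR_CPUS];

static void model_percpu_init(void)
{
    for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        model_per_cpu_offset[cpu] = cpu * MODEL_UNIT_INTS;
}

/* Plays the role of per_cpu_ptr(): variable handle plus that CPU's offset. */
static int *model_per_cpu_ptr(size_t handle, size_t cpu)
{
    return &model_area[model_per_cpu_offset[cpu] + handle];
}

/*
 * Increment the copy that belongs to `cpu` and return the new value.
 * In the kernel this would sit between get_cpu() and put_cpu() so the
 * task cannot migrate after the address has been computed.
 */
static int model_this_cpu_inc(size_t handle, size_t cpu)
{
    int *y = model_per_cpu_ptr(handle, cpu);
    return ++(*y);
}
```

After model_percpu_init(), calling model_this_cpu_inc(0, 1) twice changes only CPU 1's copy; the copies for the other modeled CPUs stay at zero. The address of each copy depends on the CPU index, and that dependency is what the uniform‑virtual‑address design is meant to remove.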
Implementing this requires each CPU to have its own kernel page table. The tables share the majority of their entries, but each contains a distinct local mapping for the per‑CPU data, so the same virtual address resolves to a different physical page on each CPU. The diagrams in the original article illustrate this for CPUs 0‑3.
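A user‑space analogy may help make "same virtual address, different physical page" concrete. Two processes have separate page tables, much as the proposal gives each CPU its own. The sketch below is not the patchset's mechanism: it assumes a Linux system, relies on fork() and copy‑on‑write for a private anonymous mapping, and the function name same_va_different_backing is hypothetical.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if parent and child held different values at the same
 * virtual address, 0 if they did not, and -1 on error.
 */
static int same_va_different_backing(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int *slot = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (slot == MAP_FAILED)
        return -1;

    *slot = 1;                      /* value seen through the parent's page table */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(slot, len);
        return -1;
    }
    if (pid == 0) {
        /* The write triggers copy-on-write: the child's page table now
         * maps the same virtual address to a different physical page. */
        *slot = 2;
        _exit(*slot == 2 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(slot, len);
        return -1;
    }

    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_unchanged = (*slot == 1);  /* parent's page was not touched */
    munmap(slot, len);
    return child_ok && parent_unchanged;
}
```

The pointer value slot is identical in both processes, yet each one reads its own data through it. In the proposed design the role of the process is played by the CPU: code dereferences one address without computing an offset, and the per‑CPU page table decides which physical page answers.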
Performance evaluation
An RFC patchset was applied to a 160‑core AmpereOne system (kernel 7.1‑rc1, 4 KB pages). Measurements show:
13 %–18 % reduction in kernel‑mode sys time.
3 %–7 % reduction in overall wall‑clock time.
Running stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000 yields an additional 6 %–8.5 % improvement.
Build command used:
make -j160
References
Yang Shi slides: https://lore.kernel.org/linux-mm/[email protected]/2-percpu_LSF2026.pdf
RFC patchset: https://lore.kernel.org/linux-mm/[email protected]/