
Optimizing this_cpu_ops on ARM64: Insights from Yang Shi’s LSF/MM/BPF Series

The article explains how per‑CPU data is accessed on x86 versus ARM64, details Yang Shi’s proposal to have the per‑CPU allocator return the same virtual address on every CPU for use by the this_cpu_*() APIs, and presents benchmark results showing up to an 18% reduction in kernel sys time and up to an 8.5% improvement with stress‑ng.

Linux Kernel Journey

May 13, 2026

Per‑CPU data access in the Linux kernel

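Each CPU has a dedicated per‑CPU memory region. On x86_64 the gs segment base points to the current CPU's region, so a this_cpu_*() operation compiles to a single instruction with a gs‑relative operand, e.g. mov eax, gs:[x] or inc gs:[x]. Because it is a single instruction, it cannot be split by preemption or an interrupt on that CPU, and it needs no lock prefix.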

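On architectures without such a segment register (e.g., ARM64) the same operation is a multi‑step sequence: disable preemption, obtain the current CPU id, add that CPU's offset from __per_cpu_offset[] to the variable's address, perform the access, and re‑enable preemption. Preemption has to stay disabled throughout so the task cannot migrate to another CPU between computing the address and using it. The generic helpers are defined as: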

#define get_cpu() ({ preempt_disable(); __smp_processor_id(); })
#define put_cpu() preempt_enable()
#define per_cpu_ptr(ptr, cpu) ({ __verify_pcpu_ptr(ptr); SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); })

Consequently, accessing a per‑CPU variable involves code such as:

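/* x is a per-CPU variable, e.g. declared with DEFINE_PER_CPU(int, x) */
int *y;
int cpu;

cpu = get_cpu();          /* disable preemption, return this CPU's id */
y = per_cpu_ptr(&x, cpu); /* address of this CPU's copy of x */
(*y)++;
put_cpu();                /* re-enable preemption */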
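The offset arithmetic behind that sequence can be modelled in user space. The sketch below is only an illustration, not kernel code: every `model_*` name, `x_template`, `x_copies` and the CPU count of 4 are assumptions made for the example. Preemption is reduced to a counter, and the "current CPU" is set by hand.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4   /* hypothetical CPU count for this model */

/* Stand-in for a variable in the per-CPU section; never written directly. */
static int x_template;

/* One private copy of x per simulated CPU. */
static int x_copies[MODEL_NR_CPUS];

/* Models __per_cpu_offset[]: distance from the template to each CPU's copy. */
static uintptr_t model_per_cpu_offset[MODEL_NR_CPUS];

static int model_preempt_count;   /* models the preemption-disable nesting */
static int model_current_cpu;     /* CPU the simulated task is running on */

void model_setup(void)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        model_per_cpu_offset[cpu] =
            (uintptr_t)&x_copies[cpu] - (uintptr_t)&x_template;
}

int model_get_cpu(void)
{
    model_preempt_count++;        /* stands in for preempt_disable() */
    return model_current_cpu;     /* stands in for __smp_processor_id() */
}

void model_put_cpu(void)
{
    assert(model_preempt_count > 0);
    model_preempt_count--;        /* stands in for preempt_enable() */
}

/* Models SHIFT_PERCPU_PTR(): base address plus the chosen CPU's offset. */
int *model_per_cpu_ptr(int *ptr, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return (int *)((uintptr_t)ptr + model_per_cpu_offset[cpu]);
}

/* The increment sequence from the text, run on the given simulated CPU. */
void model_inc_x_on(int cpu)
{
    model_current_cpu = cpu;
    int id = model_get_cpu();
    int *y = model_per_cpu_ptr(&x_template, id);
    (*y)++;
    model_put_cpu();
}
```

After `model_setup()`, calling `model_inc_x_on(2)` changes only `x_copies[2]`. The model makes the cost visible: each access pays for a counter update on both sides plus a table lookup and an addition.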

Proposed redesign (Yang Shi)

The proposal changes the this_cpu_*() API so that the per‑CPU allocator returns a pointer that has the same virtual address on every CPU, while the underlying physical page differs per CPU. This removes the need for preempt_disable()/preempt_enable() and the __per_cpu_offset[] lookup.

Implementation requires each CPU to have its own kernel page table that shares the majority of entries but contains a distinct local mapping for the per‑CPU data. The same virtual address therefore maps to different physical pages on CPUs 0‑3, as illustrated in the diagrams.
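The per‑CPU page table idea can be illustrated with a toy translation model. This is a user‑space sketch under stated assumptions, not the patchset's implementation: all `toy_*` and `TOY_*` names and sizes are hypothetical, and the `running_cpu` argument stands in for the hardware walking whichever page table is active on that CPU.

```c
#include <assert.h>

#define TOY_NR_CPUS    4    /* hypothetical CPU count */
#define TOY_NR_VPAGES  8    /* virtual pages in the toy address space */
#define TOY_PCPU_VPAGE 5    /* the one virtual page holding per-CPU data */
#define TOY_PAGE_INTS  16   /* ints per toy "page" */

/* Toy physical memory: frames shared by all CPUs, plus one private frame each. */
static int toy_shared_frames[TOY_NR_VPAGES][TOY_PAGE_INTS];
static int toy_local_frames[TOY_NR_CPUS][TOY_PAGE_INTS];

/* One page table per CPU: virtual page number -> start of the backing frame. */
static int *toy_page_table[TOY_NR_CPUS][TOY_NR_VPAGES];

void toy_build_page_tables(void)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        for (int vpn = 0; vpn < TOY_NR_VPAGES; vpn++)
            toy_page_table[cpu][vpn] = (vpn == TOY_PCPU_VPAGE)
                ? toy_local_frames[cpu]      /* distinct local mapping */
                : toy_shared_frames[vpn];    /* entry shared by every CPU */
}

/* Translate the toy virtual address (vpn, idx) through one CPU's page table. */
int *toy_translate(int cpu, int vpn, int idx)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    assert(vpn >= 0 && vpn < TOY_NR_VPAGES);
    assert(idx >= 0 && idx < TOY_PAGE_INTS);
    return &toy_page_table[cpu][vpn][idx];
}

/*
 * Analogue of an increment on per-CPU data: the virtual address is the same
 * for every caller, and no offset table or preemption counter is involved.
 */
void toy_this_cpu_inc(int running_cpu, int idx)
{
    (*toy_translate(running_cpu, TOY_PCPU_VPAGE, idx))++;
}
```

In this model `toy_translate(0, 2, 0)` and `toy_translate(3, 2, 0)` return the same location, while translating `TOY_PCPU_VPAGE` on CPU 0 and CPU 3 yields two different frames. That mirrors the "same virtual address, different physical page" property described above.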

Performance evaluation

An RFC patchset was applied to a 160‑core AmpereOne system (kernel 7.1‑rc1, 4 KB pages). Measurements show:

13%–18% reduction in kernel‑mode sys time.

3%–7% reduction in overall wall‑clock time.

Running stress-ng --vm 160 --vm-bytes 128M --vm-ops 100000000 yields an additional 6%–8.5% improvement.

Build command used:

make -j160
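For readers reproducing such measurements, a percentage reduction of this kind is conventionally computed from a baseline and a patched timing as follows. The helper is a hypothetical sketch and `percent_reduction` is not a name from the patchset or the slides.

```c
#include <assert.h>

/* Percentage by which `patched` is lower than `baseline` (same units). */
double percent_reduction(double baseline, double patched)
{
    assert(baseline > 0.0);
    return 100.0 * (baseline - patched) / baseline;
}
```

With made‑up timings of 100 s before and 82 s after, the function reports an 18% reduction. These values are illustrative only and are not taken from the benchmark.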

References

Yang Shi slides: https://lore.kernel.org/linux-mm/[email protected]/2-percpu_LSF2026.pdf

RFC patchset: https://lore.kernel.org/linux-mm/[email protected]/


Written by

Linux Kernel Journey
