
Master CPU & Memory Subsystem Tuning on Kunpeng Processors: Tools & Strategies

This article introduces practical CPU and memory subsystem performance tuning for Kunpeng processors, covering optimization concepts, key parameters, common monitoring tools such as top, perf and numactl, and detailed methods like NUMA binding, prefetch control, timer tuning, TLB page size adjustment, and thread concurrency optimization.

Huawei Cloud Developer Alliance

CPU & Memory Subsystem Performance Tuning

The Special Forces team, established at the end of June, has been sharing a series of articles to help developers on Kunpeng processors with software development and performance optimization. This part focuses on CPU and memory subsystem tuning.

1. Overview of CPU & Memory Subsystem Tuning

Optimization ideas

If CPU utilization is low, use tools like strace to locate blocking I/O, network, or sleep points.

If CPU utilization is high, adjust software and hardware configuration parameters to better fit the workload and reduce CPU usage.

Choose appropriate memory modules: populate every memory channel and use the highest-frequency DIMMs the platform supports. A Kunpeng 920 processor has 8 memory channels (16 in a two-socket server) and supports DIMM speeds up to 2933 MHz.
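To see why leaving channels empty costs bandwidth, the theoretical peak can be estimated as channels × transfer rate × bytes per transfer. The sketch below is only an illustration: it assumes a 64-bit (8-byte) data bus per DDR4 channel and treats the "2933 MHz" rating as 2933 mega-transfers per second. The function name peak_bw_mbs is made up for this example, and real sustained bandwidth will be lower.

```shell
# Theoretical peak DRAM bandwidth in MB/s for one socket.
# Assumption: each channel moves 8 bytes (64 bits) per transfer.
peak_bw_mbs() {
    channels=$1
    mega_transfers=$2
    bytes_per_transfer=8
    echo $(( channels * mega_transfers * bytes_per_transfer ))
}

peak_bw_mbs 8 2933   # all 8 channels populated: prints 187712
peak_bw_mbs 4 2933   # only 4 channels populated: prints 93856
```

Under these assumptions, populating half of the channels halves the theoretical peak, which is the reason for the full-channel recommendation.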

Main Tuning Parameters

2. Common Performance Monitoring Tools

2.1 top

Introduction: top is a widely used Linux performance monitoring tool that shows per-process and overall system performance.

top: View overall CPU and memory consumption.

top, then press 1: Show per-CPU core usage.

top, then press F and select the P column (last-used CPU): Check whether threads are being migrated to other CPU cores.

top -p $PID -H: Show CPU usage of all threads of a specific process.

Installation: the tool is built into the system; no extra installation required.

Key fields in top output:

us : User‑mode CPU time percentage.

sy : Kernel‑mode CPU time percentage.

wa : I/O wait percentage.

hi : Hardware interrupt percentage.

si : Software interrupt percentage.

Memory fields:

KiB Mem : Total and used memory.

KiB Swap : Swap usage; if non‑zero, consider reducing memory consumption or adding RAM.
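The swap check can also be scripted. The following is a minimal sketch that computes swap in use from /proc/meminfo-style text (SwapTotal minus SwapFree); the helper name swap_used_kib is an assumption made for this example, not part of top.

```shell
# Print swap in use (KiB) from /proc/meminfo-formatted text on stdin.
swap_used_kib() {
    awk '/^SwapTotal:/ { t = $2 }
         /^SwapFree:/  { f = $2 }
         END           { print t - f }'
}

# On a Linux host, report the current value.
if [ -r /proc/meminfo ]; then
    echo "swap used: $(swap_used_kib < /proc/meminfo) KiB"
fi
```

A result that stays above zero under load is the same signal as a non-zero "used" value on top's KiB Swap line.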

2.2 perf

Introduction: perf is a powerful Linux profiling tool for capturing call stacks, resource consumption, and hotspot functions.

perf top: Show current hotspot functions.

perf sched record -- sleep 1: Record scheduling events system-wide for 1 second.

perf sched latency --sort max: Analyze the recorded data, sorted by maximum scheduling latency.

Installation (CentOS): # yum -y install perf

Usage example: run perf top to find hotspot functions, then use perf sched record and perf sched latency to analyze scheduling delays.

2.3 numactl

Introduction: numactl displays the NUMA node configuration and can bind processes to specific CPU cores.

numactl -H: Show the current NUMA configuration.

numactl -C 0-7 ./test: Bind the program test to cores 0-7.

numastat: Show NUMA runtime statistics.

Installation (CentOS): # yum -y install numactl numastat

Typical workflow: check the NUMA layout, bind processes with numactl -C, and monitor per-node hit and miss counts with numastat. numa_hit counts memory allocated on the node the process intended; numa_miss counts memory allocated on a node even though the process preferred a different one. A growing numa_miss usually means more remote memory accesses and should be minimized.
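As a rough way to track the miss rate over time, the numastat table can be reduced to one percentage per node. This sketch assumes the default numastat layout (a header row of node names followed by rows such as numa_hit and numa_miss); the function name numa_miss_pct is hypothetical.

```shell
# Read default `numastat` output on stdin and print, for each node,
# numa_miss as a percentage of numa_hit + numa_miss.
numa_miss_pct() {
    awk '
        $1 ~ /^node[0-9]+$/ { n = NF; for (i = 1; i <= NF; i++) name[i] = $i }
        $1 == "numa_hit"    { for (i = 2; i <= NF; i++) hit[i - 1] = $i }
        $1 == "numa_miss"   { for (i = 2; i <= NF; i++) miss[i - 1] = $i }
        END {
            for (i = 1; i <= n; i++) {
                total = hit[i] + miss[i]
                pct = (total > 0) ? 100 * miss[i] / total : 0
                printf "%s %.2f%%\n", name[i], pct
            }
        }'
}

# Only runs where numastat is installed.
if command -v numastat >/dev/null 2>&1; then
    numastat | numa_miss_pct
fi
```

The counters are cumulative since boot, so comparing two samples taken before and after a test run is more informative than a single reading.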

3. Optimization Methods

3.1 NUMA Optimization

Reduce cross-NUMA memory accesses by pinning threads to the cores of a single NUMA node, either from the command line (e.g., numactl -C 28-31 ./test) or in code with sched_setaffinity. Open-source services such as Nginx support CPU affinity through configuration (the worker_cpu_affinity directive).
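Besides pinning cores with -C, numactl can bind by node and restrict memory allocation to the same node. The helper below is a small sketch that only prints the command line so it can be reviewed first; numa_bind_cmd is a made-up name, ./test is the placeholder program used in this article, and node 0 is an arbitrary choice.

```shell
# Print a numactl command line that binds both the CPUs and the
# memory allocations of a program to one NUMA node.
numa_bind_cmd() {
    node=$1
    shift
    echo "numactl --cpunodebind=$node --membind=$node $*"
}

numa_bind_cmd 0 ./test   # prints: numactl --cpunodebind=0 --membind=0 ./test
```

Note that --membind makes allocations fail rather than spill to another node when the bound node runs out of memory, so it suits workloads whose footprint is known to fit.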

3.2 CPU Prefetch Switch

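Enable or disable CPU prefetch according to how well the workload uses prefetched data. Enable it when data access is concentrated and prefetched lines are likely to be used (e.g., SPEC CPU, x265); disable it when they rarely are and the extra traffic only consumes memory bandwidth (reported cases include STREAM, Nginx, and databases). The switch is a BIOS setting.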

3.3 Timer Mechanism Adjustment

Enable the kernel's nohz feature so that idle CPUs stop taking periodic timer interrupts. Check the boot parameters with cat /proc/cmdline; if nohz=off is present, remove it from the boot loader configuration and reboot.
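The check is easy to automate across many hosts. A minimal sketch follows; has_nohz_off is a hypothetical helper that matches nohz=off only as a whole parameter.

```shell
# Return success (0) if the given kernel command line contains
# nohz=off as a separate, space-delimited parameter.
has_nohz_off() {
    case " $1 " in
        *" nohz=off "*) return 0 ;;
        *)              return 1 ;;
    esac
}

if [ -r /proc/cmdline ]; then
    if has_nohz_off "$(cat /proc/cmdline)"; then
        echo "nohz=off is set: remove it from the boot parameters and reboot"
    else
        echo "nohz=off is not set"
    fi
fi
```

On CentOS, one way to remove the parameter is grubby --update-kernel=ALL --remove-args="nohz=off"; other distributions may require editing the GRUB configuration directly.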

3.4 Increase Memory Page Size to 64 KB

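Larger pages let each TLB entry cover more memory, which improves the TLB hit rate. Rebuild the kernel with 64 KB pages (make menuconfig → Kernel Features → Page size (64KB), which sets CONFIG_ARM64_64K_PAGES), install it, and reboot. Afterwards, getconf PAGESIZE should report 65536.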

Check TLB miss rates with perf stat -p $PID -d -d -d.
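The benefit can be put in numbers as "TLB reach": the amount of memory that can be addressed without a TLB miss, which is the number of entries times the page size. The sketch below uses a hypothetical entry count of 1024 purely for illustration; it is not a Kunpeng specification, and tlb_reach_kib is a made-up helper.

```shell
# TLB reach in KiB = number of TLB entries x page size in KiB.
tlb_reach_kib() {
    entries=$1
    page_kib=$2
    echo $(( entries * page_kib ))
}

ENTRIES=1024   # hypothetical TLB size, for illustration only
echo "4 KiB pages:  $(tlb_reach_kib "$ENTRIES" 4) KiB reach"
echo "64 KiB pages: $(tlb_reach_kib "$ENTRIES" 64) KiB reach"

# Show the page size of the running kernel, where getconf is available.
if command -v getconf >/dev/null 2>&1; then
    echo "current page size: $(getconf PAGESIZE) bytes"
fi
```

With the same number of entries, 64 KiB pages cover 16 times as much memory as 4 KiB pages, at the cost of more memory wasted on small allocations.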

3.5 Adjust Thread Concurrency

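Find the optimal number of threads for a given workload: throughput usually stops scaling linearly well before the thread count reaches its maximum, and it can fall once contention and context switching dominate. Example parameters to adjust: innodb_thread_concurrency for MySQL, worker_processes for Nginx.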
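In practice this means running the same benchmark at several thread counts and keeping the best one. The sketch below picks the winner from "threads throughput" pairs; the helper name best_thread_count and the sample figures are invented for illustration and are not measurements.

```shell
# Read "threads throughput" pairs on stdin and print the thread
# count with the highest throughput (first one wins on a tie).
best_thread_count() {
    awk 'NF >= 2 && (!seen || $2 + 0 > best) {
             best = $2 + 0; threads = $1; seen = 1
         }
         END { print threads }'
}

# Hypothetical benchmark results: throughput peaks, then declines.
best_thread_count <<'EOF'
8 1200
16 2300
32 3900
64 4100
128 3600
EOF
```

For this sample data the function prints 64; the resulting value would then be applied through the service's own setting, such as the parameters named above.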
