Master CPU & Memory Subsystem Tuning on Kunpeng Processors: Tools & Strategies
This article introduces practical CPU and memory subsystem performance tuning for Kunpeng processors, covering optimization concepts, key parameters, common monitoring tools such as top, perf and numactl, and detailed methods like NUMA binding, prefetch control, timer tuning, TLB page size adjustment, and thread concurrency optimization.
CPU & Memory Subsystem Performance Tuning
The Special Forces team, established at the end of June, has been sharing a series of articles to help developers on Kunpeng processors with software development and performance optimization. This part focuses on CPU and memory subsystem tuning.
1. Overview of CPU & Memory Subsystem Tuning
Optimization ideas
If CPU utilization is low, use tools like strace to locate blocking I/O, network, or sleep points.
If CPU utilization is high, adjust software and hardware configuration parameters to better fit the workload and reduce CPU usage.
Choose appropriate memory modules: populate every memory channel and use high‑frequency DIMMs. A Kunpeng 920 processor has 8 memory channels (16 in a two‑socket server) and supports DDR4 at up to 2933 MT/s.
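To see why populating every channel matters, the theoretical peak bandwidth can be estimated from the channel count and transfer rate. The sketch below is only a rough model: it treats the 2933 figure as mega-transfers per second and assumes the standard 64-bit (8-byte) DDR4 data bus per channel. The function name is illustrative, and real sustained bandwidth will be lower than this ceiling.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10^9 bytes).

    bus_bytes=8 assumes a 64-bit data bus per channel (standard for DDR4).
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


one_socket = peak_bandwidth_gbs(8, 2933)       # all 8 channels populated
two_sockets = peak_bandwidth_gbs(16, 2933)     # two CPUs, 16 channels
half_populated = peak_bandwidth_gbs(4, 2933)   # only 4 of 8 channels populated

print(f"1 socket, 8 channels : {one_socket:.1f} GB/s")      # about 187.7 GB/s
print(f"2 sockets, 16 channels: {two_sockets:.1f} GB/s")
print(f"1 socket, 4 channels : {half_populated:.1f} GB/s")
```

Halving the populated channels halves the ceiling, which is why a full‑channel configuration is recommended before any software tuning.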
Main Tuning Parameters
2. Common Performance Monitoring Tools
2.1 top
Introduction: top is a widely used Linux performance monitoring tool that shows per‑process and system‑wide resource usage.
top: View overall CPU and memory consumption.
top, then press 1: Show usage per CPU core.
top, then press F and select P (last used CPU): Show which core each task last ran on, to check whether threads migrate between cores.
top -p $PID -H: Show CPU usage of every thread of a specific process.
Installation: the tool is built into the system; no extra installation required.
Key fields in top output:
us : User‑mode CPU time percentage.
sy : Kernel‑mode CPU time percentage.
wa : I/O wait percentage.
hi : Hardware interrupt percentage.
si : Software interrupt percentage.
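The CPU fields above correspond to the cumulative counters on the first line of /proc/stat, and a percentage is the share of each counter's increase between two samples. The following is a minimal sketch of that calculation; the two sample lines are hypothetical values made up for illustration, and the helper names are not part of any tool.

```python
# Order of the first eight counters on the "cpu" line of /proc/stat,
# labelled with the abbreviations top uses.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    values += [0] * (len(FIELDS) - len(values))  # tolerate missing trailing fields
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Share of each state in the interval between two samples, in percent."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


# Hypothetical samples taken some time apart (not real measurements).
sample_before = "cpu  1000 0 500 8000 400 50 50 0 0 0"
sample_after = "cpu  1600 0 700 8500 600 50 50 0 0 0"

pct = cpu_percentages(parse_cpu_line(sample_before), parse_cpu_line(sample_after))
print({k: round(v, 1) for k, v in pct.items()})
```

In this made‑up interval the wa share is noticeable next to us, which is the pattern that would point toward I/O rather than computation as the bottleneck.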
Memory fields:
KiB Mem : Total and used memory.
KiB Swap : Swap usage; if non‑zero, consider reducing memory consumption or adding RAM.
2.2 perf
Introduction: perf is a powerful Linux profiling tool for capturing call stacks, resource consumption, and hotspot functions.
perf top: Show current hotspot functions.
perf sched record -p $PID -- sleep 1: Record scheduler events of a process for 1 second. Options must come before the --, otherwise -p is passed to sleep.
perf sched latency --sort max: Analyze the recorded data, sorted by maximum scheduling latency.
Installation (CentOS): # yum -y install perf
Usage example: run perf top to find hotspot functions, then use perf sched record and perf sched latency to analyze scheduling delays.
2.3 numactl
Introduction: numactl displays the NUMA node configuration and can bind processes to specific CPU cores.
numactl -H: Show current NUMA configuration.
numactl -C 0-7 ./test: Bind the program test to cores 0‑7.
numastat: Show NUMA runtime statistics.
Installation (CentOS): # yum -y install numactl numastat
Typical workflow: check the NUMA layout, bind processes with numactl -C, and monitor per‑node counters with numastat. numa_hit counts pages allocated on the node the process asked for; numa_miss counts pages allocated on a node although the process preferred a different one. A growing numa_miss usually means more remote accesses, so keep it as low as possible.
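The same counters that numastat prints are exposed per node under /sys/devices/system/node/nodeN/numastat, so a miss ratio can be tracked from a script. The sketch below is one possible way to do it; the helper names are illustrative and the sample text contains hypothetical numbers, not measurements from a Kunpeng server.

```python
from pathlib import Path


def parse_numastat(text: str) -> dict:
    """Parse 'name value' lines as found in a per-node numastat file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def miss_ratio(counters: dict) -> float:
    """numa_miss / (numa_hit + numa_miss); 0.0 when no allocations were counted."""
    hit = counters.get("numa_hit", 0)
    miss = counters.get("numa_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0


def live_miss_ratios(root: str = "/sys/devices/system/node") -> dict:
    """Miss ratio per NUMA node on the running system (empty dict if none found)."""
    ratios = {}
    for stat_file in sorted(Path(root).glob("node[0-9]*/numastat")):
        ratios[stat_file.parent.name] = miss_ratio(parse_numastat(stat_file.read_text()))
    return ratios


# Hypothetical counter values for one node, for illustration only.
sample = """numa_hit 9000
numa_miss 1000
numa_foreign 0
interleave_hit 120
local_node 8900
other_node 1100
"""
print(f"sample miss ratio: {miss_ratio(parse_numastat(sample)):.1%}")
```

Because the counters are cumulative since boot, comparing two readings taken before and after a test run gives a clearer picture than a single absolute value.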
3. Optimization Methods
3.1 NUMA Optimization
Reduce cross‑NUMA memory accesses by setting thread CPU affinity, either from the command line (e.g., numactl -C 28-31 ./test) or in code with sched_setaffinity. Open‑source services such as Nginx support affinity through their configuration (worker_cpu_affinity).
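For the in‑code route, Python exposes the same system call as os.sched_setaffinity on Linux. The following is a minimal sketch, assuming a core list in the same "28-31" style that numactl accepts; parse_cpu_list and pin_current_process are hypothetical helper names introduced only for this example.

```python
import os


def parse_cpu_list(spec: str) -> set:
    """Turn a list such as '0,2-3' or '28-31' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
            if lo > hi:
                raise ValueError(f"invalid range: {part}")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec: str) -> set:
    """Restrict the calling process to the listed CPUs (Linux only).

    CPUs outside the current allowed set are ignored; an error is raised
    if nothing usable remains.
    """
    wanted = parse_cpu_list(spec)
    usable = wanted & os.sched_getaffinity(0)
    if not usable:
        raise ValueError(f"none of {sorted(wanted)} is available to this process")
    os.sched_setaffinity(0, usable)
    return os.sched_getaffinity(0)


print(sorted(parse_cpu_list("28-31")))  # [28, 29, 30, 31]
```

Calling pin_current_process("28-31") early, before the process allocates most of its memory, matters because Linux by default places pages on the node of the CPU that first touches them.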
3.2 CPU Prefetch Switch
Enable or disable CPU prefetch according to how predictable the workload's memory accesses are. Enable it where data accesses are concentrated and the prefetch hit rate is high (e.g., SPEC CPU, x265). Disable it where prefetched data is rarely used; STREAM, Nginx, and databases are cited as such cases. The switch is set in the BIOS.
3.3 Timer Mechanism Adjustment
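Reduce unnecessary timer interrupts with the kernel's nohz (tickless) feature, which stops the periodic tick on idle CPUs. Check the boot parameters with cat /proc/cmdline; if nohz=off is present, remove it from the boot loader's kernel command line and reboot.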
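The check on /proc/cmdline is easy to script across many hosts. The sketch below looks for the exact nohz=off token so that related parameters such as nohz_full are not mistaken for it; the function name and the sample command lines are illustrative only.

```python
def nohz_disabled(cmdline: str) -> bool:
    """True if the kernel command line contains the exact token nohz=off."""
    return "nohz=off" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the running kernel's command line, or '' if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    except OSError:
        return ""


# Hypothetical command lines for illustration.
print(nohz_disabled("BOOT_IMAGE=/vmlinuz root=/dev/sda1 nohz=off quiet"))  # True
print(nohz_disabled("BOOT_IMAGE=/vmlinuz root=/dev/sda1 nohz_full=1-3"))   # False

if nohz_disabled(read_cmdline()):
    print("nohz=off is set on this host; remove it from the boot parameters")
```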
3.4 Increase Memory Page Size to 64 KB
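Larger pages let each TLB entry cover more memory, which improves the TLB hit rate. Rebuild the kernel with 64 KB pages (make menuconfig → Kernel Features → Page size (64KB), i.e. CONFIG_ARM64_64K_PAGES), install it, and reboot; getconf PAGESIZE should then report 65536.
Check TLB miss rates before and after the change with perf stat -p $PID -d -d -d.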
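The effect of page size on the TLB can be put in numbers as "TLB reach": the amount of memory that can be mapped without a miss, which is simply entries multiplied by page size. The sketch below uses a hypothetical 1024‑entry TLB purely for illustration (it is not a Kunpeng 920 specification) and also prints the page size of the running system.

```python
import mmap


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered by a TLB with the given number of entries."""
    return entries * page_size


KIB = 1024
MIB = 1024 * KIB
ASSUMED_ENTRIES = 1024  # hypothetical TLB size, chosen only to show the ratio

for page_size in (4 * KIB, 64 * KIB):
    reach = tlb_reach_bytes(ASSUMED_ENTRIES, page_size)
    print(f"{page_size // KIB:>2} KB pages -> reach {reach // MIB} MiB")

# Page size of the system this script runs on (65536 on a 64 KB-page kernel).
print(f"current page size: {mmap.PAGESIZE} bytes")
```

With the same number of entries, 64 KB pages cover 16 times as much memory as 4 KB pages, which is where the improved hit rate comes from. The trade‑off is coarser allocation granularity, so memory consumption can rise for workloads with many small mappings.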
3.5 Adjust Thread Concurrency
Find the optimal number of threads for each workload: throughput typically rises with thread count up to a point, then flattens or drops as lock contention and context switching grow. Example settings: innodb_thread_concurrency for MySQL, worker_processes for Nginx.
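In practice this means benchmarking at several thread counts and choosing from the results. One reasonable policy, sketched below, is to take the smallest thread count whose throughput is within a small tolerance of the best observed value, since extra threads beyond that point add overhead without a measurable gain. The function name, the 5% tolerance, and the measurements are all hypothetical.

```python
def pick_thread_count(throughput: dict, tolerance: float = 0.05) -> int:
    """Smallest thread count whose throughput is within `tolerance` of the peak.

    throughput maps thread count -> measured throughput (higher is better).
    """
    if not throughput:
        raise ValueError("no measurements supplied")
    peak = max(throughput.values())
    good_enough = [n for n, value in throughput.items() if value >= peak * (1 - tolerance)]
    return min(good_enough)


# Hypothetical benchmark results (thread count -> requests per second).
measurements = {8: 21000.0, 16: 39000.0, 32: 61000.0, 64: 63000.0, 128: 52000.0}

print(pick_thread_count(measurements))                 # 32: within 5% of the peak at 64
print(pick_thread_count(measurements, tolerance=0.0))  # 64: strict maximum
```

The chosen value would then be applied through the service's own setting, such as innodb_thread_concurrency or worker_processes, and re‑validated under production‑like load.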
Huawei Cloud Developer Alliance
The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.