Unlocking Kunpeng CPU Performance: Real-World Optimization Techniques and Benchmarks
This article provides a comprehensive, step‑by‑step guide to tuning Kunpeng‑based servers, covering hardware characteristics, matrix‑multiplication benchmarks, NUMA‑aware scheduling, compiler and JDK optimizations, acceleration libraries, disk and NIC tuning, and a practical MariaDB performance‑tuning workflow.
Why Kunpeng Performance Tuning Matters
Kunpeng‑based TaiShan servers rank among the top domestic servers due to high compute efficiency and a growing ecosystem of native applications and community support. Understanding both hardware traits and software‑level optimizations is essential for extracting maximum performance.
Matrix‑Multiplication Case Study
Multiplying two 4800×4800 matrices illustrates the impact of different implementations:
Pure Python: ~61,162 s
Single‑threaded C: ~757 s
C with multithreading: ~47 s
C + cache‑aware tuning: ~6.02 s
Kunpeng NEON vector instructions: ~1.99 s
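The NEON version is more than 30,000× faster than pure Python (61,162 s versus 1.99 s), and vectorization alone contributes about 3× over the cache‑tuned C version (6.02 s versus 1.99 s).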
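The article does not show the code behind the "cache‑aware tuning" step. As a rough illustration only, the sketch below uses loop tiling (blocking), one common cache-aware technique; it is an assumption about what such tuning might look like, not the benchmarked implementation. The names matmul_naive, matmul_tiled, and TILE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; in practice it is tuned so the working tiles fit in cache. */
#define TILE 64

/* Reference version: C = A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled version: same arithmetic, but each TILE x TILE block of B is reused
 * while it is still likely to be cached, and the inner loop walks rows of B
 * and C sequentially instead of striding down a column. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t kk = 0; kk < n; kk += TILE) {
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; only the order of memory accesses differs. The best tile size depends on the cache hierarchy and has to be measured on the target machine.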
Performance Tuning from a Von Neumann Perspective
The classic architecture can be abstracted into four tunable components: CPU/Memory, Network Interface Card (NIC), Disk, and Application. Optimizing each layer yields cumulative gains.
Kunpeng Soft and Hard Acceleration Overview
Kunpeng offers both software acceleration (single‑core and multi‑core) and hardware acceleration engines (compression, encryption, multimedia, etc.). These can deliver 10%‑100% performance improvements depending on the workload.
Compiler and JDK Optimizations for Single‑Core Speedup
Huawei’s compiler applies several techniques:
Instruction layout optimization – split functions and reorder hot/cold code to improve instruction‑cache hit rate.
Memory layout optimization – group frequently accessed data to boost data‑cache hit rate.
Loop optimization – parallelize independent loops across cores and auto‑vectorize dependency‑free code.
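To make "dependency‑free" concrete, the sketch below contrasts a loop whose iterations are independent with one that carries a dependency from one iteration to the next. It is a generic C illustration, not output from or a statement about Huawei's compiler; whether a given compiler actually vectorizes the first loop depends on its version and optimization flags (for example -O3), which is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i].
 * "restrict" promises that x and y do not overlap, so every iteration is
 * independent of the others and the compiler is free to process several
 * elements per instruction. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

/* In-place prefix sum.
 * Iteration i reads the value written by iteration i - 1, a loop-carried
 * dependency that prevents straightforward vectorization. */
void prefix_sum(size_t n, float *y)
{
    assert(y != NULL);
    for (size_t i = 1; i < n; i++)
        y[i] += y[i - 1];
}
```

Restructuring hot loops toward the first shape, where the algorithm allows it, gives an auto-vectorizer the most room to work.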
For Java, the BiSheng JDK adds:
JIT and GC improvements for faster memory management.
JVM loop, vectorization, and serialization enhancements.
NUMA‑Based Multi‑Core Optimization
Modern CPUs use multi‑core designs. In symmetric multiprocessing (SMP), all cores share a single memory controller, creating a bottleneck as core counts rise. NUMA partitions cores into nodes, each with its own memory controller, reducing contention.
Because memory is physically distributed, access latency varies by node. Kunpeng mitigates this with NUMA‑aware affinity planning, shortening the distance between processes and their memory.
NUMA Optimization Example with Nginx
Nginx, a high‑performance web server, normally copies data twice (NIC→kernel, kernel→worker). By binding NIC, kernel, and Nginx workers to the same NUMA node, end‑to‑end latency improves by ~15%.
Three practical ways to set NUMA affinity:
Use numactl -C 0-15 process_name to launch a process bound to specific cores.
Call int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask) from code.
Configure worker_cpu_affinity in nginx.conf for open‑source software that supports it.
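The second option can be sketched as follows on Linux with glibc. The helper names pin_to_cpu_range and allowed_cpu_count are hypothetical; only sched_setaffinity, sched_getaffinity, and the CPU_* macros come from the system headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to CPUs first..last (inclusive).
 * Returns 0 on success, -1 on error with errno set by sched_setaffinity. */
int pin_to_cpu_range(int first, int last)
{
    assert(first >= 0 && first <= last && last < CPU_SETSIZE);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = first; cpu <= last; cpu++)
        CPU_SET(cpu, &mask);

    /* A pid of 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(mask), &mask);
}

/* Number of CPUs the calling thread may currently run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}
```

Calling pin_to_cpu_range(0, 15) mirrors the numactl -C 0-15 example. The call fails if none of the requested CPUs is available to the process, so the return value should be checked. Which core numbers belong to which NUMA node varies by machine and has to be looked up first.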
Building Acceleration Libraries on Kunpeng
Beyond software acceleration, Kunpeng provides nine hardware‑offload libraries for basic, compression, crypto, and multimedia workloads, delivering 10%‑100% speedups in typical scenarios.
OpenSSL and Compression Acceleration
When a web service calls OpenSSL, the hardware acceleration library can be loaded without code changes, offloading cryptographic work to the chip.
Kunpeng also supports gzip/zlib, ZSTD, and Snappy compression engines, dramatically reducing compression time.
Real‑World Case: JD.com HTTPS Acceleration
JD.com switched from a QAT‑based solution to Kunpeng’s RSA acceleration engine, achieving a 33% improvement in HTTPS short‑connection latency.
Disk and NIC Optimizations for a Better Runtime Environment
Disk I/O can be accelerated by using XFS, enabling read‑ahead, tuning dirty‑page flushing, and applying I/O scheduler tweaks.
NIC interrupt frequency affects throughput and latency. Adjusting interrupt coalescing and affinity balances low latency with high throughput.
Application‑Level Tuning to Fully Exploit Hardware
Increasing concurrency, caching, and asynchronous I/O can boost performance, but excessive threads or cache usage may cause lock contention and cache‑line thrashing. Techniques include lock‑free programming, reducing lock granularity, and using high‑performance atomic instructions.
For example, TCMalloc serves most allocations from per‑thread caches, which reduces allocation‑related lock contention and improves high‑concurrency performance.
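Kunpeng’s 128‑byte cache line makes false sharing more likely: data laid out for 64‑byte lines can leave two independently updated variables in the same line. Placing hot and cold variables in separate cache lines avoids unnecessary invalidations.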
Adjusting MySQL’s cache‑line size from 64 B to Kunpeng’s 128 B yields a ~5% performance gain.
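One way to keep independently updated variables on separate lines is explicit alignment. The sketch below uses C11 alignas with the 128‑byte figure quoted above; the struct names, field names, and the same_cache_line helper are hypothetical and only illustrate the layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size, taken from the 128-byte figure in the text. */
#define CACHE_LINE 128

/* Two counters written by different threads. They sit sizeof(unsigned long)
 * bytes apart, so a write to one can invalidate the line holding the other. */
struct counters_shared {
    unsigned long rx_packets;
    unsigned long tx_packets;
};

/* Each counter starts on its own CACHE_LINE boundary, so writes to one
 * never touch the line that holds the other. */
struct counters_padded {
    alignas(CACHE_LINE) unsigned long rx_packets;
    alignas(CACHE_LINE) unsigned long tx_packets;
};

static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE,
              "each counter should occupy its own cache line");

/* Returns nonzero when both addresses fall in the same CACHE_LINE-sized block. */
int same_cache_line(const void *p, const void *q)
{
    return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
}
```

The padded layout trades memory for isolation: the two counters now occupy 256 bytes instead of 16, so it is worth applying only to data that is actually contended.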
Kunpeng Performance‑Tuning "Ten‑Sword" Checklist
Community‑derived best practices are grouped into four domains:
CPU/Memory: adjust page size, enable CPU prefetch, modify thread‑scheduling policy.
Disk: tune dirty‑page flushing, use asynchronous I/O (libaio), adjust filesystem parameters.
NIC: enable multi‑queue, turn on TSO, enable checksum offload.
Application: optimize compile options, implement file‑cache mechanisms, cache execution results, leverage NEON instructions.
MariaDB Performance‑Tuning Workflow Example
The process follows three iterative steps: monitoring, analysis, and optimization.
Monitoring – collect CPU (interrupts, time‑slice), memory (NUMA hit rate, usage), disk (iowait, utilization), and NIC (bandwidth) metrics.
Analysis – identify bottlenecks (e.g., CPU saturation when TPS peaks).
Optimization – increase concurrency, bind threads to NUMA nodes, use large pages, adjust InnoDB parameters such as innodb_thread_concurrency, innodb_sync_spin_loops, and innodb_spin_wait_delay.
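As an illustration of the monitoring step, the sketch below derives a NUMA hit rate from per-node counters. It assumes the Linux numastat text format of one "name value" pair per line (as exposed under /sys/devices/system/node/node<N>/numastat) and takes numa_hit / (numa_hit + numa_miss) as one plausible definition of the hit rate; the article does not specify either. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the counter called `name` in numastat-style text ("name value" per line).
 * Returns 0 and stores the value in *out on success, -1 if it is not found. */
int read_counter(const char *text, const char *name, unsigned long long *out)
{
    assert(text != NULL && name != NULL && out != NULL);

    const char *line = text;
    while (*line != '\0') {
        char key[64];
        unsigned long long value;
        if (sscanf(line, "%63s %llu", key, &value) == 2 &&
            strcmp(key, name) == 0) {
            *out = value;
            return 0;
        }
        const char *next = strchr(line, '\n');
        if (next == NULL)
            break;
        line = next + 1;
    }
    return -1;
}

/* Fraction of allocations satisfied on the intended node.
 * Returns -1.0 if a counter is missing or both counters are zero. */
double numa_hit_rate(const char *numastat_text)
{
    unsigned long long hit, miss;
    if (read_counter(numastat_text, "numa_hit", &hit) != 0 ||
        read_counter(numastat_text, "numa_miss", &miss) != 0 ||
        hit + miss == 0)
        return -1.0;
    return (double)hit / (double)(hit + miss);
}
```

Because such counters accumulate since boot, sampling them before and after a benchmark run and working from the difference gives a more useful figure than a single reading.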
Summary
CPU/Memory, Disk, NIC, and Application constitute the four main tuning dimensions.
Collect metrics, analyze bottlenecks, and iteratively optimize code and parameters.
Fully leveraging hardware resources unlocks the software’s optimal performance.
Balancing latency, throughput, and concurrency is essential for stable operation.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
