
Unlocking Kunpeng CPU Performance: Real-World Optimization Techniques and Benchmarks

This article provides a comprehensive, step‑by‑step guide to tuning Kunpeng‑based servers, covering hardware characteristics, matrix‑multiplication benchmarks, NUMA‑aware scheduling, compiler and JDK optimizations, acceleration libraries, disk and NIC tuning, and a practical MariaDB performance‑tuning workflow.

Architects' Tech Alliance

Why Kunpeng Performance Tuning Matters

Kunpeng‑based TaiShan servers rank among the leading servers in China's domestic market, thanks to high compute efficiency and a growing ecosystem of native applications and community support. Extracting maximum performance from them requires understanding both the hardware's traits and the available software‑level optimizations.

Matrix‑Multiplication Case Study

Multiplying two 4800×4800 matrices illustrates the impact of different implementations:

Pure Python: ~61,162 s

Single‑threaded C: ~757 s

C with multithreading: ~47 s

C + cache‑aware tuning: ~6.02 s

Kunpeng NEON vector instructions: ~1.99 s

The final result is a speedup of roughly 30,700× over pure Python; vectorization alone contributes about 3× on top of the cache‑tuned C version.
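The ratios can be rechecked from the reported timings. The sketch below is only arithmetic on the numbers listed above; the dictionary and function names are illustrative and not part of the original benchmark.

```python
# Timings in seconds, copied from the list above.
timings = {
    "Pure Python": 61162.0,
    "Single-threaded C": 757.0,
    "C with multithreading": 47.0,
    "C + cache-aware tuning": 6.02,
    "Kunpeng NEON vector instructions": 1.99,
}


def speedups(stages):
    """Return (name, speedup vs. first stage, speedup vs. previous stage) rows."""
    rows = []
    baseline = None
    previous = None
    for name, seconds in stages.items():
        if baseline is None:
            baseline = previous = seconds
        rows.append((name, baseline / seconds, previous / seconds))
        previous = seconds
    return rows


for name, overall, step in speedups(timings):
    print(f"{name:34s} {overall:10.1f}x overall {step:6.1f}x over previous stage")
```

Reading the rows in order shows where the gains come from: each step (compiled code, threads, cache tuning, vectorization) multiplies the one before it.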

Performance Tuning from a Von Neumann Perspective

The classic architecture can be abstracted into four tunable components: CPU/Memory, Network Interface Card (NIC), Disk, and Application. Optimizing each layer yields cumulative gains.

Kunpeng Soft and Hard Acceleration Overview

Kunpeng offers both software acceleration (single‑core and multi‑core) and hardware acceleration engines (compression, encryption, multimedia, etc.). These can deliver 10%‑100% performance improvements depending on the workload.

Compiler and JDK Optimizations for Single‑Core Speedup

Huawei’s compiler applies several techniques:

Instruction layout optimization – split functions and reorder hot/cold code to improve instruction‑cache hit rate.

Memory layout optimization – group frequently accessed data to boost data‑cache hit rate.

Loop optimization – parallelize independent loops across cores and auto‑vectorize dependency‑free code.

For Java, the BiSheng JDK adds:

JIT and GC improvements for faster memory management.

JVM loop, vectorization, and serialization enhancements.

NUMA‑Based Multi‑Core Optimization

Modern CPUs use multi‑core designs. In symmetric multiprocessing (SMP), all cores share a single memory controller, creating a bottleneck as core counts rise. NUMA partitions cores into nodes, each with its own memory controller, reducing contention.

Because memory is physically distributed, access latency varies by node. Kunpeng mitigates this with NUMA‑aware affinity planning, shortening the distance between processes and their memory.
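Before planning affinity, it helps to know which cores belong to which node. The following is a minimal sketch that assumes a Linux system exposing NUMA nodes under /sys/devices/system/node; the function names are illustrative, and the actual layout depends on the machine.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file (memory-only node) yields no CPUs
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def numa_topology(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids; empty dict if no nodes are exposed."""
    topology = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        cpulist = os.path.join(path, "cpulist")
        if match and os.path.isfile(cpulist):
            with open(cpulist) as handle:
                topology[int(match.group(1))] = parse_cpulist(handle.read())
    return topology


if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

The per-node CPU lists are the input for the binding methods described in the next section.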

NUMA Optimization Example with Nginx

Nginx, a high‑performance web server, normally copies data twice (NIC→kernel, kernel→worker). By binding the NIC's interrupts, the kernel's network processing, and the Nginx workers to the same NUMA node, both copies stay in node‑local memory and end‑to‑end latency improves by ~15%.

Three practical ways to set NUMA affinity:

Launch the program with numactl -C 0-15 <command> to restrict it to cores 0‑15; adding -m <node> also binds its memory allocations to that NUMA node.

Call int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask) from code; glibc declares it in <sched.h> when _GNU_SOURCE is defined.

Configure worker_cpu_affinity in nginx.conf for open‑source software that supports it.
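The second and third options can be sketched in a few lines. The example below uses Python's os.sched_setaffinity, which wraps the same system call, and builds the bitmask strings that worker_cpu_affinity expects (rightmost bit is CPU 0). The helper names, the 0‑15 core range, and the four‑worker layout are assumptions for illustration only.

```python
import os


def pin_current_process(cpus):
    """Restrict the calling process to `cpus`, limited to what is already allowed.

    Returns the resulting affinity set, or None where the platform lacks the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # only some Unix platforms (notably Linux) provide it
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    target = set(cpus) & allowed
    if not target:
        raise ValueError(f"none of {sorted(cpus)} is in the allowed set")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


def nginx_affinity_masks(cpus, width):
    """Build one bitmask string per worker for nginx's worker_cpu_affinity."""
    masks = []
    for cpu in cpus:
        if not 0 <= cpu < width:
            raise ValueError(f"CPU {cpu} does not fit in a {width}-bit mask")
        masks.append(format(1 << cpu, f"0{width}b"))
    return masks


if __name__ == "__main__":
    # Hypothetical layout: keep this process on cores 0-15 (one NUMA node).
    print(pin_current_process(range(0, 16)))
    # Hypothetical four-worker configuration line for nginx.conf.
    print("worker_cpu_affinity " + " ".join(nginx_affinity_masks(range(4), 4)) + ";")
```

Affinity set this way restricts where threads run; it does not by itself control where memory is allocated, which is why the numactl memory-binding option is often combined with it.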

Building Acceleration Libraries on Kunpeng

Beyond software acceleration, Kunpeng provides nine hardware‑offload libraries for basic, compression, crypto, and multimedia workloads, delivering 10%‑100% speedups in typical scenarios.

OpenSSL and Compression Acceleration

When a web service calls OpenSSL, the hardware acceleration library can be loaded (path 2 in the source diagram) without code changes, offloading cryptographic work to the chip.

Kunpeng also supports gzip/zlib, ZSTD, and Snappy compression engines, dramatically reducing compression time.

Real‑World Case: JD.com HTTPS Acceleration

JD.com switched from a QAT‑based solution to Kunpeng’s RSA acceleration engine, achieving a 33% improvement in HTTPS short‑connection latency.

Disk and NIC Optimizations for a Better Runtime Environment

Disk I/O can be accelerated by using XFS, enabling read‑ahead, tuning dirty‑page flushing, and applying I/O scheduler tweaks.
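Before changing any of these settings, it is worth recording the current values. The sketch below reads them from the standard Linux sysfs and procfs locations; it assumes a Linux host, and the function names are illustrative rather than part of any Kunpeng tool.

```python
import os


def parse_active_scheduler(text):
    """Return the scheduler shown in [brackets] in a queue/scheduler file, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_block_settings(device, sysfs_root="/sys/block"):
    """Read the read-ahead size (KiB) and active I/O scheduler of one block device."""
    queue = os.path.join(sysfs_root, device, "queue")
    with open(os.path.join(queue, "read_ahead_kb")) as handle:
        read_ahead_kb = int(handle.read().strip())
    with open(os.path.join(queue, "scheduler")) as handle:
        scheduler = parse_active_scheduler(handle.read())
    return {"read_ahead_kb": read_ahead_kb, "scheduler": scheduler}


def read_dirty_page_settings(procfs_root="/proc/sys/vm"):
    """Read the dirty-page flushing knobs that exist under procfs_root."""
    names = (
        "dirty_background_ratio",
        "dirty_ratio",
        "dirty_expire_centisecs",
        "dirty_writeback_centisecs",
    )
    settings = {}
    for name in names:
        path = os.path.join(procfs_root, name)
        if os.path.isfile(path):
            with open(path) as handle:
                settings[name] = int(handle.read().strip())
    return settings


if __name__ == "__main__":
    print(read_dirty_page_settings())
    if os.path.isdir("/sys/block"):
        for device in sorted(os.listdir("/sys/block")):
            try:
                print(device, read_block_settings(device))
            except (OSError, ValueError):
                pass  # some virtual devices do not expose these files
```

Keeping a baseline like this makes it possible to attribute a later throughput change to a specific parameter.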

NIC interrupt frequency affects throughput and latency. Adjusting interrupt coalescing and affinity balances low latency with high throughput.
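Interrupt affinity on Linux is set per IRQ through /proc/irq/<n>/smp_affinity, which takes a hexadecimal CPU bitmask. The sketch below only computes the values; the IRQ numbers and core list in the example are hypothetical, and writing the masks requires root privileges.

```python
def cpus_to_irq_mask(cpus):
    """Format CPU ids as the hex bitmask used by /proc/irq/<n>/smp_affinity.

    Masks wider than 32 bits are written as comma-separated 32-bit groups,
    most significant group first.
    """
    value = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        value |= 1 << cpu
    groups = []
    while True:
        groups.append(value & 0xFFFFFFFF)
        value >>= 32
        if value == 0:
            break
    groups.reverse()
    head = format(groups[0], "x")
    tail = [format(group, "08x") for group in groups[1:]]
    return ",".join([head] + tail)


def assign_queues(irqs, cpus):
    """Spread NIC queue IRQs round-robin over the CPUs of one NUMA node."""
    cpus = list(cpus)
    if not cpus:
        raise ValueError("need at least one CPU")
    return {irq: cpus[index % len(cpus)] for index, irq in enumerate(irqs)}


if __name__ == "__main__":
    # Hypothetical example: four NIC queue IRQs spread over cores 0-3.
    for irq, cpu in assign_queues([120, 121, 122, 123], range(4)).items():
        print(f"/proc/irq/{irq}/smp_affinity <- {cpus_to_irq_mask([cpu])}")
```

Choosing the cores from the NUMA node the NIC is attached to keeps interrupt handling close to the memory the driver uses.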

Application‑Level Tuning to Fully Exploit Hardware

Increasing concurrency, caching, and asynchronous I/O can boost performance, but excessive threads or cache usage may cause lock contention and cache‑line thrashing. Techniques include lock‑free programming, reducing lock granularity, and using high‑performance atomic instructions.

For example, TCMalloc serves most allocations from per‑thread caches, which cuts allocator lock contention and improves high‑concurrency performance.

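Kunpeng’s 128‑byte cache line makes false sharing more likely: variables written by different threads end up in the same line more often than with 64‑byte lines. Placing hot and cold variables in separate cache lines, by padding or aligning them, avoids unnecessary invalidations.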

Changing the cache‑line size that MySQL assumes for padding and alignment from 64 B to Kunpeng’s 128 B yields a ~5% performance gain.
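The effect of padding on data layout can be illustrated with ctypes, which follows the platform's C struct layout rules. This is only a layout sketch under the 128‑byte line size cited above; the structure and function names are made up for the example, and real code would apply the padding or alignment in C or C++.

```python
import ctypes

CACHE_LINE = 128  # bytes; the line size cited above for Kunpeng


class PackedCounters(ctypes.Structure):
    """Two counters side by side: both can land in one cache line."""
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class PaddedCounters(ctypes.Structure):
    """Padding pushes the second counter a full cache line away from the first."""
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def can_share_line(struct_type, first, second, line_size=CACHE_LINE):
    """True if the two fields start less than one cache line apart.

    Naturally aligned fields that start a full line apart cannot fall into
    the same line, whatever the base address of the structure is.
    """
    offset_a = getattr(struct_type, first).offset
    offset_b = getattr(struct_type, second).offset
    return abs(offset_a - offset_b) < line_size


if __name__ == "__main__":
    for struct_type in (PackedCounters, PaddedCounters):
        print(
            struct_type.__name__,
            "b at offset", struct_type.b.offset,
            "size", ctypes.sizeof(struct_type),
            "may share a line:", can_share_line(struct_type, "a", "b"),
        )
```

The padded layout costs memory per structure, so it is usually reserved for a small number of heavily written variables such as per‑thread counters.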

Kunpeng Performance‑Tuning "Ten‑Sword" Checklist

Community‑derived best practices are grouped into four domains:

CPU/Memory : adjust page size, enable CPU prefetch, modify thread‑scheduling policy.

Disk : tune dirty‑page flushing, use asynchronous I/O (libaio), adjust filesystem parameters.

NIC : enable multi‑queue, turn on TSO, enable checksum offload.

Application : optimize compile options, implement file‑cache mechanisms, cache execution results, leverage NEON instructions.

MariaDB Performance‑Tuning Workflow Example

The process follows three iterative steps: monitoring, analysis, and optimization.

Monitoring – collect CPU (interrupts, time‑slice), memory (NUMA hit rate, usage), disk (iowait, utilization), and NIC (bandwidth) metrics.

Analysis – identify bottlenecks (e.g., CPU saturation when TPS peaks).

Optimization – increase concurrency, bind threads to NUMA nodes, use huge pages, adjust InnoDB parameters such as innodb_thread_concurrency, innodb_sync_spin_loops, and innodb_spin_wait_delay.
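The monitoring and analysis steps can be sketched for the CPU dimension using two samples of the aggregate cpu line in /proc/stat. This assumes a Linux host; the function names and the thresholds in classify are illustrative assumptions, not values from the original workflow.

```python
import time

FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # guest and guest_nice are already included in user and nice, so skip them.
    values = [int(value) for value in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def utilization(before, after):
    """Return busy and iowait fractions of CPU time between two samples."""
    delta = {name: after[name] - before[name] for name in before}
    total = sum(delta.values())
    if total <= 0:
        return {"busy": 0.0, "iowait": 0.0}
    busy = total - delta["idle"] - delta.get("iowait", 0)
    return {"busy": busy / total, "iowait": delta.get("iowait", 0) / total}


def classify(usage, busy_limit=0.9, iowait_limit=0.2):
    """Rough bottleneck hint; both limits are illustrative assumptions."""
    if usage["busy"] >= busy_limit:
        return "cpu-bound"
    if usage["iowait"] >= iowait_limit:
        return "io-bound"
    return "no clear CPU or disk bottleneck"


def sample_cpu(path="/proc/stat"):
    """Read the first line of /proc/stat and parse it."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())


if __name__ == "__main__":
    first = sample_cpu()
    time.sleep(1.0)
    usage = utilization(first, sample_cpu())
    print(usage, classify(usage))
```

Sampling this while the database load ramps up shows whether TPS flattens at the point where the CPU saturates, which is the signal that points toward the concurrency, affinity, and InnoDB spin parameters listed above.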

Summary

CPU/Memory, Disk, NIC, and Application constitute the four main tuning dimensions.

Collect metrics, analyze bottlenecks, and iteratively optimize code and parameters.

Fully leveraging hardware resources unlocks the software’s optimal performance.

Balancing latency, throughput, and concurrency is essential for stable operation.
