Boost Kunpeng Server Apps: 7 Proven Performance Tuning Techniques

This guide walks you through seven practical optimization methods for Kunpeng‑based servers—including compiler flags, buffer selection, result caching, memory‑copy reduction, lock refinement, jemalloc integration, and cache‑line alignment—to fully exploit the hardware’s capabilities.

Huawei Cloud Developer Alliance

1. Introduction

After deploying an application on a Kunpeng server, developers should tailor the code to the chip and server characteristics so that hardware potential is fully utilized. This chapter presents typical scenarios covering locks, compiler configuration, cache‑line usage, and buffering mechanisms.

2. Optimization Methods

2.1 Optimize Compile Options

Principle : GCC translates source code into CPU instructions, and pipeline efficiency depends on instruction ordering, resource usage, and data dependencies. Telling the compiler the target architecture (ARMv8) and the target pipeline (TSV110, the core used in Kunpeng 920) lets it schedule instructions for that pipeline.

Modification :

On Euler systems using the HCC compiler, add to CFLAGS and CPPFLAGS: -mtune=tsv110 -march=armv8-a

On other OSes, upgrade GCC to 9.1.0 or later (the first release series that recognizes tsv110) and add the same flags.

2.2 Choose File Buffer Mechanism

Principle : Memory access is faster than disk access, so applications typically use buffers to reduce direct disk I/O. Two main mechanisms are:

clib buffer : User‑space buffer maintained by the C library (the stdio buffer behind fread/fwrite). It delays handing data to the kernel until the buffer fills or an explicit flush is triggered, which reduces user‑kernel switches.

PageCache : Kernel‑space cache that eventually writes back to disk, also reducing disk accesses.

Modification :

Use read/write for large‑chunk I/O to avoid extra memory copy, or fread/fwrite for many small calls to reduce system‑call overhead.

Open files with O_DIRECT to bypass the PageCache when the application provides its own buffering and the data is read only once.
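The choice between the two interfaces can be sketched in C as follows. This is a minimal illustration, not a benchmark: the function names, the record format, and the 64 KiB buffer size are assumptions made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define STDIO_BUF_SIZE (64 * 1024)  /* assumed size; tune per workload */

/* Many small writes: let the C library batch them in a user-space
 * buffer so that only a few write() system calls reach the kernel.
 * Returns 0 on success, -1 on error. */
int write_small_records(const char *path, int count)
{
    assert(path != NULL && count >= 0);
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    char *buf = malloc(STDIO_BUF_SIZE);
    /* Full buffering: data is flushed when the buffer fills or on fclose(). */
    if (buf == NULL || setvbuf(fp, buf, _IOFBF, STDIO_BUF_SIZE) != 0) {
        fclose(fp);
        free(buf);
        return -1;
    }
    int ok = 1;
    for (int i = 0; i < count && ok; i++)
        ok = (fwrite(&i, sizeof i, 1, fp) == 1);
    if (fclose(fp) != 0)
        ok = 0;
    free(buf);  /* only after fclose(): the stream uses buf until then */
    return ok ? 0 : -1;
}

/* One large block: call write() directly so the data is not copied
 * through the stdio buffer first. Loops to handle short writes.
 * Returns 0 on success, -1 on error. */
int write_large_block(const char *path, const void *data, size_t len)
{
    assert(path != NULL && data != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return close(fd);
}
```

Both paths still go through the PageCache. O_DIRECT is left out of the sketch because it typically requires the buffer address, file offset, and length to be aligned to the device's block size, which depends on the target system.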

2.3 Execution Result Caching

Principle : Identical inputs produce identical outputs; caching results avoids recomputation.

Modification :

Enable Nginx proxy_cache_path for HTTP response caching.

Use JIT compilation to cache generated machine code.

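On MySQL versions that still include it (the query cache was deprecated in 5.7.20 and removed in 8.0), configure the query cache via query_cache_size and query_cache_type, and monitor it with SHOW STATUS LIKE '%Qcache%';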
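At application level, the same principle can be sketched as a small memoization table in C. Everything here is illustrative: expensive_compute is a stand-in for any pure function, and the names and the 256-slot table size are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SLOTS 256  /* assumed table size; must be a power of two */
static_assert((CACHE_SLOTS & (CACHE_SLOTS - 1)) == 0,
              "CACHE_SLOTS must be a power of two");

struct cache_entry {
    bool     valid;
    uint32_t key;
    uint64_t value;
};

static struct cache_entry cache[CACHE_SLOTS];
static unsigned long compute_calls;  /* counts real computations, for illustration */

/* Stand-in for an expensive pure function: sum of squares 1..n. */
static uint64_t expensive_compute(uint32_t n)
{
    compute_calls++;
    uint64_t sum = 0;
    for (uint64_t i = 1; i <= n; i++)
        sum += i * i;
    return sum;
}

/* Direct-mapped cache: each key maps to exactly one slot, and a
 * colliding key simply overwrites the previous entry. */
uint64_t cached_compute(uint32_t n)
{
    struct cache_entry *e = &cache[n & (CACHE_SLOTS - 1)];
    if (e->valid && e->key == n)
        return e->value;              /* hit: skip the computation */
    e->value = expensive_compute(n);  /* miss: compute and remember */
    e->key = n;
    e->valid = true;
    return e->value;
}
```

The table is not thread-safe as written. Caching only pays off when the function is deterministic and the same inputs recur often enough to offset the lookup cost.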

2.4 Reduce Memory Copies

Principle : Fewer memory copies lower CPU usage and memory‑bandwidth pressure.

Modification :

Replace read / write sequences with sendfile to achieve only two copies (disk → kernel cache → NIC).

Use shared memory (e.g., shmget) instead of sockets/pipes for inter‑process communication.
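The sendfile approach might look like the following on Linux. This is a sketch under assumptions: send_whole_file is a hypothetical helper name, and out_fd would normally be a connected socket.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Send the whole file at `path` to `out_fd` without copying the data
 * through a user-space buffer. Returns the number of bytes sent, or
 * -1 on error. */
ssize_t send_whole_file(int out_fd, const char *path)
{
    assert(path != NULL);
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel advances `offset` by the number of bytes it sent,
         * so a short transfer is resumed on the next iteration. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) {  /* error, or the file shrank underneath us */
            close(in_fd);
            return -1;
        }
    }
    close(in_fd);
    return (ssize_t)offset;
}
```

A read/write loop would instead copy each block from the kernel into a user buffer and back again. With sendfile the data stays in kernel space for the whole transfer.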

2.5 Lock Optimization

Principle : Spin locks and CAS loops waste CPU cycles while waiting for atomic operations to succeed.

Modification :

Replace a single large lock with finer‑grained locks per core or thread.

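Prefer the acquire/release exclusive pair ldaxr + stlxr over ldxr / stxr followed by a dmb ish barrier; the one‑way ordering built into the load and store is cheaper than a separate full barrier.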

Reduce the number of threads contending for the same lock where possible.

Align lock variables to cache‑line boundaries to avoid false sharing.

Use atomic_add_return instead of manual read‑modify‑write loops.
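Two of the points above, a single atomic add instead of a manual loop and finer‑grained, cache‑line‑aligned state, can be sketched with C11 atomics. atomic_add_return is a Linux kernel interface; the sketch assumes atomic_fetch_add from <stdatomic.h> as the user‑space counterpart. The function names and NUM_SHARDS are hypothetical, and CACHE_LINE_SIZE uses the 128‑byte figure this article gives for Kunpeng 920.

```c
#include <assert.h>
#include <stdatomic.h>

#define CACHE_LINE_SIZE 128  /* figure quoted in this article for Kunpeng 920 */
#define NUM_SHARDS 8         /* assumed; e.g. one shard per worker thread */

/* Manual read-modify-write: retry until the compare-and-swap succeeds.
 * Under contention, every failed attempt is wasted work. */
long add_with_cas_loop(atomic_long *counter, long delta)
{
    long old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + delta)) {
        /* `old` now holds the current value; loop and try again. */
    }
    return old + delta;
}

/* Single atomic add that returns the new value. */
long add_with_fetch_add(atomic_long *counter, long delta)
{
    return atomic_fetch_add(counter, delta) + delta;
}

/* Finer-grained state: one counter per shard, each on its own cache
 * line so that updates to different shards do not interfere. */
struct shard {
    _Alignas(CACHE_LINE_SIZE) atomic_long value;
};

static struct shard shards[NUM_SHARDS];

void sharded_add(unsigned shard_id, long delta)
{
    assert(shard_id < NUM_SHARDS);
    atomic_fetch_add_explicit(&shards[shard_id].value, delta,
                              memory_order_relaxed);
}

/* Readers pay the cost instead: they sum all shards. */
long sharded_total(void)
{
    long total = 0;
    for (unsigned i = 0; i < NUM_SHARDS; i++)
        total += atomic_load_explicit(&shards[i].value,
                                      memory_order_relaxed);
    return total;
}
```

How atomic_fetch_add is lowered depends on the compiler and its flags. The sharded counter suits workloads with frequent writes and rare reads.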

2.6 Use jemalloc for Memory Allocation

Principle : jemalloc provides higher allocation throughput and lower fragmentation in multithreaded workloads by spreading threads across multiple arenas and serving small allocations from per‑thread caches, which reduces lock contention.

Modification :

Download and compile jemalloc from https://github.com/jemalloc/jemalloc.

Link applications with jemalloc using:

-I$(jemalloc-config --includedir) -L$(jemalloc-config --libdir) -Wl,-rpath,$(jemalloc-config --libdir) -ljemalloc $(jemalloc-config --libs)

Configure MySQL to use jemalloc by adding malloc-lib=/usr/local/lib/libjemalloc.so to the [mysqld_safe] section of my.cnf.

2.7 Cache‑Line Optimization

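Principle : CPUs move data between memory and cache in fixed‑size cache lines. When independent variables written by different cores share one line, each write invalidates the other cores' copies (false sharing), and data that straddles a line boundary needs two line accesses; both lower cache‑hit rates.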

Modification :

Align frequently accessed data to the cache‑line size (e.g., 128 bytes on Kunpeng 920) using posix_memalign(void **memptr, size_t alignment, size_t size).

Pad structures manually, e.g.:

int writeHighFreq;
char pad[CACHE_LINE_SIZE - sizeof(int)];

Adjust macro definitions in open‑source projects (e.g., CACHE_LINE_SIZE in Impala) to match the target platform.
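Put together, the padding and alignment steps could look like this in C. The struct name padded_counter and the helper alloc_counters are hypothetical. CACHE_LINE_SIZE is set to the 128‑byte value mentioned above and should be adjusted for other platforms.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 128  /* value quoted above for Kunpeng 920 */

/* A frequently written field padded out to a full cache line, so that
 * adjacent array elements never share a line. */
struct padded_counter {
    int  writeHighFreq;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
};

/* Catch a wrong pad size at compile time. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "padded_counter must occupy exactly one cache line");

/* Allocate `count` counters starting on a cache-line boundary.
 * Returns NULL on failure; release the memory with free(). */
struct padded_counter *alloc_counters(size_t count)
{
    void *mem = NULL;
    if (posix_memalign(&mem, CACHE_LINE_SIZE,
                       count * sizeof(struct padded_counter)) != 0)
        return NULL;
    return mem;
}
```

Padding alone fixes the spacing between elements. The aligned allocation is what guarantees that the first element starts on a line boundary. For static or stack objects, C11 _Alignas(CACHE_LINE_SIZE) serves the same purpose.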

3. Conclusion

By applying these seven optimization techniques—compiler tuning, appropriate buffering, result caching, minimizing memory copies, refined locking, jemalloc integration, and cache‑line alignment—developers can significantly improve the performance of applications running on Kunpeng servers.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
