
Boost Network Performance on Kunpeng CPUs: Tuning Tips & Tools

This guide explains how to improve network subsystem performance on Kunpeng processors by using tools such as ethtool and strace, adjusting PCIe payload size, binding NIC interrupts to NUMA‑local cores, tweaking interrupt coalescing, enabling TSO, and replacing select with epoll for high‑concurrency workloads.

Huawei Cloud Developer Alliance

1. Tuning Overview

This article covers performance tuning for the network subsystem on Kunpeng processors, focusing on optimizing NIC performance and offloading CPU work in high‑concurrency scenarios. It recommends using two NICs, each bound to a different CPU, and choosing x16 PCIe cards where possible.

2. Common Performance Monitoring Tools

2.1 ethtool

ethtool is a powerful Linux network management utility supported by most NIC drivers. It can query and configure NIC status, driver version, link speed, and offload capabilities.

Installation (CentOS):

# yum -y install ethtool net-tools

Basic usage:

ethtool [options]

Typical commands and output examples:

# ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on

# ethtool -l eth0
Channel parameters for eth0:
…
Combined: 8

# ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off  TX: off
…
rx-usecs: 30
…
tx-usecs: 30
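When the feature list is long, it can help to pull out a single flag in a script. The following is a minimal sketch, not part of the original article: feature_state is a hypothetical helper name, and it assumes the "name: state" layout shown in the ethtool -k output above.

```shell
# feature_state NAME
# Reads `ethtool -k`-style output on stdin and prints the state ("on" or
# "off") of the feature NAME. Prints nothing if the feature is not listed.
# (Hypothetical helper; assumes one "name: state" pair per line.)
feature_state() {
    awk -F': *' -v name="$1" '
        {
            key = $1
            sub(/^[ \t]+/, "", key)      # ethtool indents some lines
            if (key == name) {
                split($2, parts, " ")    # drop a trailing "[fixed]" marker
                print parts[1]
                exit
            }
        }'
}

# Usage on a live system (eth0 is a placeholder interface name):
#   ethtool -k eth0 | feature_state tcp-segmentation-offload
```

Because the helper only reads stdin, it can also be run against saved ethtool output when comparing hosts.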

2.2 strace

strace is a Linux debugging tool that traces system calls of a program, printing call name, arguments, and return values.

Installation (CentOS):

# yum -y install strace

Usage:

strace [options] program

Common options:

-T Show time spent in each call.

-tt Prefix each line with a microsecond‑resolution timestamp.

-p Trace a specific process ID.

Example output:

18:25:47.902439 epoll_pwait(716, [{EPOLLIN, …}], 512, 1, NULL, 8) = 5 <0.000038>
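A trace like the one above quickly grows to thousands of lines, so a per-syscall summary is often more useful than the raw log. The sketch below is one possible way to build it; syscall_time is a hypothetical helper name, and it assumes the log was captured with both -tt and -T so that the syscall name is the second field and the duration in angle brackets is the last one.

```shell
# syscall_time
# Reads `strace -tt -T` output on stdin and prints, per syscall:
#   name  number_of_calls  total_seconds
# Lines without a trailing <duration> (signals, unfinished calls) are skipped.
# (Hypothetical helper; assumes no "[pid N]" prefix, i.e. strace without -f.)
syscall_time() {
    awk '
        /<[0-9.]+>$/ {
            name = $2
            sub(/\(.*/, "", name)        # keep the text before the first "("
            t = $NF
            gsub(/[<>]/, "", t)          # "<0.000038>" -> "0.000038"
            calls[name]++
            total[name] += t
        }
        END {
            for (n in calls)
                printf "%s %d %.6f\n", n, calls[n], total[n]
        }' | sort
}

# Usage (trace.log is a placeholder file name):
#   strace -p "$PID" -T -tt -o trace.log    # stop with Ctrl-C after a while
#   syscall_time < trace.log
```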

3. Optimization Methods

3.1 PCIe Max Payload Size Configuration

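The NIC transfers data to CPU memory via the PCIe bus. Every PCIe transaction carries a fixed header overhead, so increasing the Max Payload Size (the maximum number of data bytes per transaction) improves PCIe bandwidth utilization.

To change it, open the BIOS setup and set Advanced → Max Payload Size to 512B.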
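After changing the BIOS setting, the value in effect can usually be checked from Linux. The sketch below is an assumption-laden illustration rather than a procedure from the original article: payload_sizes is a hypothetical helper, and it relies on the lspci -vv layout in which the DevCap line reports the supported MaxPayload and a line containing both MaxPayload and MaxReadReq reports the active one. Reading these capability fields normally requires root.

```shell
# payload_sizes
# Reads `lspci -vv` output for ONE device on stdin and prints
#   supported=<bytes> active=<bytes>
# (Hypothetical helper; assumes the DevCap/DevCtl text layout of pciutils.)
payload_sizes() {
    awk '
        /DevCap:.*MaxPayload/ {
            for (i = 1; i < NF; i++) if ($i == "MaxPayload") cap = $(i + 1)
        }
        /MaxPayload.*MaxReadReq/ {
            for (i = 1; i < NF; i++) if ($i == "MaxPayload") ctl = $(i + 1)
        }
        END {
            if (cap != "" && ctl != "")
                print "supported=" cap " active=" ctl
        }'
}

# Usage (eth0 is a placeholder; bus-info comes from `ethtool -i`):
#   bus=$(ethtool -i eth0 | awk '/^bus-info:/ {print $2}')
#   lspci -s "$bus" -vv | payload_sizes
```

If the active value is lower than the supported one, the setting may have been limited by another device on the same path, so the result is a hint rather than proof of a misconfiguration.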

3.2 Network NUMA Core Binding

When a NIC receives many packets, it generates interrupts that the kernel handles. With a single queue, one core processes all packets, limiting scalability. Enabling multi‑queue allows different cores to handle different queues, but if the interrupt‑handling core is on a different NUMA node than the NIC, cross‑NUMA memory accesses add latency.

To bind NIC interrupts to the NUMA node where the NIC resides:

Stop and disable irqbalance.

# systemctl stop irqbalance.service
# systemctl disable irqbalance.service

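Set the number of NIC queues equal to the number of CPU cores (ethx stands for the NIC name; 48 is the core count in this example).

# ethtool -L ethx combined 48

Find the IRQ numbers for the NIC ($eth holds the NIC name).

# cat /proc/interrupts | grep $eth | awk -F ':' '{print $1}'

Bind each IRQ to a specific core. Note that smp_affinity_list expects a CPU list such as 3 or 0-3; a hexadecimal CPU mask belongs in smp_affinity instead.

# echo $cpuList > /proc/irq/$irq/smp_affinity_list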
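The binding steps above can be scripted. The following is a sketch under stated assumptions, not the article's own script: expand_cpulist and assign_irqs are hypothetical helper names, the IRQs are spread round-robin over the cores of one NUMA node, and the commented usage assumes the standard sysfs files numa_node and cpulist are present (numa_node reads -1 on systems that do not report a node).

```shell
# expand_cpulist "0-3,8"
# Prints one CPU number per line for a kernel-style CPU list.
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        seq "$first" "${last:-$first}"   # a single CPU is a range of one
    done
}

# assign_irqs CPULIST
# Reads IRQ numbers on stdin (one per line) and prints "irq cpu" pairs,
# cycling through the CPUs of CPULIST round-robin.
assign_irqs() {
    cpus=$(expand_cpulist "$1" | tr '\n' ' ')
    awk -v cpus="$cpus" '
        BEGIN     { n = split(cpus, cpu, " ") }
        n > 0 && NF { print $1, cpu[k % n + 1]; k++ }'
}

# Usage as root (eth0 is a placeholder NIC name):
#   eth=eth0
#   node=$(cat /sys/class/net/$eth/device/numa_node)
#   cpulist=$(cat /sys/devices/system/node/node$node/cpulist)
#   grep "$eth" /proc/interrupts | awk -F: '{print $1}' | tr -d ' ' |
#       assign_irqs "$cpulist" |
#       while read -r irq cpu; do
#           echo "$cpu" > /proc/irq/$irq/smp_affinity_list
#       done
```

Printing the pairs before writing them makes it easy to review the plan first; the write loop can be added once the mapping looks right.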

3.3 Interrupt Coalescing Parameter Tuning

Interrupt coalescing lets the NIC delay generating an interrupt until a configurable number of packets has arrived or a configurable timeout has expired, reducing interrupt overhead.

Adjust with ethtool -C $eth (replace $eth with the NIC name):

# ethtool -C eth3 adaptive-rx off adaptive-tx off rx-usecs N rx-frames N tx-usecs N tx-frames N

Increasing the values for rx-usecs, rx-frames, tx-usecs, and tx-frames reduces the interrupt rate at the cost of a few microseconds of packet latency.
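To get a feel for the trade-off, the timer value can be converted into a rough interrupt-rate ceiling. This is a back-of-the-envelope sketch, assuming adaptive coalescing is off and the timer (not the frame threshold) is what triggers each interrupt; max_irq_rate is a hypothetical helper name.

```shell
# max_irq_rate USECS
# Prints the approximate maximum number of interrupts per second for one
# queue when the NIC waits USECS microseconds before raising an interrupt.
# (Assumes the frame-count threshold is not reached before the timer fires.)
max_irq_rate() {
    [ "$1" -gt 0 ] || { echo "usecs must be positive" >&2; return 1; }
    echo $(( 1000000 / $1 ))
}

# Usage:
#   max_irq_rate 30     # the rx-usecs value shown in the ethtool -c example
#   max_irq_rate 100
```

With rx-usecs at 30 the ceiling is about 33,333 interrupts per second per queue; raising it to 100 lowers the ceiling to 10,000 while allowing a packet to wait up to 100 µs before the CPU is notified.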

3.4 Enable TSO (TCP Segmentation Offload)

TSO offloads TCP segmentation to the NIC, allowing large data blocks to be sent without the kernel splitting them into MTU‑sized packets, thus lowering CPU usage and interrupt frequency.

Enable TSO:

# ethtool -K $eth tso on

Verify the setting (lowercase -k queries the features, uppercase -K changes them):

# ethtool -k $eth

Typical output shows tx-checksumming: on, scatter-gather: on, and tcp-segmentation-offload: on.
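A small calculation shows how much per-packet work TSO can move off the CPU. The sketch below counts the segments a single large send would otherwise be split into; segments_needed is a hypothetical helper, and the MSS of 1448 bytes is an assumption (a 1500-byte MTU minus 20 bytes of IPv4 header, 20 bytes of TCP header, and 12 bytes of TCP timestamp option).

```shell
# segments_needed BYTES MSS
# Prints how many TCP segments are required to carry BYTES of payload when
# each segment holds at most MSS bytes (ceiling division).
segments_needed() {
    bytes=$1 mss=$2
    [ "$mss" -gt 0 ] || { echo "mss must be positive" >&2; return 1; }
    echo $(( (bytes + mss - 1) / mss ))
}

# Usage:
#   segments_needed 65536 1448
```

For a 64 KiB send this gives 46 segments: without TSO the kernel builds and queues each of them, whereas with TSO it hands the NIC one large buffer and the NIC performs the split.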

3.5 Replace select with epoll

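In high‑concurrency workloads, select is limited to FD_SETSIZE file descriptors (1024 with glibc) and has to scan the entire descriptor set on every call. epoll has no such fixed limit and returns only the descriptors that are ready, which reduces CPU overhead as the number of connections grows.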

Key epoll functions:

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events, int maxevents, int timeout, const sigset_t *sigmask);

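Typical usage replaces the select loop with a one‑time registration of each descriptor through epoll_ctl and a loop around epoll_wait, so the cost of each call depends on the number of ready descriptors rather than the number of watched ones.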

To verify that a running process uses epoll, attach strace to it ($TID is the ID of the thread or process to trace):

# strace -p $TID -T -tt
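The resulting trace can be reduced to a single line that shows which event-wait family the process actually calls. This is a minimal sketch with a hypothetical helper name, event_model; it assumes the -tt format so that the syscall name is the second field. On aarch64 systems such as Kunpeng, glibc typically issues pselect6, ppoll, and epoll_pwait rather than select, poll, and epoll_wait, so both spellings are counted.

```shell
# event_model
# Reads `strace -tt` output on stdin and prints how many calls belong to the
# epoll family and how many to the select/poll family.
# (Hypothetical helper; assumes no "[pid N]" prefix, i.e. strace without -f.)
event_model() {
    awk '
        {
            name = $2
            sub(/\(.*/, "", name)        # keep the text before the first "("
            if (name ~ /^(epoll_wait|epoll_pwait|epoll_pwait2)$/)
                epoll++
            else if (name ~ /^(select|pselect6|poll|ppoll)$/)
                legacy++
        }
        END { printf "epoll=%d select_poll=%d\n", epoll, legacy }'
}

# Usage (trace.log is a placeholder file name):
#   strace -p "$TID" -T -tt -o trace.log    # stop with Ctrl-C after a while
#   event_model < trace.log
```

A non-zero select_poll count after a migration suggests that some code path, possibly inside a library, still uses the older interface.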