From Zero to One: Mastering Linux NAPI High‑Concurrency Packet Reception

This article explains why traditional interrupt‑driven network I/O stalls under high traffic, introduces the NAPI (New API) mechanism that combines interrupt wake‑up with batch polling, details its core data structures and scheduling functions, and provides step‑by‑step configuration and tuning guidance to achieve efficient, low‑latency packet processing on Linux servers.

Deepin Linux

In high‑concurrency network scenarios, packet‑receive performance, throughput, and CPU utilization determine a service's capacity. Many developers encounter CPU spikes, packet stalls, or interrupt storms without understanding the root cause, which often lies in the traditional interrupt‑driven I/O model.

1. Problems of the Traditional Interrupt‑Driven Model

The classic model triggers a hardware interrupt for each incoming packet. The CPU saves the current task state, jumps to the interrupt service routine (ISR), copies the packet from the NIC buffer to kernel memory, performs basic checks, and then restores the task. Because the interrupt rate grows with the packet rate, heavy load (hundreds of thousands of packets per second) leads to an "interrupt storm": the CPU can spend more than 80% of its time handling interrupts, suffers cache misses, and incurs costly context switches, which degrades latency and overall system stability.
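The scaling problem can be made concrete with a back-of-the-envelope model. The Python sketch below is illustrative only: the function name and the 2 µs cost per interrupt are assumptions chosen for the example, not measured values.

```python
def interrupt_cpu_fraction(packets_per_second, cost_per_interrupt_us):
    """Fraction of one CPU core spent in per-packet interrupts (capped at 1.0).

    cost_per_interrupt_us is a hypothetical figure covering state save,
    ISR execution, and state restore for a single interrupt.
    """
    busy_seconds_per_second = packets_per_second * cost_per_interrupt_us / 1_000_000
    return min(busy_seconds_per_second, 1.0)


if __name__ == "__main__":
    # Assumed cost of 2 microseconds per interrupt, one interrupt per packet.
    for pps in (10_000, 100_000, 400_000):
        share = interrupt_cpu_fraction(pps, cost_per_interrupt_us=2.0)
        print(f"{pps:>7} pps -> {share:.0%} of one core in interrupt handling")
```

Under these assumed numbers the interrupt share grows linearly with the packet rate, which is why a per-packet interrupt model stops scaling at high rates.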

2. NAPI: The New API

2.1 What Is NAPI?

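NAPI (New API) is the Linux kernel's packet-reception interface for network drivers. It mitigates interrupt storms with a hybrid "interrupt + polling" approach. When traffic is low, each arriving packet raises an interrupt that wakes the CPU, as before. Under sustained traffic, the first interrupt disables the device's receive interrupts and the kernel switches to batch polling, processing many packets per scheduling pass; interrupts are re-enabled only once the receive queue has been drained. The switch is not driven by a configured traffic threshold: polling simply continues for as long as packets keep arriving.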

2.2 How NAPI Works

Upon packet arrival, the NIC raises a hard interrupt. The interrupt handler does minimal work: it disables further receive interrupts from the device and attaches the device's napi_struct to the current CPU's softnet poll list, which raises the NET_RX soft interrupt. The soft-interrupt handler then invokes the driver-provided poll callback, which processes packets until either the configured weight (the maximum number of packets per poll call) is reached or the receive queue is empty. If the queue was emptied, the driver completes the poll and re-enables the hard interrupt; if the full weight was consumed, the napi_struct stays queued and is polled again on a later pass.
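The flow above can be sketched as a toy simulation. This is an illustrative userspace model in Python, not kernel code: SimulatedNic, run_burst, and the burst sizes are hypothetical, and the budget of 64 mirrors the kernel's conventional default weight (NAPI_POLL_WEIGHT).

```python
from collections import deque


class SimulatedNic:
    """Toy NIC: an RX ring plus an interrupt-enable flag (not a real driver)."""

    def __init__(self):
        self.rx_ring = deque()
        self.irq_enabled = True
        self.hard_irqs = 0

    def receive(self, packet):
        """A packet arrives; a hard interrupt fires only if interrupts are enabled."""
        self.rx_ring.append(packet)
        if self.irq_enabled:
            self.hard_irqs += 1
            self.irq_enabled = False  # handler masks RX interrupts and schedules polling

    def poll(self, budget):
        """Model of the driver poll callback: handle up to `budget` packets."""
        work_done = 0
        while work_done < budget and self.rx_ring:
            self.rx_ring.popleft()
            work_done += 1
        if work_done < budget:
            self.irq_enabled = True  # ring drained: leave polling mode
        return work_done


def run_burst(packet_count, budget=64):
    """Deliver a burst, then poll until interrupts are re-enabled.

    Returns (hard interrupts raised, poll calls made).
    """
    nic = SimulatedNic()
    for i in range(packet_count):
        nic.receive(i)
    polls = 0
    while not nic.irq_enabled:
        nic.poll(budget)
        polls += 1
    return nic.hard_irqs, polls
```

In this model, run_burst(200) returns (1, 4): one hard interrupt and four poll passes for 200 packets. A burst of exactly 64 packets needs two passes, because a poll that consumes its whole budget cannot yet know the ring is empty.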

This batch processing can dramatically reduce the number of hard interrupts under load (for example, from thousands per second to a few dozen, depending on traffic pattern and driver), while also lowering cache disruption and context-switch overhead.

2.3 Core Data Structures and Functions

The central structure is struct napi_struct, which holds the poll-list linkage, state flags, weight, GRO state, the poll callback pointer, the associated net_device, and a timer. Key functions include:

netif_napi_add(): registers a napi_struct with a network device and sets the poll callback. In kernels from about 6.1 onward it no longer takes a weight argument; netif_napi_add_weight() serves drivers that need a non-default weight.
napi_schedule_prep() and __napi_schedule(): atomically check scheduling eligibility, then attach the napi_struct to the per-CPU softnet poll list.
napi_schedule(): wrapper that performs the prep check and the actual scheduling.
net_rx_action() and napi_poll(): net_rx_action() is the NET_RX soft-interrupt handler; for each queued instance it calls napi_poll(), which invokes the driver's poll function, respects the weight, and decides whether to finish or re-queue.
napi_gro_receive(): merges packets via Generic Receive Offload (GRO) and hands them to the protocol stack.
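The role of the prep check is easiest to see in a small model. The Python sketch below is an assumption-laden illustration, not the kernel implementation: NapiModel, schedule, and complete are hypothetical names, and a lock stands in for the kernel's atomic bit operation on the state flags.

```python
import threading


class NapiModel:
    """Toy model of the NAPI 'scheduled' state bit (not kernel code)."""

    def __init__(self, name):
        self.name = name
        self._scheduled = False
        self._lock = threading.Lock()  # stands in for an atomic test-and-set

    def schedule_prep(self):
        """Claim the instance; True only for the caller that set the bit."""
        with self._lock:
            if self._scheduled:
                return False
            self._scheduled = True
            return True

    def complete(self):
        """Polling finished: clear the bit so a later interrupt can schedule again."""
        with self._lock:
            self._scheduled = False


def schedule(napi, poll_list):
    """Model of the schedule wrapper: prep check, then queue on the poll list."""
    if napi.schedule_prep():
        poll_list.append(napi)
        return True
    return False
```

Calling schedule() twice in a row on the same instance queues it only once; the second call returns False until complete() clears the flag. That is the property that keeps one instance from appearing on a poll list twice.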

3. Configuring NAPI for High‑Concurrency Reception

3.1 Hardware Selection

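NAPI is implemented by the NIC driver rather than by the hardware, and virtually all current Linux Ethernet drivers use it, so choose a NIC with a well-maintained driver and multiple receive queues. One example is the Intel I350-T4 (igb driver), which offers four 1 GbE ports and a configurable receive-ring depth (inspect it with ethtool -g and change it with ethtool -G). Larger ring depths allow more packets to be buffered during traffic spikes.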

3.2 Kernel Parameters

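Adjust net.core.netdev_max_backlog (default 1000) to a higher value (e.g., 5000) to enlarge the kernel's per-CPU input backlog queue and prevent packet drops under load. This queue is used when RPS hands packets to another CPU and by non-NAPI receive paths; NAPI drivers otherwise deliver packets to the stack directly from the NIC ring. Increase net.ipv4.tcp_max_syn_backlog (e.g., to 2048) to accommodate more half-open TCP connections and mitigate SYN-flood effects.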

Persist changes via /etc/sysctl.conf and apply with sysctl -p.
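Whether netdev_max_backlog is actually too small can be judged from /proc/net/softnet_stat, where each row holds per-CPU counters in hexadecimal: the first column counts processed packets, the second counts packets dropped because the backlog queue was full, and the third counts passes where the soft interrupt ran out of budget. The Python sketch below decodes those three columns; the function name and the sample text in the usage note are illustrative.

```python
def parse_softnet_stat(text):
    """Decode the first three hex counters of each /proc/net/softnet_stat row."""
    rows = []
    for row_index, line in enumerate(text.strip().splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty rows
        rows.append({
            "row": row_index,
            "processed": int(fields[0], 16),
            "dropped": int(fields[1], 16),
            "time_squeeze": int(fields[2], 16),
        })
    return rows


def rows_with_drops(rows):
    """Return only the rows whose 'dropped' counter is non-zero."""
    return [row for row in rows if row["dropped"] > 0]
```

On a live system the input would be the contents of /proc/net/softnet_stat. A dropped counter that keeps growing between two readings suggests the backlog limit is being hit.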

3.3 Driver Configuration

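Ensure the driver version matches the kernel. In current kernels NAPI is built into the mainstream Ethernet drivers unconditionally; a separate build option such as CONFIG_E1000_NAPI for the e1000 driver existed only in older 2.6-series kernels. The driver registers its poll callback with netif_napi_add() during device initialization.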

4. Performance Tuning

4.1 CPU‑Centric Optimizations

Enable Receive Packet Steering (RPS) to hash packet flow attributes and distribute packets across CPUs, reducing per-CPU load. Configure it by writing a hexadecimal CPU bitmask to /sys/class/net/eth0/queues/rx-0/rps_cpus (e.g., echo 3 for CPUs 0 and 1; the kernel's own examples write the mask without a 0x prefix). Use Receive Flow Steering (RFS) to steer the packets of a flow to the CPU where the consuming application runs, improving cache reuse; RFS is enabled through net.core.rps_sock_flow_entries and the per-queue rps_flow_cnt file. Set CPU affinity for interrupt handling via /proc/irq/<irq>/smp_affinity and bind high-traffic processes with taskset or sched_setaffinity.
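Both rps_cpus and smp_affinity take a CPU bitmask in which bit N selects CPU N. The Python helper below is a small sketch for deriving that mask from a list of CPU indices; cpu_mask is a hypothetical name, and the output uses the bare hexadecimal form with comma-separated 32-bit groups that the kernel uses for masks wider than 32 CPUs.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string selecting the given CPU indices."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu  # bit N selects CPU N
    digits = f"{mask:x}"
    # Split into 8-hex-digit (32-bit) groups, most significant group first.
    chunks = []
    while digits:
        chunks.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(chunks))
```

For example, cpu_mask([0, 1]) yields "3", matching the CPUs 0 and 1 case above, and cpu_mask([32]) yields "1,00000000".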

4.2 Memory‑Related Tuning

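Size the TCP buffers (net.ipv4.tcp_rmem and net.ipv4.tcp_wmem, each a triple of minimum, default, and maximum bytes) to provide large enough receive and send windows, for example "4096 65536 131072". Check the current values with sysctl first: many current distributions already ship a maximum well above 131072, in which case this example would shrink the buffers rather than enlarge them. Raise net.core.rmem_max and net.core.rmem_default (e.g., to 16777216 and 262144) to enlarge socket receive buffers, reducing drops caused by buffer overflow.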
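A common way to reason about buffer sizes is the bandwidth-delay product: the amount of data that must be in flight to keep a link busy. The Python sketch below applies that rule of thumb; the function names and the 1 Gbit/s, 10 ms figures are assumptions for illustration, and the throughput figure is an upper bound because part of the socket buffer is reserved for bookkeeping overhead.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes for a link speed and round-trip time."""
    return bandwidth_bits_per_s * rtt_ms // 8000  # /8 for bytes, /1000 for ms


def max_throughput_bits_per_s(buffer_bytes, rtt_ms):
    """Upper bound on throughput when at most `buffer_bytes` can be in flight."""
    return buffer_bytes * 8 * 1000 // rtt_ms


if __name__ == "__main__":
    # Assumed link: 1 Gbit/s with a 10 ms round-trip time.
    print("buffer needed:", bdp_bytes(1_000_000_000, 10), "bytes")
    print("ceiling with a 131072-byte buffer:",
          max_throughput_bits_per_s(131072, 10), "bit/s")
```

Under these assumptions the link needs about 1.25 MB in flight, while a 131072-byte maximum caps a single connection at roughly 105 Mbit/s, so the right maximum depends on the link speed and latency being served.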

4.3 Network Stack Adjustments

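Extend the SYN queue length (net.ipv4.tcp_max_syn_backlog). Delayed ACKs, which batch acknowledgments and cut ACK overhead, are enabled by default in Linux TCP; the delay can be tuned through net.ipv4.tcp_delack_time (e.g., to 50 ms) where the running kernel exposes that parameter. It is not present on every kernel, so confirm it with sysctl -a before relying on it.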

5. Pitfalls and Troubleshooting

5.1 Common Issues

Packet loss often stems from receive‑queue overflow or delayed interrupt handling. High CPU utilization can cause scheduling imbalance, leading to latency spikes.

5.2 Diagnosis Tools

tcpdump -i eth0 to capture traffic and verify loss patterns.
ethtool -S eth0 to inspect NIC statistics such as rx_dropped.
top and top -H -p <pid> to locate CPU-heavy threads.
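Reading NIC statistics by eye gets tedious on drivers that export hundreds of counters. The Python sketch below parses the "name: value" lines that ethtool -S prints and keeps the non-zero counters that look drop-related. The function names and the keyword list are assumptions: counter names are driver-specific, so the keywords may need adjusting for a given NIC.

```python
def parse_ethtool_stats(text):
    """Parse 'name: value' lines from ethtool -S output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # no colon on this line
        value = value.strip()
        if value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)  # header lines have no number and are skipped
    return stats


def nonzero_drop_counters(stats, keywords=("drop", "miss", "fifo")):
    """Keep non-zero counters whose name contains one of the (assumed) keywords."""
    return {
        name: value
        for name, value in stats.items()
        if value and any(word in name for word in keywords)
    }
```

Feeding it the captured output of ethtool -S eth0 and comparing two snapshots taken a few seconds apart shows which drop counters are still increasing.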

5.3 Remedies

Increase receive buffers (net.core.rmem_max, net.core.rmem_default), bind NIC interrupts to less-loaded CPUs, and streamline interrupt handlers to avoid long-running operations. Use cgroups or adjust process priorities to prevent CPU contention.

By understanding the underlying NAPI workflow, configuring appropriate hardware and kernel parameters, and applying targeted CPU, memory, and stack optimizations, developers can achieve stable, high‑throughput packet processing on modern Linux servers.

