
Capturing 10 Million Packets per Second on Linux without Specialized Libraries

This article explains how to achieve multi‑million‑packet‑per‑second network capture on Linux 3.16 using standard C/C++ code, by distributing interrupts across cores, employing AF_PACKET with FANOUT and RX_RING, and optimizing memory handling to eliminate costly kernel locks and copies.

Architects Research Society

In this article I describe how to capture up to 10 million packets per second on a Linux 3.16 system without relying on specialized libraries such as Netmap, PF_RING, or DPDK, using only standard C/C++ code.

Understanding the Limitations of pcap

Traditional packet capture tools based on pcap (e.g., iftop, tcpdump, arpwatch) suffer from high CPU load because every packet is copied from kernel space to user space through its own recv system call. On a 10 GE NIC receiving minimum-size frames at line rate this can mean more than 14 million system calls per second, which quickly becomes the bottleneck.
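To see where the figure of 14 million comes from, the following sketch computes the theoretical packet rate of an Ethernet link. It assumes minimum-size 64-byte frames and 20 bytes of per-frame overhead on the wire (preamble, start-of-frame delimiter and inter-frame gap). The function and macro names are illustrative and do not come from the original code.

```c
#include <stdint.h>

/* Per-frame wire overhead: 7-byte preamble + 1-byte start-of-frame
 * delimiter + 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Theoretical maximum packets per second for a link of link_bps bits/s
 * carrying frames of frame_bytes each (frame length includes the FCS). */
static uint64_t max_packets_per_second(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_on_wire = (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    return link_bps / bits_on_wire;
}
```

For a 10 Gbit/s link and 64-byte frames this gives 10,000,000,000 / 672, or about 14.88 million packets per second, so one recv call per packet implies roughly that many system calls per second.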

Additionally, most capture applications run on a single logical core, while the NIC may distribute incoming packets across many cores. Kernel locks are taken when multiple cores contend for the same resources, and these locks can consume up to 90 % of CPU time.

Distributing Interrupts to All Cores

First, enable promiscuous mode on the NIC:

ifconfig eth6 promisc

Then assign each NIC queue to a different logical core so that no single core handles all receive interrupts. The following script distributes the interrupts from all eight queues of an ixgbe NIC across the eight available logical CPUs:

#!/bin/bash
# Spread the NIC's per-queue interrupts across all logical CPUs (run as root).
ncpus=$(grep -ciw ^processor /proc/cpuinfo)
[ "$ncpus" -gt 1 ] || exit 1
n=0
# The first column of /proc/interrupts is the IRQ number followed by a colon.
for irq in $(awk '/eth/ { sub(":", "", $1); print $1 }' /proc/interrupts)
do
  f="/proc/irq/$irq/smp_affinity"
  [ -r "$f" ] || continue
  # Queue 0 goes to the highest-numbered CPU, queue 1 to the next one down.
  cpu=$(( ncpus - (n % ncpus) - 1 ))
  mask=$(printf %x $(( 1 << cpu )))
  echo "Assign SMP affinity: eth queue $n, irq $irq, cpu $cpu, mask 0x$mask"
  echo "$mask" > "$f"
  n=$(( n + 1 ))
done

After applying this script the observed packet‑per‑second rate rose to about 12 MPPS, and CPU usage became evenly distributed across cores.
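The queue-to-CPU arithmetic used by the script can be checked in isolation. The sketch below reproduces it in C; the function names are hypothetical, and the mask helper assumes fewer than 32 CPUs, since larger smp_affinity masks are written as comma-separated 32-bit groups.

```c
#include <stdint.h>

/* Same mapping as the script: queue 0 goes to the highest-numbered CPU,
 * queue 1 to the next one down, wrapping around after ncpus queues. */
static unsigned queue_to_cpu(unsigned queue, unsigned ncpus)
{
    return ncpus - (queue % ncpus) - 1;
}

/* smp_affinity takes a hexadecimal CPU bitmask in which bit N selects CPU N.
 * Assumption of this sketch: cpu < 32. */
static uint32_t cpu_to_mask(unsigned cpu)
{
    return UINT32_C(1) << cpu;
}
```

With eight CPUs, queue 0 maps to CPU 7 with mask 0x80, and queue 7 maps to CPU 0 with mask 0x1.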

AF_PACKET Capture without Optimizations

Running a basic AF_PACKET capture shows the problem: although the application itself runs on a single core, all CPUs quickly saturate, and profiling reveals that most of the time is spent in kernel spin locks and packet processing functions.

We process: 222048 pps
We process: 186315 pps
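For orientation, an unoptimized AF_PACKET capture of this kind can look roughly like the sketch below. It is not the original program: the function names and the 2048-byte buffer size are assumptions made for illustration. Opening the socket requires CAP_NET_RAW, which in practice usually means running as root.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fills the link-layer address used to bind a packet socket to one interface. */
static void fill_bind_addr(struct sockaddr_ll *addr, unsigned ifindex)
{
    memset(addr, 0, sizeof(*addr));
    addr->sll_family   = AF_PACKET;
    addr->sll_protocol = htons(ETH_P_ALL);   /* every protocol */
    addr->sll_ifindex  = (int)ifindex;
}

/* Opens a raw packet socket bound to ifname.
 * Returns the descriptor, or -1 on error. */
static int open_capture_socket(const char *ifname)
{
    unsigned ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -1;                            /* no such interface */

    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    struct sockaddr_ll addr;
    fill_bind_addr(&addr, ifindex);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Unoptimized loop: one recv system call and one copy per packet.
 * Returns the number of packets read before recv failed or
 * max_packets was reached. */
static unsigned long capture_loop(int fd, unsigned long max_packets)
{
    unsigned char buf[2048];
    unsigned long packets = 0;
    while (packets < max_packets && recv(fd, buf, sizeof(buf), 0) >= 0)
        packets++;
    return packets;
}
```

A caller would open the socket with open_capture_socket("eth6") and pass the descriptor to capture_loop. Every packet costs a system call and a copy into buf, which is exactly the overhead the following sections remove.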

Using FANOUT to Parallelise Capture

By enabling PACKET_FANOUT_CPU and spawning one capture process per logical core, the load is spread and lock contention disappears. In the code, setting the flag bool use_multiple_fanout_processes = true activates this mode.

We process: 2250709 pps
We process: 2234301 pps
We process: 2266138 pps

CPU utilization remains high on each core, but profiling shows that the previous lock‑related hotspots have vanished.
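Joining a fanout group is a single setsockopt call on a packet socket that is already bound to the interface. The sketch below shows the option value as described in packet(7): the group id occupies the low 16 bits and the mode the high 16 bits. The helper names are hypothetical, and the choice of group id is left to the caller.

```c
#include <linux/if_packet.h>  /* PACKET_FANOUT, PACKET_FANOUT_CPU */
#include <stdint.h>
#include <sys/socket.h>       /* setsockopt, SOL_PACKET */

/* Packs the fanout group id (low 16 bits) and the distribution mode
 * (high 16 bits) into the integer expected by PACKET_FANOUT. */
static int fanout_arg(uint16_t group_id, uint16_t mode)
{
    uint32_t value = (uint32_t)group_id | ((uint32_t)mode << 16);
    return (int)value;
}

/* Adds an already bound packet socket to a fanout group that selects the
 * receiving socket by the CPU on which the packet arrived.
 * Returns 0 on success and -1 on error, like setsockopt. */
static int join_fanout_group(int fd, uint16_t group_id)
{
    int arg = fanout_arg(group_id, PACKET_FANOUT_CPU);
    return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg));
}
```

Each worker opens its own socket, binds it to the same interface and joins the same group id. The kernel then delivers every packet to exactly one socket in the group, so the workers do not contend for a shared receive queue.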

RX_RING Circular Buffer Optimization

Further speed gains are achieved by using an RX_RING buffer, which eliminates the second memory copy from kernel to user space. With this technique the capture reaches roughly 4 MPPS.

We process: 3582498 pps
We process: 3757254 pps
We process: 3815506 pps
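Setting up such a ring could look like the sketch below. It assumes the block-based TPACKET_V3 variant of the ring, which the text does not name explicitly; the helper names and the 60 ms block timeout are illustrative values, not taken from the original code.

```c
#include <linux/if_packet.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Describes a ring of block_nr blocks of block_size bytes each.
 * block_size should be a multiple of the page size and of frame_size. */
static void fill_ring_req(struct tpacket_req3 *req, unsigned block_size,
                          unsigned block_nr, unsigned frame_size)
{
    memset(req, 0, sizeof(*req));
    req->tp_block_size     = block_size;
    req->tp_block_nr       = block_nr;
    req->tp_frame_size     = frame_size;
    req->tp_frame_nr       = (block_size / frame_size) * block_nr;
    req->tp_retire_blk_tov = 60;  /* ms before a partly filled block is handed over */
}

/* Switches the socket to TPACKET_V3, asks the kernel for the ring and
 * maps it into the process. Returns the mapping, or NULL on error. */
static void *map_rx_ring(int fd, const struct tpacket_req3 *req)
{
    int version = TPACKET_V3;
    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) < 0)
        return NULL;
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        return NULL;

    size_t ring_bytes = (size_t)req->tp_block_size * req->tp_block_nr;
    void *ring = mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return ring == MAP_FAILED ? NULL : ring;
}
```

As an example geometry, 64 blocks of 4 MiB with 2048-byte frames give a 256 MiB ring holding 131072 frames. Because the kernel writes packets directly into the mapped memory, the application reads them in place instead of copying each one out with recv.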

The approach also switches from per‑packet recv calls to a poll-based model that wakes the application only when a whole block of packets is ready.
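A minimal sketch of that block-oriented loop follows, again assuming the TPACKET_V3 ring layout from linux/if_packet.h. The function names are hypothetical, and the packet handling is reduced to counting packets and captured bytes.

```c
#include <linux/if_packet.h>
#include <poll.h>
#include <stdint.h>

/* Sleeps in poll until the kernel marks the block as ready for user space.
 * One wake-up covers a whole block rather than a single packet. */
static int wait_for_block(int fd, const volatile struct tpacket_block_desc *block)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLERR, .revents = 0 };
    while ((block->hdr.bh1.block_status & TP_STATUS_USER) == 0) {
        if (poll(&pfd, 1, -1) < 0)
            return -1;
    }
    return 0;
}

/* Counts the packets in one block by following the tp_next_offset chain,
 * adds their captured lengths to *bytes, then returns the block to the
 * kernel. Returns the number of packets in the block. */
static unsigned walk_block(struct tpacket_block_desc *block, uint64_t *bytes)
{
    unsigned num_pkts = block->hdr.bh1.num_pkts;
    struct tpacket3_hdr *pkt = (struct tpacket3_hdr *)
        ((uint8_t *)block + block->hdr.bh1.offset_to_first_pkt);

    for (unsigned i = 0; i < num_pkts; i++) {
        /* The packet data itself starts at (uint8_t *)pkt + pkt->tp_mac. */
        *bytes += pkt->tp_snaplen;
        pkt = (struct tpacket3_hdr *)((uint8_t *)pkt + pkt->tp_next_offset);
    }

    /* Finish all reads before handing the block back for refilling. */
    __sync_synchronize();
    block->hdr.bh1.block_status = TP_STATUS_KERNEL;
    return num_pkts;
}
```

The capture loop would call wait_for_block on the current block, process it with walk_block, and then advance to the next block index modulo the number of blocks in the ring.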

Combining RX_RING with FANOUT

Applying FANOUT to the RX_RING setup further reduces lock contention and pushes the throughput to around 9 MPPS.

We process: 9611580 pps
We process: 8912556 pps
We process: 8941682 pps

Profiling now shows that the dominant kernel functions are the NIC driver’s interrupt handler and packet reception routines, while lock‑related overhead is minimal.

Conclusion

Linux provides a powerful platform for ultra‑high‑speed packet capture without the need for custom kernel modules. By distributing interrupts, using AF_PACKET with FANOUT, and employing RX_RING circular buffers, it is possible to approach the wire speed of 10 GE NICs on modest hardware.

Recommended reading:

packet_mmap documentation

packet(7) man page
