
Boost Linux Network Performance: Master RSS, RPS, RFS, and XPS Techniques

This article explains Linux network‑stack enhancements—RSS, RPS, RFS, accelerated RFS, and XPS—detailing their purpose, configuration steps, and recommended settings to improve parallelism and latency on multi‑CPU systems.

MaGe Linux Operations

Jul 10, 2024


Introduction

This article describes a set of supplemental techniques in the Linux network stack that increase parallelism and performance on multiprocessor systems.

The techniques covered are:

RSS: Receive Side Scaling

RPS: Receive Packet Steering

RFS: Receive Flow Steering

aRFS: Accelerated Receive Flow Steering

XPS: Transmit Packet Steering

RSS: Receive Side Scaling

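Modern NICs support multiple receive and transmit descriptor queues. On receive, RSS applies a hash-based filter to each packet and distributes packets across the queues so that every packet of a flow lands on the same queue, which can then be serviced by its own CPU. The filter is typically a hash over the network and transport headers, for example a 4-tuple hash over the IP addresses and TCP ports. The most common hardware implementation uses a 128-entry indirection table in which each entry stores a queue number; the low-order seven bits of the hash (usually a Toeplitz hash) select the entry.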
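The table lookup can be sketched in a few lines of shell. This is only a model: the hash value is passed in rather than computed, the table is assumed to be filled round-robin, and rss_queue is a hypothetical helper name.

```shell
# rss_queue HASH NQUEUES
# Model of a 128-entry RSS indirection table, assumed to be filled
# round-robin (entry i holds queue i mod NQUEUES).
rss_queue() (
    index=$(( $1 & 127 ))        # low-order seven bits of the hash
    echo $(( index % $2 ))       # queue number stored in that entry
)

rss_queue 0x1a2b3c4d 6   # prints 5
```

Because the index depends only on the hash, every packet of a flow maps to the same table entry and therefore to the same queue.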

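Some advanced NICs can also steer packets with programmable n-tuple filters, configured through ethtool (--config-ntuple). For example, packets bound for a web server on TCP port 80 can be directed to their own receive queue.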

RSS configuration

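Drivers for multi-queue NICs typically provide a kernel module parameter that sets the number of hardware queues (for example, num_queues in the bnx2x driver). A typical configuration assigns one receive queue per CPU if the device supports enough queues, or otherwise at least one per memory domain, meaning a set of CPUs that share a memory level such as a cache or a NUMA node.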

The driver usually programs the indirection table at initialization, distributing the queues evenly across its entries. The table can be inspected and modified at runtime with ethtool --show-rxfh-indir and ethtool --set-rxfh-indir.
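As an illustration, the helper below builds (but does not run) an ethtool command that restricts RSS to the first few queues by giving the remaining queues a weight of zero. The function name indir_weight_cmd and the device name eth0 are placeholders. The weight keyword is an option of ethtool --set-rxfh-indir; whether a particular driver accepts it is an assumption to verify on the target system.

```shell
# indir_weight_cmd DEV NQUEUES ACTIVE
# Print an ethtool command that spreads flows evenly over the first
# ACTIVE queues and gives the remaining queues a weight of 0.
indir_weight_cmd() (
    dev=$1 nqueues=$2 active=$3
    weights=""
    q=0
    while [ "$q" -lt "$nqueues" ]; do
        if [ "$q" -lt "$active" ]; then w=1; else w=0; fi
        weights="$weights $w"
        q=$(( q + 1 ))
    done
    echo "ethtool --set-rxfh-indir $dev weight$weights"
)

indir_weight_cmd eth0 4 2   # ethtool --set-rxfh-indir eth0 weight 1 1 0 0
```

Inspect the current table with ethtool --show-rxfh-indir before and after applying the printed command as root.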

RSS IRQ configuration

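Each receive queue has its own IRQ, which the NIC raises to notify a CPU when packets arrive on that queue. PCIe devices signal with MSI-X, which can route each interrupt to a particular CPU. The active queue-to-IRQ mapping is visible in /proc/interrupts. By default an IRQ may be handled on any CPU; its affinity can be set manually, but many systems run irqbalance, which may override manual settings.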
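A minimal sketch of manual pinning: the helper prints the command that would bind one IRQ to one CPU by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. The IRQ numbers 89 and 90 are example values, irq_pin_cmd is a hypothetical name, and the sketch assumes CPU numbers below 32 so that the mask fits in one word.

```shell
# irq_pin_cmd IRQ CPU
# Print the command that pins IRQ to a single CPU. The mask is a
# hexadecimal bitmap with bit N set for CPU N.
irq_pin_cmd() (
    mask=$(printf '%x' $(( 1 << $2 )))
    echo "echo $mask > /proc/irq/$1/smp_affinity"
)

irq_pin_cmd 89 0   # echo 1 > /proc/irq/89/smp_affinity
irq_pin_cmd 90 1   # echo 2 > /proc/irq/90/smp_affinity
```

The printed commands need root, and irqbalance may rewrite the affinity afterwards if it is still running.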

Recommended settings

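Enable RSS when latency is a concern or when receive-interrupt processing becomes a bottleneck; spreading load across CPUs shortens queues. For low-latency networking, allocate as many queues as there are CPUs (or the NIC maximum, if lower). For high packet rates, the most efficient configuration is likely the smallest number of receive queues at which no queue overflows because of a saturated CPU, since with interrupt coalescing enabled the total number of interrupts grows with each additional queue.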

Use mpstat to monitor per-CPU load, keeping in mind that each hyperthread is reported as a separate CPU. Hyper-threading generally does not improve interrupt handling, so limit the number of queues to the number of physical cores.

RSS is a NIC feature that relies on hardware queues. To verify RSS support, check whether the interface has multiple interrupt lines in /proc/interrupts. Example for interface p1p1 with six receive queues:
<code># egrep 'CPU|p1p1' /proc/interrupts
       CPU0    CPU1    CPU2    CPU3    CPU4    CPU5
89:   40187       0       0       0       0       0   IR-PCI-MSI-edge   p1p1-0
90:       0     790       0       0       0       0   IR-PCI-MSI-edge   p1p1-1
...</code>
The queues are also listed under /sys/class/net/<dev>/queues. Queue counts can be displayed with ethtool -l and changed with ethtool -L, though many cloud environments restrict these operations.
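The check can be scripted. The sketch below counts queue interrupts in /proc/interrupts-style text, assuming the driver names them <dev>-<n> as in the p1p1 example; other drivers use different naming schemes, and count_irq_queues is a hypothetical helper.

```shell
# count_irq_queues DEV
# Read /proc/interrupts-style text on stdin and count the lines whose
# last field is DEV-<n>, i.e. one line per queue interrupt.
count_irq_queues() (
    awk -v dev="$1" '
        $NF ~ ("^" dev "-[0-9]+$") { n++ }
        END { print n + 0 }
    '
)

# Sample input shaped like the p1p1 example; on a real system use:
#   count_irq_queues p1p1 < /proc/interrupts
printf '%s\n' \
    ' 89: 40187 0 0 0 0 0 IR-PCI-MSI-edge p1p1-0' \
    ' 90: 0 790 0 0 0 0 IR-PCI-MSI-edge p1p1-1' |
    count_irq_queues p1p1   # prints 2
```

A result greater than 1 suggests the interface exposes multiple hardware queues.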

RPS: Receive Packet Steering

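RPS is logically a software implementation of RSS and therefore runs later in the datapath. Whereas RSS selects the queue, and hence the CPU, that runs the hardware interrupt handler, RPS selects the CPU that performs protocol processing above the interrupt handler: get_rps_cpu() picks a CPU, and the packet is placed on that CPU's backlog queue. Compared with RSS, RPS works with any NIC, makes it easy to add software filters for new protocols, and does not increase the hardware interrupt rate, although it does introduce inter-processor interrupts (IPIs).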

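RPS first computes a flow hash over the packet's addresses or ports (a 2-tuple or 4-tuple, depending on the protocol). If the NIC supplies a hash, RPS uses that value; otherwise the stack computes one in software. The hash is stored in skb->hash and is used to select the target CPU.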
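How a hash becomes a CPU can be sketched as follows. The sketch assumes the multiply-and-shift scaling, (hash * count) >> 32, that recent kernels use to map a 32-bit hash onto the list of CPUs enabled for the queue. The hash values and CPU lists are made up, and rps_pick_cpu is a hypothetical helper.

```shell
# rps_pick_cpu HASH CPU...
# Map a 32-bit flow hash onto the list of CPUs enabled in rps_cpus.
rps_pick_cpu() (
    hash=$(( $1 ))
    shift
    index=$(( (hash * $#) >> 32 ))   # scales the hash to 0 .. count-1
    shift "$index"
    echo "$1"
)

rps_pick_cpu 0x80000000 0 1 2 3   # prints 2
rps_pick_cpu 0x10000000 4 5 6 7   # prints 4
```

Scaling instead of taking a remainder avoids a division and works for any number of CPUs, not just powers of two.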

RPS configuration

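RPS requires a kernel built with CONFIG_RPS (on by default for SMP) and stays disabled until it is configured. Each receive queue has a hexadecimal CPU bitmap in /sys/class/net/<dev>/queues/rx-<n>/rps_cpus that lists the CPUs allowed to process its packets. A value of 0 disables RPS for that queue.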
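The bitmap sets bit N for CPU N and is written in hexadecimal. A small sketch for building it from a CPU list, assuming CPU numbers below 32 so that the mask fits in one 32-bit word; cpu_mask and eth0 are placeholder names.

```shell
# cpu_mask CPU...
# Build the hexadecimal bitmap expected by rps_cpus: bit N set for CPU N.
# Sketch only: assumes CPU numbers below 32 (one 32-bit word).
cpu_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)

cpu_mask 0 1 2 3   # prints f
cpu_mask 4 6       # prints 50

# Hypothetical usage as root, steering rx-0 of eth0 to CPUs 0-3:
#   echo "$(cpu_mask 0 1 2 3)" > /sys/class/net/eth0/queues/rx-0/rps_cpus
```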

Recommended settings

For a single-queue device, set rps_cpus to the CPUs in the same memory domain as the interrupt CPU, or to all CPUs if NUMA locality is not a concern. For a multi-queue device where RSS already maps one hardware queue to each CPU, RPS is probably redundant. If there are fewer hardware queues than CPUs, RPS can still help when each queue's rps_cpus covers the CPUs that share a memory domain with that queue's interrupt CPU.

The default value of rps_cpus is 0. In that case RPS is off for the queue and packets are processed on the CPU that handled the interrupt.

RFS: Receive Flow Steering

RFS extends RPS by steering each packet to the CPU where the application thread consuming it is running, which improves cache locality. It uses the same flow hash as RPS, but as an index into a global flow table (rps_sock_flow_table) that records, for each flow, the CPU on which the application last processed it.

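When a packet arrives, RFS looks up the desired CPU for its flow in the global table. If the entry holds no valid CPU, plain RPS steering is used. Otherwise the kernel compares the desired CPU with the CPU currently handling the flow. If they differ, the flow moves to the desired CPU only once all packets already queued for it on the current CPU have been processed, so that packets are never delivered out of order.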
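The migration rule can be modeled with two counters: the head counter of the current CPU's backlog queue and the tail counter recorded when the flow's last packet was enqueued. The sketch below is a simplified model under those assumptions, not the kernel's code; it ignores offline CPUs and counter wraparound, and rfs_next_cpu is a hypothetical name.

```shell
# rfs_next_cpu DESIRED CURRENT HEAD LAST_TAIL
#   DESIRED   CPU recorded for the flow in the global table
#   CURRENT   CPU now handling the flow, or "none" if unset
#   HEAD      head counter of CURRENT's backlog queue
#   LAST_TAIL tail counter recorded when the flow's last packet was queued
rfs_next_cpu() (
    desired=$1 current=$2 head=$3 last_tail=$4
    if [ "$current" = "none" ]; then
        echo "$desired"                      # nothing queued elsewhere
    elif [ "$desired" = "$current" ]; then
        echo "$current"                      # already in the right place
    elif [ $(( head - last_tail )) -ge 0 ]; then
        echo "$desired"                      # old packets drained: safe to move
    else
        echo "$current"                      # moving now could reorder packets
    fi
)

rfs_next_cpu 3 1 100 90    # prints 3 (backlog on CPU 1 has drained)
rfs_next_cpu 3 1 100 105   # prints 1 (packets still pending on CPU 1)
```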

RFS configuration

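RFS is available when the kernel is built with CONFIG_RPS (on by default for SMP) and stays disabled until it is configured. The number of entries in the global flow table is set via /proc/sys/net/core/rps_sock_flow_entries, and the number of entries in each per-queue flow table via /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt. A value of 32768 for rps_sock_flow_entries works fairly well on a moderately loaded server.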

Recommended settings

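Both values must be set before RFS is enabled for a receive queue, and both are rounded up to the nearest power of two. Size rps_sock_flow_entries for the expected number of active connections at any given time, which may be significantly lower than the number of open connections. For a single-queue device, set rps_flow_cnt to the same value as rps_sock_flow_entries; for a device with N queues, set each queue's rps_flow_cnt to rps_sock_flow_entries / N (for example, 32768 entries across 16 queues gives 2048 per queue).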
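The division among queues can be sketched as a helper that prints the corresponding commands. The names round_pow2, rfs_cmds, and eth0 are placeholders, and the sketch assumes the per-queue count should also be a power of two so that the printed values match what takes effect.

```shell
# round_pow2 N: smallest power of two that is >= N
round_pow2() (
    p=1
    while [ "$p" -lt "$1" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
)

# rfs_cmds DEV ENTRIES NQUEUES
# Print the commands that size the global table and each per-queue table.
rfs_cmds() (
    dev=$1 entries=$(round_pow2 "$2") nqueues=$3
    per_queue=$(round_pow2 $(( entries / nqueues )))
    echo "echo $entries > /proc/sys/net/core/rps_sock_flow_entries"
    q=0
    while [ "$q" -lt "$nqueues" ]; do
        echo "echo $per_queue > /sys/class/net/$dev/queues/rx-$q/rps_flow_cnt"
        q=$(( q + 1 ))
    done
)

rfs_cmds eth0 32768 16   # per-queue value is 2048
```

The output is one line for the global table followed by one line per receive queue; review it before running the commands as root.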

RFS flow tables are defined as:
<code>struct rps_sock_flow_table {
    u32 mask;
    u32 ents[0];
};</code>

Accelerated RFS (aRFS)

aRFS is to RFS what RSS is to RPS: it offloads the flow-steering decision to NIC hardware that supports programmable n-tuple filters. When set_rps_cpu() moves a flow to a new CPU, the stack calls the driver's ndo_rx_flow_steer() to program a hardware filter, so that matching packets arrive on a receive queue whose interrupt is handled by the target CPU.

aRFS configuration

aRFS requires a kernel built with CONFIG_RFS_ACCEL, a NIC that supports n-tuple filters, and a driver that implements ndo_rx_flow_steer(). Enable n-tuple filtering with ethtool -K eth0 ntuple on. The driver periodically calls rps_may_expire_flow() to find and remove stale filters.
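Before relying on aRFS it helps to confirm that n-tuple filtering is actually on. The sketch below parses the feature list printed by ethtool -k, assuming the feature appears on a line beginning with ntuple-filters: as current ethtool versions print it; ntuple_state is a hypothetical helper.

```shell
# ntuple_state
# Read "ethtool -k <dev>" output on stdin and print the state of the
# ntuple-filters feature ("on", "off", or "unknown" if not listed).
ntuple_state() (
    awk '
        $1 == "ntuple-filters:" { state = $2 }
        END { print (state == "" ? "unknown" : state) }
    '
)

# Sample lines in the format ethtool prints; on a real system use:
#   ethtool -k eth0 | ntuple_state
printf 'rx-checksumming: on\nntuple-filters: off\n' | ntuple_state   # prints off
```

If the state is off and the line is marked [fixed], the driver does not allow the feature to be changed.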

Recommended settings

Enable aRFS whenever RFS is in use and both the NIC and the driver support hardware acceleration.

XPS: Transmit Packet Steering

XPS selects the transmit queue for outgoing packets either by mapping CPUs to queues or by mapping receive queues to transmit queues. This reduces lock contention and cache misses on multi‑queue devices.

XPS configuration

When the kernel is built with CONFIG_XPS, each transmit queue exposes two bitmaps in sysfs: /sys/class/net/<dev>/queues/tx-<n>/xps_cpus for CPU-based mapping and /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs for receive-queue-based mapping.

Recommended settings

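XPS only has an effect on devices with multiple transmit queues. Ideally each CPU maps to exactly one queue; if there are as many queues as CPUs, each queue can also map to exactly one CPU, giving exclusive pairings with no contention. If there are fewer queues than CPUs, the best CPUs to share a queue are probably those that share a cache with the CPU that handles that queue's transmit completions.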
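For the case where queues and CPUs can be matched one to one, the mapping can be generated rather than typed by hand. A sketch that prints the commands, assuming CPU n should own transmit queue tx-n and that there are at most 32 queues; xps_pair_cmds and eth0 are placeholder names.

```shell
# xps_pair_cmds DEV NQUEUES
# Print commands that map CPU n to transmit queue tx-n, one to one.
# Sketch only: assumes NQUEUES <= 32 and CPUs numbered from 0.
xps_pair_cmds() (
    dev=$1 nqueues=$2
    q=0
    while [ "$q" -lt "$nqueues" ]; do
        mask=$(printf '%x' $(( 1 << q )))
        echo "echo $mask > /sys/class/net/$dev/queues/tx-$q/xps_cpus"
        q=$(( q + 1 ))
    done
)

xps_pair_cmds eth0 4
# echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
# echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus
# echo 4 > /sys/class/net/eth0/queues/tx-2/xps_cpus
# echo 8 > /sys/class/net/eth0/queues/tx-3/xps_cpus
```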

Additional notes

RPS, RFS, and aRFS were introduced in kernel 2.6.35, and XPS in 2.6.38. Useful knobs: /proc/irq/${irq_num}/smp_affinity sets hard-IRQ affinity, /sys/class/net/${net_dev}/queues/rx-0/rps_cpus selects the RPS CPUs for the first receive queue, and watch -n 1 'cat /proc/softirqs | grep NET_RX' monitors receive soft-IRQs per CPU.
