Boost Linux Network Performance: Master RSS, RPS, RFS, and XPS Techniques
This article explains five Linux network-stack enhancements (RSS, RPS, RFS, accelerated RFS, and XPS), covering their purpose, configuration steps, and recommended settings for improving parallelism and latency on multi-CPU systems.
Introduction
This article describes a set of supplemental techniques in the Linux network stack that increase parallelism and performance on multiprocessor systems.
The techniques covered are:
RSS: Receive Side Scaling
RPS: Receive Packet Steering
RFS: Receive Flow Steering
Accelerated Receive Flow Steering
XPS: Transmit Packet Steering
RSS: Receive Side Scaling
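Modern NICs support multiple receive and transmit descriptor queues. RSS distributes incoming packets across the receive queues with a filter that applies a hash function to each packet's network and/or transport headers, so every packet of a flow lands on the same queue and therefore on the same CPU. The hash is typically computed over a 4-tuple (source and destination IP addresses and TCP ports), usually with the Toeplitz algorithm. In the most common hardware implementation, the low seven bits of that hash index a 128-entry indirection table, and the entry read from the table is the receive queue number.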
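To make the hash-then-table lookup concrete, here is a minimal Python sketch, not any driver's code. It assumes a Toeplitz hash with the 40-byte key from Microsoft's RSS documentation (NICs often use their own key), hashes the IPv4 4-tuple in the order source address, destination address, source port, destination port, and models the driver's default table as entry i mapping to queue i % num_queues.

```python
import socket
import struct

# 40-byte key from Microsoft's RSS verification example; real NICs may use another key.
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0d0ca2bcb"
    "ae7b30b477cb2da38030f20c6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash: XOR in the 32-bit key window at every set input bit."""
    nbits = len(data) * 8
    key_bits = len(key) * 8
    if nbits > key_bits - 31:
        raise ValueError("key is too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(nbits):
        if data[i // 8] & (0x80 >> (i % 8)):  # input bits are taken MSB first
            # Key bits i .. i+31, read MSB first.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def ipv4_tcp_tuple(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Hash input: source address, destination address, source port, destination port."""
    return struct.pack(
        "!4s4sHH", socket.inet_aton(src_ip), socket.inet_aton(dst_ip), src_port, dst_port
    )


def default_indirection_table(num_queues: int, size: int = 128) -> list[int]:
    """Spread queues evenly over the table: entry i -> queue i % num_queues."""
    return [i % num_queues for i in range(size)]


def rss_queue(flow_hash: int, table: list[int]) -> int:
    """Index the table with the low bits of the hash (7 bits for 128 entries)."""
    return table[flow_hash & (len(table) - 1)]


h = toeplitz_hash(RSS_KEY, ipv4_tcp_tuple("66.9.149.187", "161.142.100.80", 2794, 1766))
print(f"hash=0x{h:08x} -> queue {rss_queue(h, default_indirection_table(6))}")
```

Because the hash depends only on the 4-tuple, every packet of one TCP connection maps to the same table slot and queue; rewriting table entries moves slots between queues without changing any hash.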
Some advanced NICs can also be programmed with n-tuple filters (ethtool --config-ntuple) that steer specific traffic, for example TCP packets to port 80, to a chosen queue.
RSS configuration
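Drivers for multi-queue NICs usually expose a kernel module parameter for the number of hardware queues (in the bnx2x driver, for instance, it is called num_queues). A typical configuration assigns one receive queue per CPU if the device supports enough queues, or otherwise at least one per memory domain, i.e. a set of CPUs that share a cache level or NUMA node.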
The driver programs the indirection table at initialization, by default spreading the queues evenly across its entries. The table can be inspected and modified at runtime with ethtool --show-rxfh-indir (-x) and --set-rxfh-indir (-X).
RSS IRQ configuration
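Each receive queue has its own IRQ, which the NIC raises to notify a CPU that packets have arrived on that queue. PCIe devices signal through MSI-X, which can route each interrupt to a particular CPU; the current queue-to-IRQ mapping is visible in /proc/interrupts. By default an IRQ may be handled on any CPU, and because a significant part of packet processing happens in receive-interrupt handling, it pays to spread these interrupts across CPUs. Affinity can be set manually through /proc/irq/<irq>/smp_affinity, but many systems run irqbalance, a daemon that reassigns IRQs dynamically and may override manual settings.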
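Pinning each queue IRQ to its own CPU can be scripted. The Python sketch below is a hypothetical helper, not a standard tool: it assigns IRQs round-robin to a list of CPUs and produces the values to write to /proc/irq/<irq>/smp_affinity_list, the decimal CPU-list counterpart of smp_affinity. Writing requires root, and irqbalance, if running, may undo the change.

```python
def plan_irq_affinity(irqs: list[int], cpus: list[int]) -> dict[str, str]:
    """Map each queue IRQ to one CPU, cycling through `cpus` (hypothetical helper)."""
    if not cpus:
        raise ValueError("need at least one CPU")
    return {
        f"/proc/irq/{irq}/smp_affinity_list": str(cpus[i % len(cpus)])
        for i, irq in enumerate(irqs)
    }


def apply_irq_affinity(plan: dict[str, str]) -> None:
    """Write the plan; needs root, and irqbalance may later override it."""
    for path, cpu_list in plan.items():
        with open(path, "w") as f:
            f.write(cpu_list)
```

For the six p1p1 queue IRQs in /proc/interrupts (89 onward), plan_irq_affinity([89, 90, 91, 92, 93, 94], [0, 1, 2, 3, 4, 5]) pins IRQ 89 to CPU 0, IRQ 90 to CPU 1, and so on; inspect the plan before calling apply_irq_affinity().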
Recommended settings
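Enable RSS when latency is a concern or whenever receive-interrupt processing becomes a bottleneck. For low-latency networking, allocate as many queues as there are CPUs (or the NIC maximum, if lower). For high-rate traffic, the most efficient configuration is likely the one with the fewest receive queues where no queue overflows because its CPU is saturated: with interrupt coalescing enabled, the total number of interrupts, and thus the work, grows with each additional queue.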
Use mpstat to observe per-CPU load, keeping in mind that each hyperthread appears as a separate CPU. Hyper-threading has shown no benefit for interrupt handling, so limit the number of queues to the number of physical cores.
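One way to find the physical core count is to read each CPU's topology/thread_siblings_list in sysfs: hyperthreads of the same core report identical sibling lists, so the number of distinct lists is the number of cores. The Python sketch below assumes that sysfs layout; capping the result at the NIC's maximum queue count is an added assumption.

```python
import glob


def read_sibling_lists(root: str = "/sys/devices/system/cpu") -> list[str]:
    """Read topology/thread_siblings_list for every CPU directory (Linux sysfs)."""
    lists = []
    for path in sorted(glob.glob(f"{root}/cpu[0-9]*/topology/thread_siblings_list")):
        with open(path) as f:
            lists.append(f.read())
    return lists


def count_physical_cores(sibling_lists: list[str]) -> int:
    """Hyperthreads of one core share the same sibling list, e.g. '0,4' or '0-1'."""
    return len({s.strip() for s in sibling_lists if s.strip()})


def suggested_rx_queues(sibling_lists: list[str], nic_max_queues: int) -> int:
    """One receive queue per physical core, capped at what the NIC supports."""
    return min(count_physical_cores(sibling_lists), nic_max_queues)
```

On a live system, suggested_rx_queues(read_sibling_lists(), nic_max_queues=N) gives a starting point, where N comes from the "Pre-set maximums" section of ethtool -l.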
RSS is a NIC feature that relies on hardware queues. To check whether an interface supports it, look for multiple interrupt request queues in /proc/interrupts. For example, interface p1p1 with six receive queues:

```
# egrep 'CPU|p1p1' /proc/interrupts
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
 89:  40187      0      0      0      0      0  IR-PCI-MSI-edge  p1p1-0
 90:      0    790      0      0      0      0  IR-PCI-MSI-edge  p1p1-1
 ...
```

Queue information is also listed under /sys/class/net/<dev>/queues; ethtool -l shows the channel (queue) counts and ethtool -L changes them, although many cloud environments restrict these operations.
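Checking which CPU services each queue can be automated by parsing /proc/interrupts. This Python sketch assumes the layout in the p1p1 example: a header line of CPU names, then one line per IRQ whose last field is the queue name (<dev>-<n> here; naming varies by driver).

```python
def parse_queue_interrupts(text: str, dev: str) -> dict[str, tuple[int, list[int]]]:
    """Return {queue_name: (irq, per-CPU counts)} for IRQ lines named '<dev>-...'."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())  # header line: CPU0 CPU1 ...
    queues = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < ncpus + 2 or not fields[-1].startswith(dev + "-"):
            continue
        irq = int(fields[0].rstrip(":"))
        counts = [int(x) for x in fields[1:1 + ncpus]]
        queues[fields[-1]] = (irq, counts)
    return queues


def busiest_cpu(counts: list[int]) -> int:
    """Index of the CPU that has handled the most interrupts for one queue."""
    return max(range(len(counts)), key=counts.__getitem__)


SAMPLE = """\
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
 89:  40187      0      0      0      0      0  IR-PCI-MSI-edge  p1p1-0
 90:      0    790      0      0      0      0  IR-PCI-MSI-edge  p1p1-1
"""

for name, (irq, counts) in parse_queue_interrupts(SAMPLE, "p1p1").items():
    print(f"{name}: IRQ {irq} mostly on CPU{busiest_cpu(counts)}")
```

On a real system, pass the contents of /proc/interrupts instead of SAMPLE; lines for other devices and summary rows such as NMI are skipped.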
RPS: Receive Packet Steering
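RPS is, in some ways, a software implementation of RSS. Because it runs later in the datapath, it does not choose which CPU services the hardware interrupt; instead it chooses the CPU that performs protocol processing above the interrupt handler. When the driver passes a packet up with netif_rx() or netif_receive_skb() in the receive bottom half, get_rps_cpu() selects a target CPU, the packet is placed on that CPU's backlog queue, and the CPU is woken to process it. Compared with RSS, RPS works with any NIC, makes it easy to add software filters that hash new protocols, and does not raise the hardware interrupt rate, although it does introduce inter-processor interrupts (IPIs).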
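To select the target CPU, RPS computes a flow hash over the packet's addresses or ports (a 2-tuple or 4-tuple, depending on the protocol). The hash either comes from the NIC, usually the same value it computed for RSS, or is computed in software; either way it is stored in skb->hash. The hash then indexes the receive queue's list of RPS CPUs, so all packets of a flow are processed on the same CPU.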
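The hash-to-CPU step can be sketched as follows. This Python model assumes the queue's RPS map is a list of CPU numbers and uses a port of the kernel's reciprocal_scale(), which recent kernels use in get_rps_cpu() to turn the 32-bit hash into an index without a division. It models only the selection, not the backlog queues or IPIs.

```python
from typing import Optional


def reciprocal_scale(val: int, ep_ro: int) -> int:
    """Scale a 32-bit value into [0, ep_ro): ((u64)val * ep_ro) >> 32."""
    return ((val & 0xFFFFFFFF) * ep_ro) >> 32


def rps_target_cpu(flow_hash: int, rps_cpus: list[int]) -> Optional[int]:
    """CPU that should process this flow, or None when RPS is off for the queue."""
    if not rps_cpus:
        return None  # empty map (rps_cpus == 0): stay on the interrupting CPU
    return rps_cpus[reciprocal_scale(flow_hash, len(rps_cpus))]


# Same hash, same CPU: every packet of a flow is processed in one place.
print(rps_target_cpu(0x9E3779B9, [2, 3, 6, 7]))
```

Scaling by multiplication spreads hashes over the list in proportion to their value, so the map does not need a power-of-two length.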
RPS configuration
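RPS requires a kernel built with CONFIG_RPS (on by default for SMP), and even then it stays disabled until configured. Set the CPUs allowed to process each receive queue by writing a CPU bitmap to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. A value of 0 disables RPS for that queue.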
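rps_cpus expects a hexadecimal CPU bitmap, which the kernel prints as comma-separated 32-bit groups on systems with more than 32 CPUs; /proc/irq/<irq>/smp_affinity uses the same format. The Python helpers below, a sketch with hypothetical names, convert between CPU lists and that format and write the result for one queue (root required).

```python
def cpus_to_mask(cpus) -> str:
    """CPU numbers -> hex bitmap, grouped in 32-bit words separated by commas."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    words = []
    while True:
        words.append(bits & 0xFFFFFFFF)
        bits >>= 32
        if not bits:
            break
    words.reverse()  # most significant word first
    return ",".join([f"{words[0]:x}"] + [f"{w:08x}" for w in words[1:]])


def mask_to_cpus(mask: str) -> list[int]:
    """Inverse of cpus_to_mask; also parses values read back from sysfs."""
    bits = int(mask.strip().replace(",", ""), 16)
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def set_rps_cpus(dev: str, rx_queue: int, cpus) -> None:
    """Write the bitmap for one receive queue (requires root)."""
    path = f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"
    with open(path, "w") as f:
        f.write(cpus_to_mask(cpus))
```

For example, set_rps_cpus("eth0", 0, range(4)) writes f, allowing CPUs 0 to 3 to process rx-0; writing an empty CPU list writes 0 and turns RPS off again.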
Recommended settings
For a single-queue device, set rps_cpus to the CPUs in the same memory domain as the interrupting CPU, or to all CPUs if NUMA locality is not a concern. On a multi-queue system where RSS already maps a hardware queue to each CPU, RPS is probably redundant. If there are fewer hardware queues than CPUs, RPS can help when each queue's rps_cpus contains the CPUs that share a memory domain with that queue's interrupting CPU.
Note that rps_cpus takes a hexadecimal bitmap rather than a CPU list. Its default value is 0, in which case each packet is processed on the CPU that handled its interrupt.
RFS: Receive Flow Steering
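RFS extends RPS by steering each packet to the CPU where the application thread consuming it is running, which improves data-cache hit rates. It reuses the RPS flow hash, but instead of mapping the hash directly to a CPU it uses it to index a global flow table (rps_sock_flow_table). That table records, for each flow, the desired CPU: the one on which the application last accessed the socket through calls such as recvmsg() or sendmsg().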
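Moving a flow to a new CPU while packets are still queued on the old one could deliver them out of order, so RFS also keeps a per-queue table (rps_dev_flow_table) that stores each flow's current CPU and the old CPU's backlog tail position at the time a packet of the flow was last enqueued. If the desired CPU matches the current CPU, the packet is queued there. If they differ, the flow moves to the desired CPU only when the old CPU has already processed every packet previously enqueued for the flow, or when the current CPU is unset or offline; otherwise the packet stays on the current CPU. If the global table holds no valid CPU for the flow, the packet is steered with plain RPS.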
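The migration rule can be written out as a small Python model. It is a simplification under stated assumptions, not kernel code: DevFlow stands in for one rps_dev_flow_table entry, queue_head maps each CPU to how many packets its backlog has already dequeued, and last_qtail is the backlog position recorded when this flow last enqueued a packet on its current CPU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DevFlow:
    """Simplified rps_dev_flow entry for one hash slot of one receive queue."""
    cpu: Optional[int] = None  # current CPU; None means unset
    last_qtail: int = 0        # backlog tail on `cpu` when this flow last enqueued


def rfs_select_cpu(
    desired_cpu: Optional[int],
    flow: DevFlow,
    queue_head: dict[int, int],
    online_cpus: set[int],
) -> Optional[int]:
    """Return the CPU for the next packet; None means fall back to plain RPS."""
    if desired_cpu is None:
        return None  # global table has no valid CPU for this flow
    if flow.cpu != desired_cpu:
        old = flow.cpu
        if (
            old is None
            or old not in online_cpus
            or queue_head[old] >= flow.last_qtail  # old backlog fully drained
        ):
            flow.cpu = desired_cpu
    return flow.cpu


flow = DevFlow(cpu=1, last_qtail=10)
print(rfs_select_cpu(3, flow, {1: 7}, {0, 1, 2, 3}))   # packets still queued on CPU 1
print(rfs_select_cpu(3, flow, {1: 10}, {0, 1, 2, 3}))  # drained, so the flow moves
```

A fuller model would also update last_qtail each time a packet of the flow is enqueued; it is left out to keep the decision itself visible.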
RFS configuration
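RFS also depends on CONFIG_RPS (on by default for SMP) and stays disabled until configured. Set the size of the global flow table through /proc/sys/net/core/rps_sock_flow_entries and the size of each receive queue's flow table through /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt. Both must be set before RFS takes effect on a queue, and both values are rounded up to the nearest power of two. A value of 32768 for rps_sock_flow_entries works well on a moderately loaded server.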
Recommended settings
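Size rps_sock_flow_entries to the expected number of connections active at any given time, which can be much smaller than the number of open connections. For a single-queue device, set rps_flow_cnt to the same value as rps_sock_flow_entries; for a multi-queue device, set each queue's rps_flow_cnt to rps_sock_flow_entries / N, where N is the number of queues. For example, 32768 entries spread over 16 receive queues gives an rps_flow_cnt of 2048 per queue.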
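The sizing arithmetic, including the kernel's rounding up to powers of two, fits in a few lines of Python. rfs_table_sizes() and apply_rfs() are hypothetical helpers, and expected_active_flows is your own estimate of concurrently active flows.

```python
def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two >= n (the kernel rounds both settings this way)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def rfs_table_sizes(expected_active_flows: int, num_rx_queues: int) -> tuple[int, int]:
    """Return (rps_sock_flow_entries, rps_flow_cnt per receive queue)."""
    entries = roundup_pow_of_two(expected_active_flows)
    per_queue = roundup_pow_of_two(max(1, entries // num_rx_queues))
    return entries, per_queue


def apply_rfs(dev: str, num_rx_queues: int, entries: int, per_queue: int) -> None:
    """Write both settings (requires root); RFS needs both before it takes effect."""
    with open("/proc/sys/net/core/rps_sock_flow_entries", "w") as f:
        f.write(str(entries))
    for n in range(num_rx_queues):
        with open(f"/sys/class/net/{dev}/queues/rx-{n}/rps_flow_cnt", "w") as f:
            f.write(str(per_queue))
```

For 32768 flows over 16 queues, rfs_table_sizes(32768, 16) returns (32768, 2048). With a queue count that does not divide evenly, rounding up makes the per-queue tables slightly larger than an exact split.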
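The global flow table is defined in the kernel as follows (simplified):

```c
/* Global RFS table; mask is the table size minus one (size is a power of two). */
struct rps_sock_flow_table {
	u32 mask;
	u32 ents[]; /* desired CPU per flow, indexed by (hash & mask) */
};
```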
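A Python model of how the table is used may help. It follows recent kernels, where each entry stores the CPU number in its low bits and the upper bits of the flow hash above them, so a lookup can tell when a different flow has collided in the same slot; earlier kernels stored only the CPU number. The class and method names are illustrative, not the kernel's.

```python
RPS_NO_CPU = 0xFFFFFFFF  # value of an empty slot


class SockFlowTable:
    """Illustrative model of rps_sock_flow_table."""

    def __init__(self, entries: int, nr_cpu_ids: int):
        if entries < 1 or nr_cpu_ids < 1:
            raise ValueError("entries and nr_cpu_ids must be positive")
        size = 1 << (entries - 1).bit_length()  # rounded up to a power of two
        self.mask = size - 1
        # Low bits wide enough for any CPU id: roundup_pow_of_two(nr_cpu_ids) - 1.
        self.cpu_mask = (1 << (nr_cpu_ids - 1).bit_length()) - 1
        self.ents = [RPS_NO_CPU] * size

    def record(self, flow_hash: int, cpu: int) -> None:
        """Socket path (e.g. recvmsg): note the CPU the application runs on."""
        upper = flow_hash & ~self.cpu_mask & 0xFFFFFFFF
        self.ents[flow_hash & self.mask] = upper | cpu

    def desired_cpu(self, flow_hash: int):
        """Receive path: CPU recorded for this flow, or None on a miss or collision."""
        ident = self.ents[flow_hash & self.mask]
        if (ident ^ flow_hash) & ~self.cpu_mask & 0xFFFFFFFF:
            return None
        return ident & self.cpu_mask
```

Because the index is hash & mask, the table size must be a power of two, which is why the kernel rounds rps_sock_flow_entries up.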
Accelerated RFS (aRFS)
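Accelerated RFS is to RFS what RSS is to RPS: it moves flow steering into NIC hardware that supports programmable n-tuple filters. Whenever a flow's entry in rps_dev_flow_table is updated, the stack (in set_rps_cpu()) calls the driver's ndo_rx_flow_steer() with the desired hardware queue, and the driver programs a filter that delivers matching packets directly to that queue. The queue is looked up from a CPU-to-queue reverse map that the driver maintains from the IRQ affinity of each receive queue, choosing for each CPU the queue whose interrupt CPU is closest in cache locality.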
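The CPU-to-queue reverse map can be pictured with the Python sketch below. It is a simplification: the kernel's cpu_rmap library ranks queues by CPU topology, whereas here distance() is a hypothetical function you supply (for example 0 for the same CPU, 1 for the same NUMA node, 2 otherwise), and ties go to the lower queue number.

```python
from typing import Callable


def build_cpu_to_queue(
    queue_irq_cpu: dict[int, int],
    cpus: list[int],
    distance: Callable[[int, int], int],
) -> dict[int, int]:
    """For each CPU, pick the RX queue whose IRQ CPU is closest to it."""
    return {
        cpu: min(queue_irq_cpu, key=lambda q: (distance(cpu, queue_irq_cpu[q]), q))
        for cpu in cpus
    }


# Hypothetical 4-CPU, 2-node machine: queue 0's IRQ on CPU 0, queue 1's on CPU 2.
NODE = {0: 0, 1: 0, 2: 1, 3: 1}


def numa_distance(a: int, b: int) -> int:
    if a == b:
        return 0
    return 1 if NODE[a] == NODE[b] else 2


rmap = build_cpu_to_queue({0: 0, 1: 2}, [0, 1, 2, 3], numa_distance)
# A flow whose application runs on CPU 3 would get a hardware filter for queue rmap[3].
print(rmap)
```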
aRFS configuration
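aRFS requires a kernel built with CONFIG_RFS_ACCEL, NIC support for n-tuple filters, and a driver that implements ndo_rx_flow_steer(). Enable n-tuple filtering with ethtool -K eth0 ntuple on. The CPU-to-queue map is derived automatically from the IRQ affinities of the receive queues, so no further configuration should be needed. Drivers periodically call rps_may_expire_flow() to decide whether stale hardware filters can be removed.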
Recommended settings
Enable aRFS whenever RFS is in use and the NIC and driver support the hardware acceleration.
XPS: Transmit Packet Steering
XPS selects the transmit queue for outgoing packets either by mapping CPUs to queues or by mapping receive queues to transmit queues. This reduces lock contention and cache misses on multi‑queue devices.
XPS configuration
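XPS requires a kernel built with CONFIG_XPS (on by default for SMP); whether XPS is configured at device initialization depends on the driver. The mappings can be inspected and changed through sysfs: /sys/class/net/<dev>/queues/tx-<n>/xps_cpus for CPU-based selection and /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs for receive-queue-based selection. Each file holds a bitmap of the CPUs or receive queues allowed to transmit on that queue.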
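The kernel keeps the reverse of these per-queue bitmaps (CPU to transmit queues) and consults it when transmitting the first packet of a flow. The Python sketch below models that lookup under simplifying assumptions: bitmaps are given as sets of CPU numbers, a CPU that maps to one queue uses it, a CPU that maps to several picks one with the flow hash (scaled like the kernel's reciprocal_scale()), and a CPU with no entry returns a caller-supplied fallback standing in for the stack's ordinary hash-based queue selection.

```python
def xps_reverse_map(xps_cpus: dict[int, set[int]]) -> dict[int, list[int]]:
    """Invert {tx queue: CPUs allowed to use it} into {CPU: tx queues}."""
    reverse: dict[int, list[int]] = {}
    for txq in sorted(xps_cpus):
        for cpu in sorted(xps_cpus[txq]):
            reverse.setdefault(cpu, []).append(txq)
    return reverse


def xps_select_queue(
    cpu: int, flow_hash: int, reverse: dict[int, list[int]], fallback: int
) -> int:
    """Pick the transmit queue for the first packet of a flow sent from `cpu`."""
    queues = reverse.get(cpu)
    if not queues:
        return fallback  # no XPS entry for this CPU
    if len(queues) == 1:
        return queues[0]
    # Several candidates: spread flows by hash, like reciprocal_scale().
    return queues[((flow_hash & 0xFFFFFFFF) * len(queues)) >> 32]


# tx-0 serves CPUs 0-1; tx-1 and tx-2 are both allowed for CPU 2.
rev = xps_reverse_map({0: {0, 1}, 1: {2}, 2: {2}})
print(xps_select_queue(2, 0x9E3779B9, rev, fallback=0))
```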
Recommended settings
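XPS has no effect on a device with a single transmit queue, since there is nothing to choose. On multi-queue devices, map each CPU to one queue. If there are as many queues as CPUs, pair them one-to-one so that no queue is shared; if there are fewer queues, let each queue be shared by CPUs that share a cache with the CPU handling that queue's transmit-completion interrupts.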
Additional notes
RPS, RFS, and accelerated RFS were introduced in kernel 2.6.35, and XPS in 2.6.38. Useful commands: set hard-IRQ affinity through /proc/irq/${irq_num}/smp_affinity, choose RPS CPUs through /sys/class/net/${net_dev}/queues/rx-0/rps_cpus, and monitor NET_RX soft-IRQ counts with watch -n 1 'cat /proc/softirqs |grep NET_RX'.
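watch shows raw cumulative counters; turning them into per-CPU NET_RX rates shows which CPUs are doing receive processing. The Python sketch below parses two snapshots of /proc/softirqs (a header line of CPU names, then one line per softirq type) and divides the difference by the sampling interval; sample_net_rx_rate() is a hypothetical convenience wrapper that reads the live file on Linux.

```python
import time


def net_rx_counts(softirqs_text: str) -> list[int]:
    """Per-CPU cumulative NET_RX counts from the text of /proc/softirqs."""
    for line in softirqs_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NET_RX:":
            return [int(x) for x in fields[1:]]
    raise ValueError("no NET_RX line found")


def net_rx_rate(before: str, after: str, interval_s: float) -> list[float]:
    """NET_RX soft-IRQs per second for each CPU between two snapshots."""
    return [
        (b - a) / interval_s
        for a, b in zip(net_rx_counts(before), net_rx_counts(after))
    ]


def sample_net_rx_rate(interval_s: float = 1.0) -> list[float]:
    """Take two live snapshots interval_s apart (Linux only)."""
    with open("/proc/softirqs") as f:
        first = f.read()
    time.sleep(interval_s)
    with open("/proc/softirqs") as f:
        second = f.read()
    return net_rx_rate(first, second, interval_s)
```

If RPS or RFS is working, NET_RX activity should appear on the CPUs listed in rps_cpus rather than only on the CPUs that take the NIC's hardware interrupts.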
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
MaGe Linux Operations
