How NUMA‑Aware MPTCP Flow Selection Boosts Throughput and Cuts Latency
At Netdev 0x19, ByteDance's STE team presented two talks on MPTCP for data-center networking. The first covered a NUMA-locality-aware MPTCP flow-selection strategy that raises throughput by up to 30% and lowers tail latency by 6%; the second covered a DPDK-based user-space MPTCP stack that cuts latency by nearly 10% and more than doubles throughput.
Netdev 0x19 Overview
Netdev 0x19, a premier Linux networking conference, was held on March 10 in Croatia, gathering experts, researchers, and industry representatives to discuss cutting‑edge developments and future trends in network technology.
Talk 1: NUMA‑Aware MPTCP Flow Selection Optimization
The ByteDance STE team presented a new MPTCP sub‑flow selection strategy that dynamically prefers network interfaces located on the same socket as the receiving application process. Traditional MPTCP path selection only considers TCP‑level metrics and ignores the additional latency introduced by application‑level system calls.
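The locality test itself is simple to sketch: on Linux, each physical NIC exposes the NUMA node of its PCIe device under `/sys/class/net/<ifname>/device/numa_node`, so a scheduler can rank candidate interfaces so that ones local to the application's node come first. A minimal illustration, not the team's implementation (`nic_numa_node` and `order_by_locality` are hypothetical helper names):

```python
import os

def nic_numa_node(ifname: str) -> int:
    """NUMA node of a NIC's PCIe device, or -1 if unknown (e.g. virtual NICs)."""
    try:
        with open(f"/sys/class/net/{ifname}/device/numa_node") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return -1

def order_by_locality(if_nodes: dict, app_node: int) -> list:
    """Rank candidate interfaces: NUMA-local ones first, remote ones after.

    if_nodes maps interface name -> NUMA node (as read via nic_numa_node);
    app_node is the node the receiving process runs on.
    """
    return sorted(if_nodes, key=lambda ifn: if_nodes[ifn] != app_node)
```

In a real scheduler this ranking would be one input among the TCP-level metrics MPTCP already tracks, not a replacement for them.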
By incorporating end-to-end metrics, the proposed method improves both application throughput and latency. Experimental results with a Redis benchmark show up to 30% higher throughput and 6% lower tail latency than the default MPTCP configuration. The approach also reduces cross-NUMA traffic, balances load across NICs, and eases contention for memory bandwidth and I/O.
Talk 2: User‑Space DPDK‑Based MPTCP Stack for Data Centers
The team demonstrated an innovative user‑space MPTCP implementation built on DPDK, targeting storage and high‑performance computing workloads in data centers. The stack follows RFC 8684, interoperates with kernel MPTCP, and automatically falls back to standard TCP when MPTCP negotiation fails, facilitating seamless migration of existing TCP applications.
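The same create-MPTCP-or-fall-back pattern exists for the kernel stack, which makes for a compact illustration (this uses Linux kernel MPTCP via `IPPROTO_MPTCP`, not the team's DPDK stack; the in-band fallback RFC 8684 specifies, when a peer declines MPTCP during the handshake, happens transparently inside the stack and is not shown here):

```python
import socket

# IPPROTO_MPTCP is exposed by Python 3.10+; the raw value 262 works on
# Linux >= 5.6 either way.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)

def mptcp_socket() -> socket.socket:
    """Open an MPTCP socket; fall back to plain TCP if the kernel lacks MPTCP."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
    except OSError:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM)
```

Either way the application gets an ordinary stream socket, which is what makes migrating existing TCP code straightforward.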
The stack consists of two main modules:
Sub-flow Management: Handles creation, destruction, and address notification of sub-flows, leveraging NIC flow bifurcation and DPDK poll-mode drivers to achieve lock-free forwarding while fully exploiting multi-core processing.
Sub-flow Scheduling: Implements various scheduling policies to meet diverse performance requirements.
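The talk does not enumerate the policies, but a common one in MPTCP schedulers is min-RTT: among sub-flows that still have congestion-window space, send on the one with the lowest smoothed RTT. A minimal sketch under assumed names (`Subflow` and `minrtt_pick` are illustrative, not the stack's API):

```python
from dataclasses import dataclass

@dataclass
class Subflow:
    name: str
    srtt_us: int      # smoothed RTT in microseconds
    cwnd_avail: int   # free congestion-window space in bytes

def minrtt_pick(subflows):
    """Min-RTT policy: pick the lowest-RTT sub-flow that can still accept data.

    Returns None when every sub-flow's congestion window is full, in which
    case the sender must wait for ACKs to free window space.
    """
    ready = [sf for sf in subflows if sf.cwnd_avail > 0]
    return min(ready, key=lambda sf: sf.srtt_us) if ready else None
```

Other policies plug in at the same decision point, e.g. round-robin for bandwidth aggregation or redundant transmission for tail-latency-sensitive traffic.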
A zero-copy interface further eliminates copy overhead between the application and the stack, raising throughput and reducing latency.
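The talk does not detail this interface, but the idea has a kernel-socket analogue: `recv_into()` reads into a caller-owned buffer instead of allocating a fresh bytes object per receive, so the application processes data in place. The DPDK stack goes further by handing the application references to packet buffers directly; the sketch below only illustrates the in-place pattern:

```python
import socket

def recv_in_place(sock: socket.socket, buf: bytearray) -> memoryview:
    """Receive into a caller-owned buffer, avoiding a per-receive allocation.

    Returns a zero-copy memoryview over the bytes just received, which the
    application can parse or slice without further copying.
    """
    n = sock.recv_into(buf)
    return memoryview(buf)[:n]
```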
Preliminary performance tests inside a data‑center environment show that, compared with a user‑space TCP stack, the user‑space MPTCP stack achieves nearly 10% lower latency and over 100% higher throughput for typical packet sizes (~1000 bytes), with significant tail‑latency improvements under loss conditions.
About the STE Team
The System Technologies & Engineering (STE) team at ByteDance focuses on operating‑system kernels, virtualization, foundational system libraries, large‑scale data‑center reliability, and co‑design of new hardware and software, actively contributing to open‑source communities.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.