Optimizing MPTCP Flow Selection and Exploring a User‑Space MPTCP Stack – ByteDance STE at Netdev 0x19
At Netdev 0x19, ByteDance's STE team presented two technical talks: a NUMA‑aware MPTCP flow‑selection strategy that boosts Redis benchmark throughput by up to 30% and cuts tail latency by 6%, and a DPDK‑based user‑space MPTCP stack that, versus a user‑space TCP implementation, reduces latency by roughly 10% and more than doubles throughput in data‑center tests.
Netdev 0x19, a Linux networking conference in Croatia, featured two talks by ByteDance's STE team focusing on Multipath TCP (MPTCP) innovations.
Topic 1 – NUMA‑locality‑aware MPTCP flow selection: Modern multi‑path servers are often equipped with one NIC per CPU socket, allowing an application to receive data on the NIC attached to the same socket it runs on for better performance. Existing MPTCP path‑selection algorithms consider only TCP‑level metrics and ignore the additional latency introduced by application‑level system calls. The team proposes a new sub‑flow selection strategy that dynamically prefers the NIC on the same NUMA socket as the receiving process, incorporating end‑to‑end latency and throughput metrics. Benefits include higher application throughput, reduced tail latency, better overall bandwidth utilization, and lower memory/I/O access latency.
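For illustration only (the talk did not publish code), below is a minimal C sketch of how such locality‑aware selection might be scored; the struct fields, the remote‑node penalty, and the pick_subflow() helper are assumptions, not the presented implementation.

    /* Hypothetical sketch of NUMA-locality-aware sub-flow selection.
     * Field names, the remote-node penalty, and the weighting are
     * illustrative assumptions, not ByteDance's implementation. */
    #include <stdint.h>
    #include <stddef.h>

    struct subflow {
        int      nic_numa_node;   /* NUMA node the sub-flow's NIC is attached to */
        uint32_t srtt_us;         /* smoothed RTT measured on this sub-flow      */
        uint32_t cwnd;            /* congestion window (packets)                 */
        uint32_t in_flight;       /* packets currently unacknowledged            */
    };

    /* Prefer sub-flows whose NIC sits on the same NUMA node as the
     * receiving process; among those, pick the lowest-latency path
     * that still has congestion-window headroom. */
    static struct subflow *pick_subflow(struct subflow *sf, size_t n, int app_numa_node)
    {
        struct subflow *best = NULL;
        uint64_t best_score = UINT64_MAX;

        for (size_t i = 0; i < n; i++) {
            if (sf[i].in_flight >= sf[i].cwnd)
                continue;                      /* no room to send on this path */

            uint64_t score = sf[i].srtt_us;
            if (sf[i].nic_numa_node != app_numa_node)
                score += 50;                   /* assumed cross-socket penalty (us) */

            if (score < best_score) {
                best_score = score;
                best = &sf[i];
            }
        }
        return best;   /* NULL means every sub-flow is cwnd-limited */
    }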
Experimental results using a Redis memory‑tiering benchmark show that, compared with the default MPTCP configuration, the optimized strategy can increase throughput by up to 30% and reduce tail latency by 6%. The authors also discuss the behavior of the Completely Fair Scheduler (CFS) under increased load and note that maximizing the benefit requires minimizing cross‑socket scheduling.
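As an operational sketch of the "minimize cross‑socket scheduling" point, the snippet below pins the receiving process to the NUMA node reported for the NIC in sysfs, using libnuma (link with -lnuma). The interface name is only an example, and the talk did not prescribe this exact mechanism.

    /* Illustrative only: keep the receiving process on the NUMA node of a
     * given NIC so the scheduler does not migrate it across sockets. */
    #include <numa.h>
    #include <stdio.h>

    static int nic_numa_node(const char *ifname)
    {
        char path[256];
        int node = -1;

        snprintf(path, sizeof(path), "/sys/class/net/%s/device/numa_node", ifname);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &node) != 1)
                node = -1;
            fclose(f);
        }
        return node;   /* -1 means unknown / no NUMA affinity */
    }

    int main(void)
    {
        const char *ifname = "eth0";           /* example interface name */

        if (numa_available() < 0)
            return 1;                          /* system is not NUMA-aware */

        int node = nic_numa_node(ifname);
        if (node >= 0 && numa_run_on_node(node) == 0)
            printf("%s: pinned to NUMA node %d\n", ifname, node);

        /* ... run the latency-sensitive receive loop here ... */
        return 0;
    }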
Topic 2 – User‑space MPTCP stack built on DPDK: The team implemented a user‑space MPTCP stack using DPDK, targeting data‑center storage and high‑performance computing workloads. The stack follows RFC 8684, can interoperate with kernel‑space MPTCP, and automatically falls back to standard TCP when MPTCP negotiation fails, facilitating migration of existing TCP applications. It consists of two modules:
Sub‑flow management: Handles creation, destruction, and address notification of sub‑flows. Leveraging NIC Flow Bifurcation, multiple sub‑flows are processed within a single DPDK Poll Mode Driver (PMD), exploiting multi‑core parallelism while preserving the PMD's shared‑nothing, lock‑free forwarding characteristics (see the rte_flow sketch after this list).
Sub‑flow scheduling: Determines which sub‑flow sends each packet, supporting various scheduling policies to meet different performance requirements.
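To make the flow‑bifurcation idea concrete, here is a sketch using the generic DPDK rte_flow API to steer one sub‑flow's packets to a dedicated RX queue, so each polling lcore keeps a shared‑nothing, lock‑free path. This is an assumption‑laden illustration rather than ByteDance's code: the match is simplified to an IPv4 destination address plus TCP destination port.

    /* Steer one MPTCP sub-flow to a dedicated RX queue via rte_flow.
     * Addresses, ports, and the one-queue-per-lcore layout are examples. */
    #include <rte_flow.h>
    #include <rte_ethdev.h>
    #include <rte_byteorder.h>

    static struct rte_flow *steer_subflow_to_queue(uint16_t port_id, uint16_t queue_id,
                                                   rte_be32_t dst_ip, rte_be16_t dst_port)
    {
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_error err;

        /* Match the sub-flow's IPv4 destination address and TCP destination port. */
        struct rte_flow_item_ipv4 ip_spec  = { .hdr = { .dst_addr = dst_ip } };
        struct rte_flow_item_ipv4 ip_mask  = { .hdr = { .dst_addr = RTE_BE32(0xffffffff) } };
        struct rte_flow_item_tcp  tcp_spec = { .hdr = { .dst_port = dst_port } };
        struct rte_flow_item_tcp  tcp_mask = { .hdr = { .dst_port = RTE_BE16(0xffff) } };

        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ip_spec,  .mask = &ip_mask },
            { .type = RTE_FLOW_ITEM_TYPE_TCP,  .spec = &tcp_spec, .mask = &tcp_mask },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };

        /* Deliver matching packets to the queue polled by a single lcore. */
        struct rte_flow_action_queue queue = { .index = queue_id };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, &err);
    }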
The stack also provides a zero‑copy interface that eliminates copy overhead between the application and the protocol stack, further raising throughput and lowering latency.
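The zero‑copy interface itself has not been published; purely as a hypothetical shape, such an API could let the application borrow the receive buffer and hand it back, along these lines (all mptcp_us_* names are invented for illustration):

    /* Purely hypothetical API shape for a zero-copy receive path: the
     * application borrows the packet buffer's data area instead of copying
     * it out, then returns the buffer to the stack. The mptcp_us_* names
     * are invented; the real interface has not been released. */
    #include <stddef.h>

    struct mptcp_us_buf {
        void   *data;    /* points directly into the NIC/mbuf memory          */
        size_t  len;     /* payload length available to the application       */
        void   *opaque;  /* stack-internal handle used to release the buffer  */
    };

    /* Borrow the next chunk of received payload without copying. */
    int  mptcp_us_recv_zc(int conn_id, struct mptcp_us_buf *buf);

    /* Return the buffer so the stack can recycle the underlying memory. */
    void mptcp_us_buf_release(int conn_id, struct mptcp_us_buf *buf);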
Performance measurements in a data‑center environment show that, versus a user‑space TCP implementation, the user‑space MPTCP stack reduces latency by roughly 10% and increases throughput by more than 100% (average packet size ≈ 1000 bytes) in normal forwarding scenarios, while also mitigating long‑tail latency under packet loss.
Future work includes open‑sourcing the stack, upstreaming it to the Linux kernel, and gathering community feedback for continued optimization.