Backend Development · 15 min read

Optimizing a Rust STUN Server with Multi‑Threading, SO_REUSEPORT, and Linux recvmmsg/sendmmsg

The article shows how to transform a single‑threaded Rust STUN server into a high‑performance, multi‑core service by using Linux’s SO_REUSEPORT to bind multiple threads, assigning each to a NIC queue, and employing batch syscalls recvmmsg/sendmmsg, achieving over a million packets per second with significantly lower CPU usage.

Bilibili Tech

The article introduces the STUN protocol (Session Traversal Utilities for NAT) and its role in WebRTC, especially for establishing P2P connections in live‑streaming scenarios. It points out that popular open‑source STUN servers such as coturn (C) and stunserver (C++) are single‑threaded and become bottlenecks under high concurrency.

To support massive live‑streaming rooms, the author first describes the need for parallel STUN requests when a user connects to many peers (e.g., 12 peers per user). The theoretical request rate can reach millions of QPS, which stresses a single‑threaded server.
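The million-QPS figure is straightforward multiplication; a sketch under illustrative assumptions (only the 12-peers-per-user figure comes from the article, the other numbers are hypothetical):

```rust
fn main() {
    // Hypothetical load model: only peers_per_user (12) is from the article.
    let concurrent_users: u64 = 500_000; // assumed audience size
    let peers_per_user: u64 = 12;        // article's example
    let connect_window_secs: u64 = 5;    // assumed join-burst duration

    // One STUN Binding request per peer connection during the burst.
    let qps = concurrent_users * peers_per_user / connect_window_secs;
    println!("peak ≈ {} Binding requests/s", qps);
}
```

Even modest assumptions put the peak load above a million requests per second, which is what motivates moving past a single thread.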

After a brief overview of the STUN message format defined in RFC 5389, the article presents a minimal Rust implementation of a single-threaded STUN server. The relevant Cargo dependency is stun = "0.4.1". The core request-processing function is:

use std::net::SocketAddr;
use stun::message::*;
use stun::xoraddr::*;
use nix::sys::socket::SockAddr;

fn process_stun_request(src_addr: SockAddr, buf: Vec<u8>) -> Option<Message> {
    let mut msg = Message::new();
    msg.raw = buf;
    // Only well-formed Binding requests get a response.
    if msg.decode().is_err() {
        return None;
    }
    if msg.typ != BINDING_REQUEST {
        return None;
    }
    match src_addr.to_string().parse::<SocketAddr>() {
        Err(_) => None,
        Ok(src_skt_addr) => {
            // Echo the observed source address back as XOR-MAPPED-ADDRESS.
            let xoraddr = XorMappedAddress {
                ip: src_skt_addr.ip(),
                port: src_skt_addr.port(),
            };
            msg.typ = BINDING_SUCCESS;
            msg.write_header();
            match xoraddr.add_to(&mut msg) {
                Err(_) => None,
                Ok(_) => Some(msg),
            }
        }
    }
}
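The XorMappedAddress attribute used above XORs the reflexive address with the RFC 5389 magic cookie before it goes on the wire; a std-only sketch of the IPv4 encoding (a hypothetical helper, not part of the stun crate):

```rust
const MAGIC_COOKIE: u32 = 0x2112_A442; // fixed value from RFC 5389

// XOR-encode an IPv4 address and port the way XOR-MAPPED-ADDRESS does:
// the port is XORed with the top 16 bits of the magic cookie, the address
// bytes with the cookie's big-endian bytes. Applying it twice decodes again.
fn xor_map_v4(ip: [u8; 4], port: u16) -> ([u8; 4], u16) {
    let x_port = port ^ (MAGIC_COOKIE >> 16) as u16;
    let cookie = MAGIC_COOKIE.to_be_bytes();
    let mut x_ip = [0u8; 4];
    for i in 0..4 {
        x_ip[i] = ip[i] ^ cookie[i];
    }
    (x_ip, x_port)
}
```

The XOR step exists because some NATs rewrite any payload bytes that look like an IP address; the obfuscated form survives such middleboxes.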

The single-threaded server creates a UDP socket, binds it, and loops on recvfrom, calling the function above and sending back the response.

use nix::sys::socket::{self, AddressFamily, InetAddr, IpAddr, MsgFlags, SockFlag, SockType, sockopt};

fn main() {
    // nix's IpAddr (not std's) provides new_v4; bind on 0.0.0.0:3478.
    let inet_addr = InetAddr::new(IpAddr::new_v4(0, 0, 0, 0), 3478);
    run_single_thread(inet_addr);
}

pub fn run_single_thread(inet_addr: InetAddr) {
    let skt_addr = SockAddr::new_inet(inet_addr);
    let skt = socket::socket(AddressFamily::Inet, SockType::Datagram, SockFlag::empty(), None).unwrap();
    socket::bind(skt, &skt_addr).unwrap();
    // A STUN Binding request fits comfortably in a small buffer.
    let mut buf = [0u8; 50];
    loop {
        match socket::recvfrom(skt, &mut buf) {
            Ok((len, Some(src_addr))) => {
                if let Some(msg) = process_stun_request(src_addr, buf[..len].to_vec()) {
                    let _ = socket::sendto(skt, &msg.raw, &src_addr, MsgFlags::empty());
                }
            }
            _ => {}
        }
    }
}

To exploit multi‑core CPUs, the article explains network‑card multi‑queue (RSS) and how binding each RX queue to a specific CPU improves packet processing. It then introduces the Linux SO_REUSEPORT socket option, which allows multiple threads to bind the same UDP port without lock contention.
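The dispatch behavior SO_REUSEPORT provides can be modeled in a few lines; a toy std-only sketch (the kernel's actual hash function differs, this only illustrates the flow-affinity property):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy model of SO_REUSEPORT dispatch: the kernel hashes the flow's 4-tuple
// and uses the result to pick one of the sockets bound to the same port,
// so a given flow consistently lands on the same worker thread.
fn pick_socket(src_ip: u32, src_port: u16, dst_ip: u32, dst_port: u16, n_sockets: usize) -> usize {
    let mut h = DefaultHasher::new();
    (src_ip, src_port, dst_ip, dst_port).hash(&mut h);
    (h.finish() as usize) % n_sockets
}
```

Because the choice is a pure function of the 4-tuple, no lock or shared queue is needed between workers, which is exactly why SO_REUSEPORT avoids the contention of sharing one socket.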

Using the num_cpus crate, the program spawns as many worker threads as there are CPU cores, each creating its own socket with setsockopt(..., ReusePort, &true). The thread-creation code is:

fn main() {
    let inet_addr = InetAddr::new(IpAddr::new_v4(0, 0, 0, 0), 3478);
    let cpu_num = num_cpus::get();
    // Spawn cpu_num - 1 workers; the main thread becomes the last worker,
    // so exactly one socket is bound per core.
    for _ in 1..cpu_num {
        let addr = inet_addr.clone();
        std::thread::spawn(move || run_reuse_port(addr));
    }
    run_reuse_port(inet_addr);
}

The run_reuse_port function is identical to the single‑thread version except that it sets the ReusePort option before binding.

pub fn run_reuse_port(inet_addr: InetAddr) {
    let skt_addr = SockAddr::new_inet(inet_addr);
    let skt = socket::socket(AddressFamily::Inet, SockType::Datagram, SockFlag::empty(), None).unwrap();
    socket::setsockopt(skt, sockopt::ReusePort, &true).unwrap();
    socket::bind(skt, &skt_addr).unwrap();
    // same receive‑process‑reply loop as before
}

Further performance gains come from the Linux-specific batch APIs recvmmsg and sendmmsg, which amortize system-call overhead across many packets. The article provides a full implementation that builds a list of receive buffers, calls recvmmsg with a timeout (e.g., 100 ms), processes each STUN request, and then sends all responses with sendmmsg. The core loop looks like:

loop {
    let mut recv_msg_list = std::collections::LinkedList::new();
    let mut receive_buffers = [[0u8; 32]; 1000];
    // One IoVec per buffer; recvmmsg can fill up to 1000 datagrams per call.
    let iovs: Vec<_> = receive_buffers
        .iter_mut()
        .map(|buf| [IoVec::from_mut_slice(&mut buf[..])])
        .collect();
    for iov in &iovs {
        recv_msg_list.push_back(RecvMmsgData { iov, cmsg_buffer: None });
    }
    let timeout = TimeSpec::from_duration(Duration::from_millis(100));
    let requests = socket::recvmmsg(skt, &mut recv_msg_list, MsgFlags::empty(), Some(timeout)).unwrap_or_default();
    // process each request, build the msgs vector
    // build send_msg_list and call socket::sendmmsg(...)
}
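The two commented-out steps at the end of the loop pair each decoded response with its source address so that all replies can go out in a single sendmmsg call; a std-only sketch of that pairing with stub types (not the nix or stun APIs):

```rust
use std::net::SocketAddr;

// Stub for a ready-to-send STUN response (the real code carries stun::Message.raw).
struct Response {
    raw: Vec<u8>,
    dest: SocketAddr,
}

// Pair each datagram in a received batch with a reply destined for its
// source, mirroring what sendmmsg needs: one (buffer, address) entry per reply.
fn build_responses(batch: &[(SocketAddr, Vec<u8>)]) -> Vec<Response> {
    batch
        .iter()
        .filter_map(|(src, buf)| {
            // Stand-in for process_stun_request(): drop empty datagrams,
            // echo the rest back to their sender.
            if buf.is_empty() {
                None
            } else {
                Some(Response { raw: buf.clone(), dest: *src })
            }
        })
        .collect()
}
```

Since some requests in a batch may fail to decode, the send list can be shorter than the receive list; sendmmsg handles that naturally because it takes its own vector of message headers.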

Benchmark results on a dual‑NIC server show that the multi‑threaded, SO_REUSEPORT + recvmmsg/sendmmsg version reduces CPU usage by about 30 % while handling up to 2.1 Gbps traffic, 1 M pps inbound/outbound, and keeping memory usage under 2 %.

In conclusion, the author emphasizes that:

Multi‑threading combined with NIC multi‑queue and CPU affinity dramatically improves STUN server throughput.

Linux batch APIs (recvmmsg, sendmmsg) lower per-packet overhead; the vlen and timeout parameters must be tuned per workload.

Rust provides high performance, a modern toolchain, and expressive syntax, making it a suitable choice for network services.

performance · Rust · Linux · multithreading · networking · STUN
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
