
Optimizing QUIC Gateway Performance with AF_XDP

Bilibili's video CDN replaced its traditional TCP-based gateway with a QUIC/HTTP-3 gateway. To curb the extra CPU load caused by complex UDP handling, the team adopted AF_XDP kernel-bypass sockets that redirect packets via XDP, cutting CPU usage by about half, raising peak bandwidth to roughly 9 Gbps, and improving per-bandwidth efficiency by up to 30 %.


Bilibili's video CDN has fully deployed a QUIC gateway based on the QUIC and HTTP/3 protocols. Compared with the traditional TCP gateway, the QUIC gateway improves first-frame latency, stall rate and failure rate, but QUIC's more complex protocol headers and the Linux kernel's sub-optimal UDP handling cause higher CPU consumption.

To provide a more stable and smooth viewing experience while reducing resource costs, the network protocol team selected AF_XDP as a kernel‑bypass technique to accelerate packet I/O for the QUIC gateway.

About AF_XDP – AF_XDP is built on eBPF. In the Linux kernel, BPF (originally BSD Packet Filter) is a lightweight virtual machine that can run user‑supplied bytecode at various hook points. eBPF extends BPF with richer functionality. XDP (eXPress Data Path) is an eBPF hook that processes packets as soon as they arrive at the NIC driver. XDP has three modes: generic, native and offload. Native mode runs the XDP program directly in the driver, offering the best performance; generic runs in the kernel stack; offload runs on NIC hardware that supports it.
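Attaching a program in each of the three modes can be illustrated with iproute2. These commands are only illustrative (eth0 and xdp_prog.o are placeholder names), and driver or NIC support determines which modes will actually load:

```shell
# Native (driver) mode: best performance, requires driver support
ip link set dev eth0 xdpdrv obj xdp_prog.o sec xdp_sock

# Generic mode: works with any driver, handled later in the kernel stack
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp_sock

# Offload mode: runs on NIC hardware that supports it
ip link set dev eth0 xdpoffload obj xdp_prog.o sec xdp_sock

# Detach the program
ip link set dev eth0 xdp off
```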

AF_XDP provides a socket interface that redirects packets from an XDP program to user‑space memory (UMEM). An AF_XDP socket (xsk) contains an RX ring and a TX ring; UMEM consists of fixed‑size buffers described by descriptors stored in a FILL ring and a COMPLETION ring. The user fills the FILL ring with buffer descriptors, the kernel consumes them to receive packets, places received descriptors into the RX ring, and the application reads them. For transmission, the application writes data into UMEM buffers, places descriptors on the TX ring, and the kernel sends the packets, returning completed descriptors on the COMPLETION ring.

QUIC performance issues – QUIC runs in user space over UDP. Its protocol logic is complex, and Linux's UDP handling in the kernel is less efficient than TCP, leading to higher CPU load and potential resource shortages during peak traffic.

AF_XDP‑based QUIC gateway architecture – The QUIC server (quic‑server) replaces the traditional UDP socket with an AF_XDP socket. Each server thread owns its own xsk, which is bound to a specific NIC queue via an XDP_REDIRECT program. Incoming packets are redirected from the NIC to the appropriate xsk, processed by the thread, and outgoing packets are sent through the TX ring using a custom XskBatchWriter that batches writes for higher throughput.

Key XDP program (simplified):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Map from NIC queue index to the AF_XDP socket bound to that queue. */
struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

static __always_inline int handle_ipv4(struct xdp_md *ctx);
static __always_inline int handle_ipv6(struct xdp_md *ctx);

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct ethhdr *eth = data;

    /* Bounds check required by the eBPF verifier. */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    if (eth->h_proto == bpf_htons(ETH_P_IP))
        return handle_ipv4(ctx);
    if (eth->h_proto == bpf_htons(ETH_P_IPV6))
        return handle_ipv6(ctx);

    return XDP_PASS;
}

IPv4 handling helper:

static __always_inline int handle_ipv4(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;
    struct iphdr *iph = data + sizeof(struct ethhdr);
    int dport;
    unsigned int key;

    /* Bounds check; note this assumes a 20-byte IPv4 header (no options). */
    if ((void *)(iph + 1) > data_end)
        return XDP_DROP;

    /* Only UDP can carry QUIC; hand everything else to the kernel stack. */
    if (iph->protocol != IPPROTO_UDP)
        return XDP_PASS;

    /* expect_udp_dport is the gateway's QUIC listening port, defined elsewhere. */
    dport = get_udp_dport(iph + 1, data_end);
    if (dport == -1)
        return XDP_DROP;
    if (dport == expect_udp_dport) {
        /* Redirect to the xsk bound to the queue this packet arrived on. */
        key = ctx->rx_queue_index;
        return bpf_redirect_map(&xsks_map, key, 0);
    }

    return XDP_PASS;
}

The code redirects UDP packets destined for the QUIC server's port to the appropriate xsk, where the server reads them via epoll and processes HTTP/3 requests.

Performance analysis – Without AF_XDP, the QUIC server uses the kernel UDP socket, incurring long kernel‑user transitions and extra copies. With native XDP and zero‑copy UMEM, the packet path is dramatically shortened. Benchmarks show:

- CPU load reduced by roughly 50 % in single-threaded tests at comparable bandwidth.
- Maximum achievable bandwidth increased from ~7 Gbps (non-XDP) to ~9 Gbps (XDP).
- CPU-per-bandwidth efficiency improved by 25-30 % in multi-threaded, multi-range scenarios.

Figures (tables and charts) in the original article illustrate these gains.

Future outlook – The team plans to open‑source the AF_XDP modules to the bilibili/quiche repository, develop dedicated AF_XDP performance analysis tools (since generic tools like tcpdump become ineffective), and extend AF_XDP usage to other UDP‑based protocols such as RTP and data channels.

Tags: eBPF, network optimization, Linux kernel, QUIC, performance engineering, AF_XDP
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
