
Why a Zero‑Byte Packet Stopped a High‑Speed Load Balancer and How It Was Fixed

This article details the discovery, analysis, and resolution of a DPDK bug that caused a zero‑byte IP fragment to fill transmission queues, leading to packet loss and tx‑hang in UCloud's high‑availability ULB4 load balancer, and shares the tools and lessons learned for future debugging.

UCloud Tech

Problem Background

ULB4 is UCloud's high‑availability L4 load balancer, built on DPDK. Occasionally a ULB4 server failed and was automatically removed from the cluster because its transmit traffic dropped while its receive traffic remained normal. Users saw brief connection jitter until the node was removed and service recovered.

Problem Diagnosis and Analysis

Using GDB, packet export tools, and traffic mirroring, the team captured abnormal packets from GB‑scale traffic. Disassembly of the i40e_xmit_pkts() function revealed that the driver considered the transmission queue permanently full because it never saw the hardware‑written completion flag, causing subsequent packets to be dropped.
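The "queue permanently full" behaviour can be illustrated with a toy model. The sketch below is not the i40e driver's actual data layout; the class `TxRing`, its fields, and the `simulate` helper are hypothetical names chosen for illustration. It only assumes what the paragraph above states: the driver reclaims a descriptor once the hardware has flagged it complete, and drops packets when no descriptor is free.

```python
class TxRing:
    """Toy transmit descriptor ring (illustrative, not the real i40e layout)."""

    def __init__(self, size):
        self.size = size
        self.done = [False] * size  # per-slot completion flag written by "hardware"
        self.head = 0               # oldest descriptor not yet reclaimed
        self.in_flight = 0          # descriptors handed to hardware, not reclaimed
        self.dropped = 0

    def reclaim(self):
        # The driver frees slots only while the oldest one is flagged complete.
        while self.in_flight and self.done[self.head]:
            self.done[self.head] = False
            self.head = (self.head + 1) % self.size
            self.in_flight -= 1

    def xmit(self):
        self.reclaim()
        if self.in_flight == self.size:
            self.dropped += 1       # ring looks full: the packet is dropped
            return False
        self.in_flight += 1
        return True

    def hardware_complete(self, count):
        # A healthy NIC marks the oldest in-flight descriptors as complete.
        slot = self.head
        for _ in range(min(count, self.in_flight)):
            self.done[slot] = True
            slot = (slot + 1) % self.size


def simulate(ring_size, packets, hang_after):
    """Send `packets` packets; the NIC stops completing after `hang_after`."""
    ring = TxRing(ring_size)
    sent = 0
    for i in range(packets):
        if ring.xmit():
            sent += 1
        if i < hang_after:
            ring.hardware_complete(1)
    return sent, ring.dropped
```

With a 4-slot ring and a NIC that stops completing after the third packet, `simulate(4, 10, 3)` returns `(7, 3)`: the ring absorbs four more packets after the hang, then every later packet is dropped. A healthy run, `simulate(4, 10, 10)`, drops nothing.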

The root cause was an abnormal IP fragment: a 26‑byte packet filled with zeros, produced when the second fragment of an otherwise normal fragmented packet contained only an IP header and no payload. Transmitting this malformed packet caused the NIC to hang (tx‑hang), which prevented any further packet transmission.

Traffic Mirroring to Confirm Abnormal Packets

The team enabled port‑mirroring on the switch, set the NIC to promiscuous mode, disabled GRO, and captured traffic targeting the suspicious IP range. The captured packet showed an IP fragment with only a header, which after switch padding became a 26‑byte zero‑filled packet.

Solution

By adding a check to discard such malformed packets in the DPDK processing path, the issue was resolved. A test server ran for a day without further failures, and the fix was rolled out network‑wide.
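The article does not show the check itself. As a minimal sketch, assuming the check amounts to "discard any IPv4 fragment whose header says it carries no payload", the logic could look like the following. It is written in Python for brevity, whereas a production check would sit in C inside the DPDK path; the names `should_drop` and `make_ipv4_header` are hypothetical.

```python
import struct

IP_MF = 0x2000           # "more fragments" flag
IP_OFFSET_MASK = 0x1FFF  # fragment offset field, in 8-byte units


def should_drop(ipv4: bytes) -> bool:
    """Return True for truncated headers and for fragments with no payload."""
    if len(ipv4) < 20:
        return True
    ver_ihl, _tos, total_length, _ident, flags_off = struct.unpack_from(
        "!BBHHH", ipv4, 0
    )
    header_len = (ver_ihl & 0x0F) * 4
    # A packet is a fragment if MF is set or the fragment offset is non-zero.
    is_fragment = bool(flags_off & (IP_MF | IP_OFFSET_MASK))
    payload_len = total_length - header_len
    return is_fragment and payload_len <= 0


def make_ipv4_header(total_length: int, flags_off: int) -> bytes:
    """Build a 20-byte IPv4 header for demonstration (checksum left as 0)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        (4 << 4) | 5, 0, total_length, 0x1234, flags_off,
        64, 17, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
    )
```

For example, `should_drop(make_ipv4_header(20, 185))` is `True` (a trailing fragment at byte offset 1480 whose total length equals the header length), while `should_drop(make_ipv4_header(1500, IP_MF))` is `False` for an ordinary first fragment.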

DPDK Community Feedback

The bug was reported to the DPDK community and matched a commit dated 2018‑11‑06 that added a check for fragment length. The fix is included in DPDK 18.11.

Retrospective and Summary

The failure chain was:

DPDK cached the first fragment of a split packet.

The second fragment contained only an IP header; DPDK linked it with the first fragment without validation.

The combined packet was sent by ULB4, triggering a NIC tx‑hang.

The NIC stopped updating transmission descriptors, filling the queue and causing packet drops.
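The first two steps of the chain can be sketched with a simplified reassembly model. This is not DPDK's reassembly code; `reassemble` and its tuple format are hypothetical, and the completeness rule (received bytes equal the end offset of the last fragment) is an assumption made for the illustration. It shows how a header-only final fragment can complete a packet while adding an empty segment to it, and how a length check prevents that.

```python
def reassemble(fragments, validate=False):
    """Toy reassembly.

    fragments: list of (offset_bytes, payload_len, more_fragments) tuples.
    Returns the payload lengths of the linked segments in offset order,
    or None if the packet is not complete.
    """
    kept = []
    for offset, length, more in fragments:
        if validate and length <= 0:
            continue  # discard fragments that carry no payload
        kept.append((offset, length, more))

    # The fragment with more_fragments == False fixes the total size.
    total = None
    for offset, length, more in kept:
        if not more:
            total = offset + length
    if total is None:
        return None

    received = sum(length for _, length, _ in kept)
    if received != total:
        return None
    return [length for _, length, _ in sorted(kept)]
```

Without validation, `reassemble([(0, 1480, True), (1480, 0, False)])` returns `[1480, 0]`: the packet is treated as complete and its second segment is empty. With `validate=True` the same input returns `None`, because the empty fragment is discarded and the packet never completes.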

Direct hardware manipulation in user‑space (DPDK) can lead to such critical failures, highlighting the need for thorough validation of packet structures.

Tool Value

A one‑click packet export tool was developed to dump all packets from a DPDK driver queue and convert them to PCAP for analysis. The tool will be open‑sourced on UCloud's GitHub to aid other DPDK developers.
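The tool itself is not shown in the article. The PCAP conversion step, at least, is straightforward to sketch, because the classic PCAP file format is just a 24-byte global header followed by one 16-byte record header per packet. The function name `packets_to_pcap` and the input tuple shape are hypothetical, and how the real tool reads packets out of the driver queue is not covered here.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4   # classic PCAP, microsecond timestamps
LINKTYPE_ETHERNET = 1


def packets_to_pcap(packets, snaplen=65535):
    """Serialize (ts_sec, ts_usec, frame_bytes) tuples into PCAP file bytes."""
    # Global header: magic, version 2.4, timezone offset, timestamp accuracy,
    # snapshot length, link-layer type.
    out = bytearray(
        struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, snaplen, LINKTYPE_ETHERNET)
    )
    for ts_sec, ts_usec, frame in packets:
        captured = frame[:snaplen]
        # Record header: timestamp, captured length, original length.
        out += struct.pack("<IIII", ts_sec, ts_usec, len(captured), len(frame))
        out += captured
    return bytes(out)
```

Writing the result to disk, for example `open("dump.pcap", "wb").write(packets_to_pcap(frames))`, gives a file that standard tools such as Wireshark or tcpdump can open.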

Final Thoughts

While DPDK is generally stable, edge cases like malformed fragments can cause severe issues in production gateways. Understanding DPDK internals and having robust debugging tools are essential for maintaining service reliability.
