Why a Zero‑Byte Packet Stopped a High‑Speed Load Balancer and How It Was Fixed
This article details the discovery, analysis, and resolution of a DPDK bug in which an IP fragment with a zero‑byte payload hung the NIC, so that the transmission queue filled up and packets were dropped in UCloud's high‑availability ULB4 load balancer. It also shares the tools built and the lessons learned for future debugging.
Problem Background
ULB4, UCloud's high‑availability L4 load balancer built on DPDK, experienced occasional server failures: transmit‑direction traffic on a node dropped while receive‑direction traffic remained normal, and the node was automatically removed from the cluster. Users saw brief connection jitter before service recovered.
Problem Diagnosis and Analysis
Using GDB, packet export tools, and traffic mirroring, the team captured abnormal packets from GB‑scale traffic. Disassembly of the i40e_xmit_pkts() function revealed that the driver considered the transmission queue permanently full because it never saw the hardware‑written completion flag, causing subsequent packets to be dropped.
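The "queue looks permanently full" behaviour can be illustrated with a toy model. This is a minimal sketch, not the real i40e descriptor layout or driver logic: the names `TxRing`, `TxDescriptor`, `xmit`, and `nic_complete` are hypothetical, and the only assumption carried over from the text is that the driver frees a slot only after the hardware has written a completion flag.

```python
from dataclasses import dataclass


@dataclass
class TxDescriptor:
    in_use: bool = False  # driver has handed this slot to the NIC
    done: bool = False    # completion flag the NIC writes back after sending


class TxRing:
    """Toy transmit ring (hypothetical, not the real i40e structures)."""

    def __init__(self, size: int) -> None:
        self.slots = [TxDescriptor() for _ in range(size)]
        self.next_to_use = 0
        self.next_to_clean = 0
        self.free = size
        self.dropped = 0

    def _reclaim(self) -> None:
        # Free slots only while the oldest in-flight slot is marked done.
        while self.free < len(self.slots):
            slot = self.slots[self.next_to_clean]
            if not slot.done:
                break
            slot.in_use = slot.done = False
            self.next_to_clean = (self.next_to_clean + 1) % len(self.slots)
            self.free += 1

    def xmit(self) -> bool:
        """Try to queue one packet; drop it if no slot can be reclaimed."""
        self._reclaim()
        if self.free == 0:
            self.dropped += 1
            return False
        self.slots[self.next_to_use].in_use = True
        self.next_to_use = (self.next_to_use + 1) % len(self.slots)
        self.free -= 1
        return True

    def nic_complete(self, count: int) -> None:
        """Healthy NIC: mark up to `count` of the oldest in-flight slots done."""
        in_flight = len(self.slots) - self.free
        idx = self.next_to_clean
        for _ in range(min(count, in_flight)):
            self.slots[idx].done = True
            idx = (idx + 1) % len(self.slots)
```

While `nic_complete` keeps being called, `xmit` keeps succeeding. Once the hardware stops writing completions, the ring fills after `size` packets and every later call to `xmit` counts as a drop, which mirrors the symptom described above.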
The root cause was an abnormal IP fragment: the second fragment of an otherwise normal fragmented packet contained only an IP header and no payload, and in the capture it appeared as a header followed by 26 bytes of zeros. Once this malformed fragment was passed on for transmission, the NIC hung (tx‑hang) and no further packets could be sent.
Traffic Mirroring to Confirm Abnormal Packets
The team enabled port‑mirroring on the switch, set the NIC to promiscuous mode, disabled GRO, and captured traffic targeting the suspicious IP range. The captured packet was an IP fragment consisting of a header only, followed by 26 bytes of zeros added as padding by the switch.
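One plausible reading of the 26 zero bytes is padding up to the 60‑byte minimum Ethernet frame (14‑byte Ethernet header + 20‑byte IPv4 header + 26 bytes of padding). That reading is an assumption rather than something the capture description states. The sketch below builds such a frame and parses it; the addresses, identification value, and fragment offset are made‑up illustration values, and both function names are hypothetical.

```python
import struct

ETH_HEADER_LEN = 14
ETH_MIN_FRAME_LEN = 60  # minimum Ethernet frame size, excluding the 4-byte FCS


def describe_ipv4_frame(frame: bytes) -> dict:
    """Report header, payload, fragment, and padding sizes of an IPv4 frame."""
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ver_ihl, _tos, total_len, _ident, flags_frag = struct.unpack_from(
        "!BBHHH", frame, ETH_HEADER_LEN
    )
    header_len = (ver_ihl & 0x0F) * 4
    return {
        "header_len": header_len,
        "payload_len": total_len - header_len,
        "more_fragments": bool(flags_frag & 0x2000),
        "frag_offset": (flags_frag & 0x1FFF) * 8,
        "padding_len": len(frame) - ETH_HEADER_LEN - total_len,
    }


def build_header_only_fragment() -> bytes:
    """Build a last fragment whose IPv4 total length equals its header length."""
    eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45,          # version 4, IHL 5 (20-byte header)
        0,             # TOS
        20,            # total length: header only, no payload
        0x1234,        # identification (illustrative)
        1480 // 8,     # MF=0, fragment offset in 8-byte units (illustrative)
        64,            # TTL
        17,            # protocol (UDP, illustrative)
        0,             # checksum left as zero in this sketch
        bytes([10, 0, 0, 1]),
        bytes([10, 0, 0, 2]),
    )
    frame = eth + ip
    # Pad with zeros up to the minimum frame size, as a switch or NIC would.
    return frame + bytes(ETH_MIN_FRAME_LEN - len(frame))
```

Calling `describe_ipv4_frame(build_header_only_fragment())` reports a payload length of 0 and a padding length of 26, which is the shape of packet the mirror capture was looking for.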
Solution
The team resolved the issue by adding a check in the DPDK processing path that discards such malformed packets. A test server then ran for a day without further failures, and the fix was rolled out network‑wide.
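The idea behind the check can be sketched as follows. This is a simplified model under stated assumptions, not the code that was deployed or the DPDK reassembly API: `Fragment` and `Reassembler` are hypothetical names, and the rule shown (drop any fragment whose payload length is not positive) is one straightforward way to reject a header‑only fragment before it is chained to the cached first fragment.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    total_len: int        # IPv4 total length field
    header_len: int       # IPv4 header length in bytes (IHL * 4)
    frag_offset: int      # offset of this fragment's payload in bytes
    more_fragments: bool  # MF flag

    @property
    def payload_len(self) -> int:
        return self.total_len - self.header_len


@dataclass
class Reassembler:
    """Hypothetical reassembly queue that validates fragments before chaining."""

    segments: list = field(default_factory=list)
    dropped: int = 0

    def add(self, frag: Fragment) -> bool:
        # A fragment with no payload would become a zero-length segment in
        # the reassembled packet, so reject it instead of linking it in.
        if frag.payload_len <= 0:
            self.dropped += 1
            return False
        self.segments.append(frag)
        return True
```

With this guard, a first fragment of 1500 bytes is accepted, while a second fragment whose total length equals its 20‑byte header is counted in `dropped` and never reaches the transmit path.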
DPDK Community Feedback
The bug was reported to the DPDK community, where it matched a commit dated 2018‑11‑06 that added a check on fragment length. The fix is included in DPDK 18.11.
Retrospective and Summary
The failure chain was:
DPDK cached the first fragment of a split packet.
The second fragment contained only an IP header; DPDK linked it with the first fragment without validation.
The combined packet was sent by ULB4, triggering a NIC tx‑hang.
The NIC stopped updating transmission descriptors, filling the queue and causing packet drops.
Because DPDK drives the hardware directly from user space, a single unvalidated packet can cause a failure this severe, which highlights the need for thorough validation of packet structures before transmission.
Tool Value
A one‑click packet export tool was developed to dump all packets from a DPDK driver queue and convert them to PCAP for analysis. The tool will be open‑sourced on UCloud's GitHub to aid other DPDK developers.
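The conversion step of such a tool can be approximated with the classic PCAP file format. The sketch below is not UCloud's tool: it assumes the packets have already been extracted from the driver queue as raw Ethernet frames with timestamps, and `write_pcap` and `dump_to_bytes` are hypothetical helper names. It writes the 24‑byte global header (little‑endian magic `0xA1B2C3D4`, version 2.4, link type 1 for Ethernet) followed by a 16‑byte record header per packet.

```python
import io
import struct

LINKTYPE_ETHERNET = 1


def write_pcap(stream, packets, snaplen: int = 65535) -> None:
    """Write (ts_sec, ts_usec, frame_bytes) tuples to `stream` as classic PCAP."""
    # Global header: magic, version major/minor, timezone offset,
    # timestamp accuracy, snapshot length, link-layer type.
    stream.write(
        struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, snaplen, LINKTYPE_ETHERNET)
    )
    for ts_sec, ts_usec, data in packets:
        captured = data[:snaplen]
        # Record header: timestamp, captured length, original length.
        stream.write(struct.pack("<IIII", ts_sec, ts_usec, len(captured), len(data)))
        stream.write(captured)


def dump_to_bytes(packets) -> bytes:
    """Convenience wrapper that returns the PCAP content as bytes."""
    buf = io.BytesIO()
    write_pcap(buf, packets)
    return buf.getvalue()
```

Passing an open binary file (for example `open("queue.pcap", "wb")`, with a file name chosen only for illustration) instead of a `BytesIO` produces a file that Wireshark or tcpdump can read.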
Final Thoughts
While DPDK is generally stable, edge cases like malformed fragments can cause severe issues in production gateways. Understanding DPDK internals and having robust debugging tools are essential for maintaining service reliability.
— END —
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.