Midnight Kernel Panic: How a Corrupt skb Triggered Chaos and What We Learned
An in‑depth post‑mortem of a recurring midnight kernel panic caused by a malformed sk_buff structure, detailing how vmcore analysis, packet inspection, and netfilter hook tracing pinpointed a third‑party module bug in Linux 5.10, and outlining remediation steps.
Midnight Kernel Panic Overview
This post-mortem follows the investigation of a kernel fault in which an skb (socket buffer) with inconsistent length fields caused nightly panics after a kernel upgrade.
What Is skb?
skb, short for sk_buff, is a core data structure in the Linux kernel networking stack that carries packets between protocol layers. It stores packet headers, payload, length, flags, and internal management pointers.
Key skb Fields
skb->len : total packet length, covering the linear area plus any paged fragments.
skb->data_len : number of bytes held in paged fragments (the non‑linear part); a non‑zero value means the skb is non‑linear.
Linear skb : data resides in a contiguous memory block.
Non‑linear skb : data is split across multiple non‑contiguous blocks, useful for high‑throughput scenarios.
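The fault manifested because skb->len was smaller than skb->data_len. This is an impossible condition: the length of the linear area is computed as len minus data_len, so the total can never be smaller than the fragment portion. Code that trusted the fields computed a bogus linear length, and the kernel panicked.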
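The relationship between the two length fields can be sketched in user space. The struct and function names below are hypothetical stand-ins, not kernel definitions; only the arithmetic mirrors the kernel's skb_headlen() and skb_is_nonlinear() helpers. The point of the sketch is what happens to an unsigned subtraction when len is 4 bytes smaller than data_len.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the two length fields of struct sk_buff. */
struct skb_lengths {
    uint32_t len;      /* total bytes: linear area + paged fragments */
    uint32_t data_len; /* bytes held in paged fragments only */
};

/* The invariant every valid skb satisfies. */
bool skb_lengths_sane(const struct skb_lengths *s)
{
    return s->len >= s->data_len;
}

/* Non-linear means some bytes live in fragments (same test as skb_is_nonlinear()). */
bool skb_lengths_nonlinear(const struct skb_lengths *s)
{
    return s->data_len != 0;
}

/* Linear-area length with no validation, same arithmetic as skb_headlen(). */
uint32_t headlen_unchecked(const struct skb_lengths *s)
{
    return s->len - s->data_len; /* wraps around if the invariant is broken */
}

/* Linear-area length that refuses to run on inconsistent fields. */
uint32_t headlen_checked(const struct skb_lengths *s)
{
    assert(skb_lengths_sane(s));
    return s->len - s->data_len;
}
```

With illustrative values len = 996 and data_len = 1000 (not taken from the vmcore), headlen_unchecked() wraps to 0xFFFFFFFC, so any code that treats the result as a buffer size reads or writes far out of bounds.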
Initial Diagnosis with vmcore
Opening the vmcore revealed that len was consistently 4 bytes less than data_len. This pattern suggested a systematic handling error rather than random memory corruption.
Locating the Faulty skb
By inspecting the packet headers in the vmcore, the team identified an IPv4 packet with a UDP payload. The packet originated from port 62463 and was destined for port 43122, matching the service that crashed.
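Identifying the packet comes down to reading a few fixed offsets in the header bytes recovered from the dump. The following is a minimal sketch of that decoding, assuming the raw bytes start at the IPv4 header; the function and struct names are hypothetical and the article does not show the tooling the team actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: UDP ports in host byte order. */
struct udp_ports {
    uint16_t src;
    uint16_t dst;
};

/*
 * Decode source and destination ports from a buffer that starts at an IPv4
 * header. Returns 0 on success, -1 if the bytes are not IPv4 carrying UDP
 * or the buffer is too short.
 */
int parse_ipv4_udp_ports(const uint8_t *buf, size_t n, struct udp_ports *out)
{
    assert(buf != NULL && out != NULL);

    if (n < 20 || (buf[0] >> 4) != 4)   /* minimum IPv4 header, version 4 */
        return -1;

    size_t ihl = (size_t)(buf[0] & 0x0f) * 4; /* header length in bytes */
    if (ihl < 20 || n < ihl + 8)        /* need a full 8-byte UDP header */
        return -1;

    if (buf[9] != 17)                   /* IP protocol 17 is UDP */
        return -1;

    const uint8_t *udp = buf + ihl;
    out->src = (uint16_t)((udp[0] << 8) | udp[1]); /* network byte order */
    out->dst = (uint16_t)((udp[2] << 8) | udp[3]);
    return 0;
}
```

For the packet in this incident, the UDP header would begin with the bytes F3 FF A8 72, which decode to ports 62463 and 43122.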
Tracing Netfilter Hooks
The crash occurred between IP layer processing and ip_local_deliver_finish, which pointed to a netfilter hook registered at NF_INET_LOCAL_IN. Only one hook was registered there, and it belonged to the sn_core_odd module.
Root Cause
The issue stemmed from a change in Linux kernel 5.10: the skb_make_writable function no longer copies fragmented data into a linear area, causing third‑party modules that relied on the old behavior to corrupt the skb length fields.
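To make "copying fragmented data into a linear area" concrete, here is a toy user-space model with one linear buffer and one fragment buffer. Everything in it (toy_skb, toy_pull_into_head, the fixed capacities) is hypothetical and far simpler than the kernel's real helpers such as pskb_may_pull() and skb_linearize(); the article does not show the module's code, so this illustrates the bookkeeping a correct pull must do, not what sn_core_odd did.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HEAD_CAP 64
#define FRAG_CAP 64

/* Toy model: one linear buffer plus a single fragment buffer. */
struct toy_skb {
    uint8_t  head[HEAD_CAP]; /* linear area */
    uint8_t  frag[FRAG_CAP]; /* stands in for paged fragments */
    uint32_t len;            /* total bytes in head + frag */
    uint32_t data_len;       /* bytes currently in frag */
};

uint32_t toy_headlen(const struct toy_skb *s)
{
    return s->len - s->data_len;
}

/*
 * Make sure the first `want` bytes of the packet sit in the linear area,
 * copying from the fragment if needed. Returns 0 on success, -1 on failure.
 */
int toy_pull_into_head(struct toy_skb *s, uint32_t want)
{
    if (s->len < s->data_len || s->data_len > FRAG_CAP)
        return -1;                      /* fields already inconsistent */

    uint32_t hl = toy_headlen(s);
    if (want <= hl)
        return 0;                       /* already linear enough */
    if (want > s->len || want > HEAD_CAP)
        return -1;                      /* packet too short or no room */

    uint32_t need = want - hl;          /* bytes to move out of the fragment */
    memcpy(s->head + hl, s->frag, need);
    memmove(s->frag, s->frag + need, s->data_len - need);

    s->data_len -= need;                /* fragment shrinks ... */
    /* ... while len stays the same: the packet did not change size. */
    assert(s->len >= s->data_len);
    return 0;
}
```

The key bookkeeping rule is that a pull moves bytes between areas without changing the total: data_len goes down and len is left alone. A module that touches header bytes beyond the linear area without first pulling them in, or that adjusts one field without the other, can leave the fields in the inconsistent state seen in the vmcore.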
Remediation
Removing the offending sn_core_odd module stopped the panics. The team also recommends updating third‑party network modules to be compatible with the newer kernel API and considering eBPF‑based replacements to reduce maintenance risk.
Conclusion
Kernel panics caused by malformed skbs can be traced through vmcore analysis, packet inspection, and netfilter hook tracing. Proper handling of skb linearization in kernel 5.10 is essential for third‑party modules.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
