Why Weave FastDb Crashes VM Networks and How to Fix It
This article explains why Weave's FastDb mode can cause network interruptions on CentOS 7 VMs, analyzes the underlying kernel bugs and UDP PMTU probing issues, and provides practical solutions such as kernel upgrades, disabling UFO, and adjusting MTU settings.
Weave is a Docker cross‑host networking solution that creates an overlay network so containers can communicate without manual port mapping or linking. It supports encrypted traffic, allowing connections from untrusted networks.
Starting with version 1.2, Weave moved from a userspace implementation to the kernel Open vSwitch (OVS) module with VXLAN, which greatly improves performance. This tighter kernel integration also introduces a few kernel‑related pitfalls.
Problem 1: Network interruption caused by Weave FastDb
FastDb (Weave's fast datapath) has been the default mode since version 1.2. It uses the kernel OVS module and VXLAN encapsulation. On CentOS 7.0 (kernel 3.10.0‑123) VMs created with qemu‑kvm, enabling FastDb leaves the virtio_net NIC unable to send data, which results in a complete network outage.
Analysis
The issue is triggered by a kernel bug (see commit http://t.cn/Ro53BsH). For PMTU discovery, Weave sends a 60 KB UDP packet through a raw socket. On affected kernels this corrupts the memory used by virtio_net, after which the host can no longer receive packets.
The same problem can occur with any application that uses a raw socket to send packets larger than the interface MTU while the interface’s UFO feature is enabled.
Solution
Upgrade the kernel to version 3.13 or newer.
Disable the UFO feature on the VM’s network interface.
Use the CentOS 7.1 kernel (3.10.0‑229), which already contains the fix.
Problem 2: Weave cannot use FastDb mode
On CentOS 7 (kernel 3.10.0‑327.10.1.el7.x86_64) with MTU ≤ 1474, Weave fails to select FastDb and falls back to sleeve mode, despite FastDb being the default.
Analysis
FastDb relies on ODP (the OVS datapath). Weave decides which mode to use by sending a heartbeat UDP packet on port 6784. When the heartbeat (1474 bytes) is larger than the interface MTU (for example, 1454), it would have to be fragmented, and the heartbeat cannot be sent. Raising the MTU to 1500 avoids the problem.
Instead of fragmenting the packet, the kernel generates an ICMP “Fragmentation Needed” error.
The error originates in the kernel’s ip_fragment function, which calls icmp_send:
    if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
                 (IPCB(skb)->frag_max_size && IPCB(skb)->frag_max_size > mtu))) {
        IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
        icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                  htonl(mtu));
        kfree_skb(skb);
        return -EMSGSIZE;
    }
The heartbeat’s send path sets skb->ignore_df = 1 in vxlan_tnl_send, but skb_scrub_packet later resets it to 0, so the kernel treats the packet as DF‑set and emits the ICMP error:
    void skb_scrub_packet(struct sk_buff *skb, bool xnet)
    {
        skb->tstamp.tv64 = 0;
        skb->pkt_type = PACKET_HOST;
        skb->skb_iif = 0;
        skb->ignore_df = 0;
        skb_dst_drop(skb);
        secpath_reset(skb);
        nf_reset(skb);
        nf_reset_trace(skb);
        if (!xnet)
            return;
        skb_orphan(skb);
        skb->mark = 0;
    }
Older kernels (e.g., 3.10.0‑123.el7) call secpath_reset, which does not reset local_df (the earlier name of the ignore_df field), so the bug does not appear there:
    static inline void
    secpath_reset(struct sk_buff *skb)
    {
    #ifdef CONFIG_XFRM
        secpath_put(skb->sp);
        skb->sp = NULL;
    #endif
    }
Solution
Ensure the interface MTU is set to the default 1500.
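The interaction between packet size, MTU, and the ignore_df flag can be modelled in user space. The sketch below copies the condition from the ip_fragment excerpt into a plain function. It is an illustration, not kernel code: the enum, the function names, and the assumption that the heartbeat carries the DF bit are introduced here, and the sizes used in the usage note are the ones quoted in this article.

```c
#include <stdbool.h>

enum send_result {
    RESULT_SENT_WHOLE,      /* packet fits in the MTU, no fragmentation */
    RESULT_SENT_FRAGMENTED, /* kernel is allowed to fragment the packet */
    RESULT_ICMP_FRAG_NEEDED /* packet dropped, ICMP error generated */
};

/* User-space copy of the test that guards icmp_send() in ip_fragment. */
bool frag_rejected(bool df_set, bool ignore_df,
                   unsigned int frag_max_size, unsigned int mtu)
{
    return (df_set && !ignore_df) ||
           (frag_max_size && frag_max_size > mtu);
}

/* Predict what happens to a packet of pkt_len bytes on a link with this MTU. */
enum send_result send_outcome(unsigned int pkt_len, unsigned int mtu,
                              bool df_set, bool ignore_df)
{
    if (pkt_len <= mtu)
        return RESULT_SENT_WHOLE;
    if (frag_rejected(df_set, ignore_df, 0, mtu))
        return RESULT_ICMP_FRAG_NEEDED;
    return RESULT_SENT_FRAGMENTED;
}
```

With a 1474-byte heartbeat and an MTU of 1454, send_outcome returns RESULT_ICMP_FRAG_NEEDED once ignore_df has been reset to 0, and RESULT_SENT_FRAGMENTED if the flag had survived. With an MTU of 1500 the packet fits and the flag no longer matters, which is why the workaround above is effective.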
Conclusion
Weave’s ODP feature relies heavily on kernel capabilities; both FastDb‑related network interruptions stem from kernel‑level interactions. Understanding the kernel’s behavior allows you to diagnose and resolve these issues effectively.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
