When MTU Misconfiguration Turns Into a Two‑Day Network Mystery
A two‑day investigation of intermittent packet loss in a hybrid‑cloud Kubernetes environment traced the problem to a pod‑network MTU left at 1500 over a VXLAN overlay, which caused fragmentation and drops. The write‑up walks through MTU fundamentals, diagnostic commands, Cilium configuration changes, and best‑practice recommendations for cloud‑native networks.
In a hybrid‑cloud setup (on‑premises Kubernetes clusters with Alibaba Cloud and AWS), a live‑streaming service began experiencing intermittent video stutter and high packet loss after a new feature increased payload size. Initial metrics showed a TCP retransmission rate of 2.3% (normal <0.1%) and a packet loss rate of 1.8%, with P99 latency jumping from 5 ms to 200 ms.
Standard checks of CPU, memory, disk I/O, and core switch health found nothing abnormal. The issue only appeared with large packets; small file transfers worked fine. A dmesg entry reported "eth0: dropped packet, size 1514 > 1500", pointing to an MTU mismatch.
MTU Basics
MTU (Maximum Transmission Unit) defines the largest packet a network device can carry. Common values include 1500 bytes for Ethernet II, 1492 bytes for PPPoE, 1476 bytes for GRE, 1450 bytes for VXLAN, ~1400 bytes for IPsec, and 9000 bytes for Jumbo Frames. MTU includes IP and TCP headers, while MSS (Maximum Segment Size) counts only the TCP payload (default 1460 bytes for Ethernet).
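The MTU/MSS relationship above can be sketched with a little shell arithmetic (a minimal sketch, assuming plain 20‑byte IP and TCP headers with no options):

```shell
# MSS = MTU - 20 (IP header) - 20 (TCP header), assuming no header options
mss() { echo $(( $1 - 40 )); }

mss 1500   # Ethernet II -> 1460
mss 1450   # VXLAN       -> 1410
mss 1492   # PPPoE       -> 1452
```

IP or TCP options (timestamps, SACK) shrink the usable payload further, which is why conservative MSS clamps often sit a few dozen bytes below these values.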
Path MTU Discovery (PMTUD) works by sending packets with the DF flag; routers that cannot forward the packet reply with ICMP "Fragmentation Needed". Many firewalls drop these ICMP messages, causing the "PMTUD black‑hole" problem.
Deep Dive Investigation
All interfaces reported MTU 1500. However, the VXLAN tunnel added a 50‑byte overhead (Ethernet 14 + IP 20 + UDP 8 + VXLAN 8), making the effective MTU 1450. Since the pod network MTU was still 1500, packets larger than 1450 bytes were fragmented and dropped.
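The 50‑byte figure breaks down header by header; a quick sketch of the arithmetic:

```shell
# VXLAN encapsulation overhead: the outer headers added to every pod packet
outer_eth=14   # outer Ethernet header
outer_ip=20    # outer IPv4 header
outer_udp=8    # outer UDP header
vxlan_hdr=8    # VXLAN header
overhead=$(( outer_eth + outer_ip + outer_udp + vxlan_hdr ))
echo "overhead=${overhead} effective_mtu=$(( 1500 - overhead ))"
# -> overhead=50 effective_mtu=1450
```

Any pod packet above 1450 bytes therefore exceeds the 1500‑byte physical MTU once encapsulated.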
Verification steps:
# Check physical NIC MTU
ip link show eth0 | grep mtu
# Check Cilium host MTU
ip link show cilium_host | grep mtu
# Check pod MTU
kubectl exec -it test-pod -- ip link show eth0 | grep mtu

The cilium_host interface incorrectly showed MTU 1500. A tcpdump capture on the host, filtered for ICMP type 3 code 4, confirmed "Fragmentation Needed" messages advertising an MTU of 1450.
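A capture filter along these lines can surface those messages (a sketch; requires root, and the interface name is an assumption that will differ per host):

```shell
# ICMP type 3 (Destination Unreachable), code 4 (Fragmentation Needed);
# the router's advertised next-hop MTU appears in the -vv output.
tcpdump -ni eth0 -vv 'icmp[0] == 3 and icmp[1] == 4'
```

If this capture stays silent while large packets still fail, a firewall is likely dropping the ICMP messages, i.e. the PMTUD black‑hole case described above.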
Cilium 1.15 changed the default PMTUD behavior, disabling automatic MTU mismatch handling. The previous version would have corrected the issue automatically.
Solution
Two primary fixes were applied:
Adjust the pod network MTU to 1450 and enable PMTUD in the Cilium ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  mtu: "1450"
  enable-pmtu-discovery: "true"

Restart the Cilium DaemonSet to apply the changes.
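Rolling the DaemonSet is enough to pick up the new MTU; a minimal sketch, assuming the standard `cilium` DaemonSet name in `kube-system`:

```shell
# Restart Cilium agents and wait for the rollout to complete
kubectl rollout restart daemonset/cilium -n kube-system
kubectl rollout status daemonset/cilium -n kube-system --timeout=5m
```

Note that already-running pods keep their old interface MTU until they are recreated, so a rolling restart of workloads may also be needed.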
Apply TCP MSS clamping via iptables as a safety net:
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# Or set a fixed MSS
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1400
iptables-save > /etc/iptables/rules.v4

After the changes, a test script confirmed successful transmission of ICMP payload sizes up to 1422 bytes (1450 minus 28 bytes of IP and ICMP headers):
#!/bin/bash
# mtu_test.sh: probe ICMP payload sizes upward to find the MTU boundary
TARGET_IP=$1
for size in $(seq 1400 1500); do
  if ping -M do -c 1 -s "$size" "$TARGET_IP" > /dev/null 2>&1; then
    echo "Size $size: OK"
  else
    echo "Size $size: FAIL <-- MTU boundary"
    break
  fi
done

Best‑Practice Recommendations
Standardize MTU across physical, virtual, and container layers (e.g., 1450 bytes for VXLAN environments).
Enable PMTUD and verify ICMP messages are not filtered by firewalls.
Document MTU settings in CMDB and include them in Ansible or other automation playbooks.
Monitor network‑layer metrics (packet drops, retransmissions) in Prometheus and set alerts.
Perform gradual rollouts and keep a control cluster when upgrading CNI plugins.
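The standardization advice above lends itself to a small audit helper (a sketch; the Cilium interface names are assumptions and will differ by CNI and host):

```shell
check_mtu() {
  # check_mtu IFACE EXPECTED: print a MISMATCH line if the live MTU differs
  mtu=$(cat /sys/class/net/"$1"/mtu 2>/dev/null) || return 0   # skip absent interfaces
  [ "$mtu" -eq "$2" ] || echo "MISMATCH: $1 has MTU $mtu (want $2)"
}

# Audit pod-facing interfaces against the VXLAN-safe value
for ifc in cilium_host cilium_net; do
  check_mtu "$ifc" 1450
done
```

Run under cron or a node DaemonSet, the MISMATCH lines feed naturally into the Prometheus alerting recommended above.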
Tools & Scripts
Several diagnostic utilities were assembled:
Basic commands: ip -d link show, ip route get, ss -ti, cat /proc/net/snmp.
PMTUD test script (pmtud_test.sh) performing a binary search to find the maximum usable MTU.
One‑click diagnosis script (mtu_diagnosis.sh) checking interface MTU, route cache, kernel parameters, drop statistics, and iptables MSS rules.
Health‑check script (mtu_health_check.sh) verifying interface MTU, packet drops, PMTUD functionality, and kernel settings.
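The binary‑search idea behind pmtud_test.sh can be sketched as follows. Here `probe` is a hypothetical stand‑in for the real `ping -M do -c 1 -s SIZE TARGET_IP` call, faking a 1450‑byte path MTU so the search logic can be exercised offline:

```shell
# Stand-in probe: succeeds for payloads that would fit a 1450-byte path MTU
# (1450 - 28 bytes of IP/ICMP headers = 1422-byte max payload)
probe() { [ "$1" -le 1422 ]; }

# Binary search for the largest payload size for which probe succeeds
lo=1200; hi=1500
while [ "$lo" -lt "$hi" ]; do
  mid=$(( (lo + hi + 1) / 2 ))
  if probe "$mid"; then lo=$mid; else hi=$(( mid - 1 )); fi
done
echo "max payload: $lo"
```

The search needs about nine probes for a 300‑byte range, versus up to a hundred for the linear sweep in mtu_test.sh.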
Advanced Scenarios
Guidance covers MTU handling in physical, virtual, container, cloud, SD‑WAN, GPU/RDMA, IPv6, and eBPF/XDP environments, including configuration snippets for Calico, Flannel, Cilium, Istio, and various cloud providers (AWS, Alibaba Cloud, Azure, GCP).
Production‑Level MTU Specification
# mtu-config-template.yaml
physical_network:
  datacenter:
    core_switch_mtu: 9000
    access_switch_mtu: 9000
    server_nic_mtu: 9000
  wan:
    mtu: 1500
virtualization:
  hypervisor:
    virtual_switch_mtu: 1500
    vm_nic_mtu: 1500
container_network:
  cni_mtu: 1450
  pod_mtu: 1450
tunnels:
  vxlan_mtu: 1450
  ipsec_mtu: 1400
  wireguard_mtu: 1420
cloud:
  aws_vpc_mtu: 9001
  aliyun_vpc_mtu: 1500
  gcp_vpc_mtu: 1460

A change‑management script (mtu_change_procedure.sh) outlines pre‑change checks, backup, gradual rollout, verification, and monitoring steps.
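The tunnel figures in the template follow directly from per‑protocol overheads (a sketch; IPsec overhead varies with cipher and mode, so 100 bytes is a conservative assumption rather than an exact value):

```shell
base=1500
echo "vxlan_mtu=$(( base - 50 ))"       # Ethernet+IP+UDP+VXLAN outer headers
echo "wireguard_mtu=$(( base - 80 ))"   # WireGuard over IPv6, worst case
echo "ipsec_mtu=$(( base - 100 ))"      # conservative ESP tunnel-mode estimate
```

Deriving the values this way, instead of hard‑coding them, keeps the template correct if the base MTU changes (e.g. jumbo frames).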
Lessons Learned
Review CNI release notes before upgrades (Cilium 1.15 disabled PMTUD by default).
Include network‑layer metrics in monitoring dashboards.
Maintain a control environment for regression testing.
Document MTU configurations centrally to avoid ad‑hoc troubleshooting.
References
RFC 1191 – Path MTU Discovery
RFC 8899 – Packetization Layer Path MTU Discovery for Datagram Transports
Cilium Documentation – MTU Configuration
Calico Documentation – Configure MTU
Linux Kernel Documentation – ip‑sysctl.txt
Various cloud provider VPC networking guides
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.