
When MTU Misconfiguration Turns Into a Two‑Day Network Mystery

A two‑day investigation of intermittent packet loss in a hybrid‑cloud Kubernetes environment revealed that an oversized VXLAN MTU caused fragmentation, prompting a step‑by‑step analysis of MTU fundamentals, diagnostic commands, Cilium configuration changes, and best‑practice recommendations for cloud‑native networks.


In a hybrid‑cloud setup (on‑premises Kubernetes clusters connected to Alibaba Cloud and AWS), a live‑streaming service began experiencing intermittent video stutter and high packet loss after a new feature increased payload sizes. Initial metrics showed a TCP retransmission rate of 2.3% (normal <0.1%), a packet loss rate of 1.8%, and P99 latency jumping from 5 ms to 200 ms.

Standard checks of CPU, memory, disk I/O, and core switch health found nothing abnormal. The issue only appeared with large packets; small file transfers worked fine. A dmesg entry reported "eth0: dropped packet, size 1514 > 1500", pointing to an MTU mismatch.
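
Two quick checks surface this class of drop; the interface name eth0 simply mirrors the log entry above (a diagnostic suggestion, not part of the original write-up):

# Look for oversized-frame drops in the kernel log
dmesg | grep -iE 'mtu|dropped'
# Per-NIC drop counters (output is driver-dependent)
ethtool -S eth0 | grep -i drop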

MTU Basics

MTU (Maximum Transmission Unit) defines the largest packet a network device can carry. Common values include 1500 bytes for Ethernet II, 1492 bytes for PPPoE, 1476 bytes for GRE, 1450 bytes for VXLAN, ~1400 bytes for IPsec, and 9000 bytes for Jumbo Frames. MTU includes IP and TCP headers, while MSS (Maximum Segment Size) counts only the TCP payload (default 1460 bytes for Ethernet).
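
The relationship is simple arithmetic (assuming IPv4 and TCP headers with no options):

MSS = MTU - IP header (20 bytes) - TCP header (20 bytes)
Standard Ethernet: 1500 - 40 = 1460 bytes
Inside a 1450-byte VXLAN overlay: 1450 - 40 = 1410 bytes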

Path MTU Discovery (PMTUD) works by sending packets with the DF (Don't Fragment) flag set; a router that cannot forward such a packet replies with an ICMP "Fragmentation Needed" message carrying the next‑hop MTU. Many firewalls drop these ICMP messages, causing the well‑known "PMTUD black‑hole" problem.
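
A quick way to check the path MTU from a host is tracepath (part of iputils), or a DF‑flagged ping sized just above the suspected limit; the remote host below is a placeholder:

# Shows the discovered path MTU hop by hop (no root required)
tracepath <remote-host>
# Force the DF bit; a 1423-byte payload (+28 bytes of IP/ICMP headers) exceeds a 1450-byte path
ping -M do -c 1 -s 1423 <remote-host>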

Deep Dive Investigation

All interfaces reported MTU 1500. However, the VXLAN tunnel adds 50 bytes of overhead (Ethernet 14 + IP 20 + UDP 8 + VXLAN 8), making the effective MTU 1450. Since the pod network MTU was still 1500, packets larger than 1450 bytes had to be fragmented, and those with the DF bit set were dropped.

Verification steps:

# Check physical NIC MTU
ip link show eth0 | grep mtu
# Check Cilium host MTU
ip link show cilium_host | grep mtu
# Check pod MTU
kubectl exec -it test-pod -- ip link show eth0 | grep mtu

The cilium_host interface incorrectly showed MTU 1500. A tcpdump capture on the host filtered for ICMP type 3 code 4 confirmed "Fragmentation Needed" messages with MTU 1450.
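
The capture filter below is one way to watch for those messages; the interface name is an assumption:

# ICMP type 3 (Destination Unreachable), code 4 (Fragmentation Needed)
tcpdump -i eth0 -n 'icmp and icmp[0] == 3 and icmp[1] == 4'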

Cilium 1.15 changed the default PMTUD behavior, disabling automatic MTU mismatch handling. The previous version would have corrected the issue automatically.

Solution

Two primary fixes were applied:

Adjust the pod network MTU to 1450 and enable PMTUD in the Cilium ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  mtu: "1450"
  enable-pmtu-discovery: "true"

Restart the Cilium DaemonSet so the agents pick up the new MTU.
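
One way to do that, assuming the default DaemonSet name cilium in kube-system:

kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium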

Apply TCP MSS clamping via iptables as a safety net:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
# Or set a fixed MSS
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1400
iptables-save > /etc/iptables/rules.v4
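
To confirm the clamp actually matches traffic, the rule's packet counters can be checked (a sanity step, not part of the original fix):

iptables -t mangle -L FORWARD -n -v | grep TCPMSS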

After the changes, a test script confirmed successful transmission up to 1422 bytes (1450 ‑ 28 bytes IP/ICMP header):

#!/bin/bash
# mtu_test.sh -- walk payload sizes upward with the DF bit set until the first failure
# Usage: ./mtu_test.sh <target-ip>
TARGET_IP=$1
if [ -z "$TARGET_IP" ]; then
  echo "Usage: $0 <target-ip>" >&2
  exit 1
fi
for size in $(seq 1400 1500); do
  if ping -M do -c 1 -W 1 -s "$size" "$TARGET_IP" > /dev/null 2>&1; then
    echo "Size $size: OK"
  else
    echo "Size $size: FAIL <-- MTU boundary"
    break
  fi
done

Best‑Practice Recommendations

Standardize MTU across physical, virtual, and container layers (e.g., 1450 bytes for VXLAN environments).

Enable PMTUD and verify ICMP messages are not filtered by firewalls.

Document MTU settings in CMDB and include them in Ansible or other automation playbooks.

Monitor network‑layer metrics (packet drops, retransmissions) in Prometheus and set alerts (an example rule follows this list).

Perform gradual rollouts and keep a control cluster when upgrading CNI plugins.
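
As an illustration, a minimal Prometheus alerting rule built on the node_exporter TCP counters might look like this; the threshold and labels are assumptions, not values from the incident:

groups:
- name: network-mtu
  rules:
  - alert: HighTcpRetransmissionRate
    # Retransmitted segments as a share of outgoing segments over 5 minutes
    expr: rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) > 0.01
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "TCP retransmission rate above 1% on {{ $labels.instance }}"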

Tools & Scripts

Several diagnostic utilities were put together during the investigation:

Basic commands: ip -d link show, ip route get, ss -ti, cat /proc/net/snmp.

PMTUD test script (pmtud_test.sh) performing a binary search to find the maximum usable MTU; a sketch of the approach follows this list.

One‑click diagnosis script (mtu_diagnosis.sh) checking interface MTU, route cache, kernel parameters, drop statistics, and iptables MSS rules.

Health‑check script (mtu_health_check.sh) verifying interface MTU, packet drops, PMTUD functionality, and kernel settings.
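
A minimal sketch of that binary‑search idea, probing with DF‑flagged ICMP (an illustration, not the original pmtud_test.sh):

#!/bin/bash
# Binary search for the largest ICMP payload that passes without fragmentation
# Usage: ./pmtud_probe.sh <target-ip>
TARGET_IP=$1
LOW=1200                 # payload size known to fit
HIGH=1472                # 1500 - 28 bytes of IP/ICMP headers
while [ "$LOW" -lt "$HIGH" ]; do
  MID=$(( (LOW + HIGH + 1) / 2 ))
  if ping -M do -c 1 -W 1 -s "$MID" "$TARGET_IP" > /dev/null 2>&1; then
    LOW=$MID             # fits: search the upper half
  else
    HIGH=$((MID - 1))    # fragmentation needed: search the lower half
  fi
done
echo "Largest payload: $LOW bytes, path MTU ~ $((LOW + 28)) bytes"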

Advanced Scenarios

Guidance covers MTU handling in physical, virtual, container, cloud, SD‑WAN, GPU/RDMA, IPv6, and eBPF/XDP environments, including configuration snippets for Calico, Flannel, Cilium, Istio, and various cloud providers (AWS, Alibaba Cloud, Azure, GCP).

Production‑Level MTU Specification

# mtu-config-template.yaml
physical_network:
  datacenter:
    core_switch_mtu: 9000
    access_switch_mtu: 9000
    server_nic_mtu: 9000
  wan:
    mtu: 1500
virtualization:
  hypervisor:
    virtual_switch_mtu: 1500
    vm_nic_mtu: 1500
container_network:
  cni_mtu: 1450
  pod_mtu: 1450
tunnels:
  vxlan_mtu: 1450
  ipsec_mtu: 1400
  wireguard_mtu: 1420
cloud:
  aws_vpc_mtu: 9001
  aliyun_vpc_mtu: 1500
  gcp_vpc_mtu: 1460

A change‑management script (mtu_change_procedure.sh) outlines pre‑change checks, backup, gradual rollout, verification, and monitoring steps; the core idea is sketched below.
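
The essence of such a procedure can be condensed to a few lines; interface, target MTU, and peer address are parameters, and this is a sketch rather than the original script:

#!/bin/bash
# Sketch: change an interface MTU with automatic rollback on failure
# Usage: ./mtu_change_sketch.sh <interface> <new-mtu> <peer-ip>
IFACE=$1; NEW_MTU=$2; PEER_IP=$3
OLD_MTU=$(cat /sys/class/net/"$IFACE"/mtu)       # record the current value for rollback
ip link set dev "$IFACE" mtu "$NEW_MTU"
# Verify with a DF-flagged ping sized to the new MTU (payload = MTU - 28)
if ! ping -M do -c 3 -W 1 -s $((NEW_MTU - 28)) "$PEER_IP" > /dev/null 2>&1; then
  echo "Verification failed, rolling back to MTU $OLD_MTU" >&2
  ip link set dev "$IFACE" mtu "$OLD_MTU"
fi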

Lessons Learned

Review CNI release notes before upgrades (Cilium 1.15 disabled PMTUD by default).

Include network‑layer metrics in monitoring dashboards.

Maintain a control environment for regression testing.

Document MTU configurations centrally to avoid ad‑hoc troubleshooting.

References

RFC 1191 – Path MTU Discovery

RFC 8899 – Packetization Layer Path MTU Discovery for Datagram Transports

Cilium Documentation – MTU Configuration

Calico Documentation – Configure MTU

Linux Kernel Documentation – ip‑sysctl.txt

Various cloud provider VPC networking guides

Tags: Kubernetes, Network Troubleshooting, MTU, Cilium, Overlay Networks, PMTUD
Written by dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
