How to Diagnose Kubernetes Pod Network Failures: Models, Tools, and Real Cases
This article introduces a systematic approach to troubleshooting Kubernetes pod network issues, covering common failure types, essential diagnostic tools like tcpdump, nsenter, paping, and mtr, and detailed case studies that illustrate step‑by‑step analysis and resolution of real‑world connectivity problems.
Pod Network Anomalies
The article classifies pod network problems into categories such as unreachable network (ping fails), unreachable port (telnet fails), DNS resolution errors, and large packet loss, each with possible causes like firewall rules, routing misconfigurations, high system load, or MTU mismatches.
Common Diagnostic Tools
tcpdump
tcpdump captures packets on interfaces and can filter by host, port, protocol, etc.
tcpdump -i eth0 -nn host 220.181.57.216 and 10.0.0.1
nsenter
nsenter allows entering a container's network namespace to run commands like ifconfig or netstat when the container lacks those utilities.
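The PID given to nsenter -t is the container's main process as seen from the host. A minimal sketch of looking that PID up and entering the namespace in one step, assuming a Docker runtime and root access on the node; the helper name run_in_container_netns is hypothetical and not part of the original article:

```shell
# Hypothetical helper: run a host binary inside a container's network namespace.
# Assumes Docker is the container runtime and the caller is root on the node.
run_in_container_netns() {
    container="$1"
    shift
    # .State.Pid is the host PID of the container's main process (0 if not running).
    pid="$(docker inspect -f '{{.State.Pid}}' "$container")" || return 1
    if [ -z "$pid" ] || [ "$pid" -le 0 ]; then
        echo "container $container is not running" >&2
        return 1
    fi
    # -t selects the target process; -n enters only its network namespace,
    # so the tools come from the host while the network view is the container's.
    nsenter -t "$pid" -n "$@"
}
```

For example, run_in_container_netns my-container ip addr would list the container's interfaces using the host's ip binary (my-container is a placeholder name).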
nsenter -t 30858 -n ifconfig
paping
paping continuously pings a TCP port to test connectivity and packet loss.
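On a node where paping is not installed, a similar loss figure can be approximated with netcat. This is a substitute technique, not paping itself; the sketch assumes a netcat build that supports -z and -w, and the helper name tcp_probe_loss is hypothetical:

```shell
# Hypothetical helper: attempt N TCP connections to host:port and report
# how many failed. A rough stand-in for paping, not a replacement for it.
tcp_probe_loss() {
    host="$1"
    port="$2"
    count="${3:-10}"
    [ "$count" -gt 0 ] || return 1
    failed=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # -z: connect without sending data; -w 2: give up after 2 seconds.
        nc -z -w 2 "$host" "$port" >/dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "sent=$count failed=$failed loss=$((failed * 100 / count))%"
}
```

Unlike paping, this sketch does not report per-attempt latency; it only counts failed connection attempts.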
paping -p 80 -c 10 example.com
mtr
mtr combines traceroute and ping, showing loss percentage and latency per hop.
mtr -n google.com
Pod Network Troubleshooting Process
The workflow starts with confirming pod‑to‑pod communication, then checking service IP reachability, followed by inspecting CNI plugins, kube‑proxy rules, and finally capturing packets on relevant interfaces (veth, cni0, flannel) to pinpoint the failing node.
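The final capture step can be sketched as a loop over the interfaces a packet should cross. This is only an illustration: interface names differ per CNI plugin and node, so they are passed as arguments, and the helper name capture_pod_traffic is hypothetical. It assumes root access and that some traffic (for example a ping loop from the pod) is already running while it captures:

```shell
# Hypothetical helper: capture traffic for one pod IP on several interfaces in turn.
# Interface names (a veth, cni0, a flannel device) vary; pass the ones on your node.
capture_pod_traffic() {
    pod_ip="$1"
    shift
    for iface in "$@"; do
        echo "== $iface =="
        # Stop after 20 packets or 10 seconds, whichever comes first.
        timeout 10 tcpdump -i "$iface" -nn -c 20 host "$pod_ip" || true
    done
}
```

Called as capture_pod_traffic POD_IP cni0 flannel.1 (placeholder arguments), the first interface that shows no packets for the pod is a reasonable place to start looking for the drop.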
Case Study 1: Service Unreachable After Node Expansion
A newly added node could not reach a ClusterIP service (10.233.0.100:5000) while other nodes could. The CNI plugin and kube-proxy were both healthy, but packet captures showed mismatched source IPs: the node had both a static address and a DHCP-assigned address, which caused an IP conflict. The fix was to remove the duplicate DHCP configuration and restart Docker and kubelet.
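One way to spot this kind of double addressing is to list interfaces that carry more than one IPv4 address. The sketch below parses the one-line-per-address output of ip -4 -o addr show; the helper name ifaces_with_multiple_ipv4 is hypothetical, and it only covers the case where both addresses sit on the same interface, which the original case does not state explicitly:

```shell
# Hypothetical helper: read `ip -4 -o addr show` output on stdin and print
# every interface that carries more than one IPv4 address.
ifaces_with_multiple_ipv4() {
    awk '$3 == "inet" {
             count[$2]++
             addrs[$2] = addrs[$2] " " $4
         }
         END {
             for (ifc in count)
                 if (count[ifc] > 1) print ifc ":" addrs[ifc]
         }'
}
```

Usage on a node: ip -4 -o addr show | ifaces_with_multiple_ipv4. In the raw ip output, an address obtained via DHCP is normally flagged as dynamic, which helps tell it apart from the static one.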
Case Study 2: External Host Timeout to NodePort Service
An external VM could telnet to the NodePort, but HTTP requests timed out. Wireshark showed a successful TCP handshake, but large packets (>1400 bytes) were retransmitted repeatedly. The cause was an MTU mismatch between the VM (1500) and the Calico tunnel (1440), which led to fragmentation issues. Adjusting the MTU on the VM or in Calico resolved the problem.
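The numbers in this case can be checked with simple header arithmetic: an IPv4 packet that fits a given MTU leaves MTU minus 28 bytes for ICMP payload and MTU minus 40 bytes for TCP payload (assuming no IP or TCP options). A minimal sketch, with hypothetical helper names, that also probes a path using Linux iputils ping with the Don't Fragment bit set:

```shell
# IPv4 header (20 bytes) + ICMP header (8 bytes) must fit inside the MTU.
max_icmp_payload() {
    echo $(( $1 - 28 ))
}

# IPv4 header (20 bytes) + TCP header without options (20 bytes).
max_tcp_payload() {
    echo $(( $1 - 40 ))
}

# Hypothetical helper: ping with Don't Fragment set, sized to exactly fill
# the given MTU. Assumes Linux iputils ping (-M do forbids fragmentation).
probe_mtu() {
    host="$1"
    mtu="$2"
    ping -M do -c 3 -s "$(max_icmp_payload "$mtu")" "$host"
}
```

For an MTU of 1440 the TCP payload limit works out to 1400 bytes, which is consistent with the retransmissions seen for packets larger than 1400 bytes. If probe_mtu HOST 1440 succeeds while probe_mtu HOST 1500 fails, the path likely cannot carry full 1500-byte packets.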
Case Study 3: Pod DNS Failure Accessing Object Storage
Pods could reach the object-storage IP but failed to resolve its domain name. DNS queries to the cluster DNS succeeded, but upstream DNS lookups timed out. The root cause was that kube-proxy pods on newly added nodes were being evicted because they lacked a priority class, which broke service DNS on those nodes. Assigning the system-node-critical priority class to kube-proxy restored DNS functionality.
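Checking and setting the priority class can be sketched with kubectl. The sketch assumes kube-proxy runs as a DaemonSet named kube-proxy in the kube-system namespace (the kubeadm layout); the original article does not say how it was deployed, and both helper names are hypothetical:

```shell
# Hypothetical helper: succeed only if the kube-proxy DaemonSet uses
# the system-node-critical priority class.
# Assumes a DaemonSet named kube-proxy in kube-system (kubeadm layout).
kube_proxy_is_node_critical() {
    pc="$(kubectl -n kube-system get daemonset kube-proxy \
        -o jsonpath='{.spec.template.spec.priorityClassName}')" || return 1
    echo "priorityClassName=${pc:-<unset>}"
    [ "$pc" = "system-node-critical" ]
}

# Hypothetical helper: set the priority class on the pod template,
# which triggers a rollout of the DaemonSet's pods.
set_kube_proxy_node_critical() {
    kubectl -n kube-system patch daemonset kube-proxy --type merge \
        -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
}
```

A cautious sequence would be kube_proxy_is_node_critical || set_kube_proxy_node_critical, followed by confirming that kube-proxy pods are running on the newly added nodes.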
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
