How to Diagnose and Resolve Common Pod Network Issues in Kubernetes
This guide outlines a systematic approach to troubleshooting Kubernetes pod network problems, covering failure models, essential tools like tcpdump, nsenter, paping, and mtr, and detailed case studies that illustrate root cause analysis and remediation steps.
The article introduces a structured methodology for diagnosing network anomalies in Kubernetes clusters. It begins by classifying pod network failures into six categories: network unreachable, port unreachable, DNS resolution errors, loss of large packets, CNI plugin issues, and other CNI-related problems. For each category it lists typical symptoms (e.g., ping failures, telnet failures, DNS lookup failures) and common causes such as firewall rules, routing misconfigurations, overloaded interfaces, or misconfigured CNI components.
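As a rough illustration of how symptoms map onto the first three categories, the following sketch probes a target with the Python standard library. The function name classify and the returned labels are hypothetical, and the mapping is deliberately coarse: a timeout, for instance, can come from a firewall, a routing problem, or a CNI fault, so the result is a starting point for investigation rather than a diagnosis.

```python
import socket


def classify(host, port, timeout=2.0):
    """Coarsely map a TCP probe result onto a failure category.

    The labels returned here are illustrative names, not terms
    defined by Kubernetes or by the original article.
    """
    try:
        # Name resolution comes first: a failure here points at DNS.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-resolution-error"

    # Use the first resolved address; sockaddr[:2] is (ip, port)
    # for both IPv4 and IPv6 results.
    address = infos[0][4][:2]
    try:
        with socket.create_connection(address, timeout=timeout):
            return "reachable"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "port-unreachable"
    except OSError:
        # Timeouts, "no route to host", "network is unreachable", etc.
        return "network-unreachable-or-filtered"
```

Calling classify("10.233.0.100", 5000) from a node would probe the registry ClusterIP used in the first case study; the port 5000 is an assumption, since the article does not state the service port.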
Common Network Troubleshooting Tools
Key utilities are described with installation commands for various Linux distributions and example usage syntax:
tcpdump – captures packets and filters them by host, port, network, size, or TCP flags; saves captures to PCAP files with -w; filters can be combined with logical operators.
nsenter – enters a container’s network namespace from the host using nsenter -t <pid> -n <command>, which is useful when the container lacks sudo or networking tools.
paping – continuous TCP ping that reports port reachability and loss rate; needs extra dependencies on RedHat/CentOS (yum install -y libstdc++.i686 glibc.i686) and no extra packages on Debian/Ubuntu.
mtr – combines traceroute and ping to show loss percentage and latency statistics; supports options such as -n (numeric IP), -b (both host and IP), -c (count), -r (report), -m (max hops), and -s (packet size).
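To show how tcpdump and nsenter fit together, here is a minimal sketch that only builds the command lines and does not run them. The helper names tcpdump_cmd and in_netns are hypothetical, as are the PID 12345 and port 5000 in the usage line; the flags used (-nn, -i, -w for tcpdump and -t, -n for nsenter) are standard options of those tools.

```python
import shlex


def tcpdump_cmd(interface, host=None, port=None, pcap_path=None):
    """Build a tcpdump argument list; filters are joined with 'and'."""
    args = ["tcpdump", "-nn", "-i", interface]
    if pcap_path:
        # -w writes raw packets to a PCAP file for later analysis.
        args += ["-w", pcap_path]
    filters = []
    if host:
        filters.append(f"host {host}")
    if port:
        filters.append(f"port {port}")
    if filters:
        args.append(" and ".join(filters))
    return args


def in_netns(pid, command):
    """Prefix a command so it runs in the network namespace of `pid`."""
    return ["nsenter", "-t", str(pid), "-n"] + list(command)


# Hypothetical values: container PID 12345, registry port 5000.
capture = tcpdump_cmd("eth0", host="10.233.65.46", port=5000)
print(shlex.join(in_netns(12345, capture)))
# prints: nsenter -t 12345 -n tcpdump -nn -i eth0 'host 10.233.65.46 and port 5000'
```

Because the capture runs through nsenter, tcpdump only has to be installed on the host, not inside the container image.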
Pod Network Troubleshooting Workflow
The workflow diagram (illustrated in the original article) guides investigators from symptom identification through tool selection, packet capture points (node, veth, CNI device, host NIC), and analysis of NAT/DNAT behavior.
Case Study 1: Service Unreachable After Node Expansion
Environment: Kubernetes with Flannel VXLAN, kube-proxy in iptables mode, a registry service (ClusterIP 10.233.0.100, backend pod IP 10.233.65.46). After a new node (10.153.204.15) was added, that node could not reach the service’s ClusterIP, while the other nodes could.
Checked kube-proxy pod status and logs – no errors.
Examined iptables NAT rules – appeared correct.
Verified routing to the backend pod’s subnet – present.
Captured packets on the sending node, the flannel interfaces, and the backend node. The captures showed the wrong source IP (10.153.204.228 instead of 10.153.204.15): the node held both a static IP and a DHCP‑assigned IP, and the two conflicted.
Resolution: removed the DHCP configuration from /etc/sysconfig/network-scripts/ifcfg-enp26s0f0 (set BOOTPROTO="none") and restarted Docker and kubelet.
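A condition like this one can be spotted before capturing packets by listing the IPv4 addresses on each interface. The sketch below parses the one-line-per-address output of ip -4 -o addr show. The SAMPLE text is illustrative only: the /24 prefix, the trimmed lifetime fields, and the choice of 10.153.204.228 as the DHCP (dynamic) address are assumptions, not output taken from the article.

```python
from collections import defaultdict

# Illustrative, trimmed output of `ip -4 -o addr show` (assumed values).
SAMPLE = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: enp26s0f0    inet 10.153.204.15/24 brd 10.153.204.255 scope global enp26s0f0
2: enp26s0f0    inet 10.153.204.228/24 brd 10.153.204.255 scope global secondary dynamic enp26s0f0
"""


def ipv4_by_interface(ip_output):
    """Return {interface: [(address, is_dynamic), ...]} from `ip -o` text."""
    addrs = defaultdict(list)
    for line in ip_output.splitlines():
        fields = line.split()
        if "inet" not in fields:
            continue
        iface = fields[1]
        cidr = fields[fields.index("inet") + 1]
        # `ip` marks addresses obtained with a lease as "dynamic".
        addrs[iface].append((cidr.split("/")[0], "dynamic" in fields))
    return dict(addrs)


def interfaces_with_multiple_addresses(ip_output):
    """Keep only interfaces that carry more than one IPv4 address."""
    return {
        iface: found
        for iface, found in ipv4_by_interface(ip_output).items()
        if len(found) > 1
    }


print(interfaces_with_multiple_addresses(SAMPLE))
# prints: {'enp26s0f0': [('10.153.204.15', False), ('10.153.204.228', True)]}
```

An interface that carries both a static and a dynamic address is not always a fault, so a hit here is a prompt to compare the addresses with the node IP registered in the cluster.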
Case Study 2: External Cloud Host Times Out Accessing Cluster Service
Scenario: A cloud VM attempts an HTTP POST to a NodePort service but times out. Telnet to the IP:port succeeds, indicating basic connectivity.
Wireshark capture shows TCP three‑way handshake succeeds, but large packets (≈1514 bytes) are repeatedly retransmitted without ACK.
MTU mismatch identified: the VM’s NIC MTU is 1500, while the Calico tunnel interface MTU is 1440, so packets sized for the 1500-byte MTU exceed the tunnel MTU and are lost.
Resolution: align MTU values by setting the VM’s NIC MTU to 1440 or adjusting Calico’s MTU to 1500.
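The numbers in this case can be checked with a few lines of arithmetic. The sketch below assumes IPv4 and TCP headers without options (20 bytes each), a 14-byte Ethernet header, and a 20-byte outer IPv4 header for IP-in-IP encapsulation; the function names are hypothetical. Under those assumptions, a 1514-byte frame in a capture corresponds to a full 1500-byte IP packet.

```python
ETHERNET_HEADER = 14   # bytes, untagged Ethernet II
IPV4_HEADER = 20       # bytes, no IP options
TCP_HEADER = 20        # bytes, no TCP options
IPIP_OVERHEAD = 20     # bytes, outer IPv4 header added by IP-in-IP


def max_tcp_payload(mtu):
    """Largest TCP payload (MSS) that fits in one packet of size `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


def frame_size(ip_packet_len):
    """On-the-wire frame length as a capture tool typically reports it."""
    return ip_packet_len + ETHERNET_HEADER


def underlay_mtu_needed(tunnel_mtu):
    """MTU the physical network must carry for a given IP-in-IP tunnel MTU."""
    return tunnel_mtu + IPIP_OVERHEAD


print(frame_size(1500))           # prints: 1514
print(max_tcp_payload(1500))      # prints: 1460
print(max_tcp_payload(1440))      # prints: 1400
print(1500 <= 1440)               # prints: False (packet exceeds tunnel MTU)
print(underlay_mtu_needed(1500))  # prints: 1520
```

The last line shows why raising the tunnel MTU to 1500 only works if the underlying network can carry 1520-byte packets; where it cannot, lowering the sender's MTU is the safer of the two options.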
Case Study 3: Pods Cannot Reach Object Storage by DNS
Environment: Public‑cloud Kubernetes with Calico IP‑IP mode, services exposed via NodePort, and pods using the cluster DNS which forwards to an upstream DNS.
nsenter into the pod shows IP connectivity to the storage endpoint but DNS lookup fails ("unknown server name").
Telnet to storage ports works; paping shows no loss.
Investigation reveals that newly created nodes have kube‑proxy pods stuck in Pending, preventing service IP routing, including DNS.
Root cause: kube‑proxy pods lacked the highest priority class, so they were evicted under resource pressure.
Resolution: assign system-node-critical priority class to kube‑proxy and add readiness probes to application pods.
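A check for this condition can be scripted against the JSON that kubectl get pods -n kube-system -o json produces. The following sketch filters that document for kube-proxy pods in the Pending phase and reports their priority class. The function name pending_pods and the pod names in SAMPLE are hypothetical; the fields read (metadata.name, spec.priorityClassName, status.phase) are standard Pod fields.

```python
import json

# Hypothetical, heavily trimmed pod list for illustration.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "kube-proxy-abcde"},
            "spec": {"priorityClassName": "system-node-critical"},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "kube-proxy-fghij"},
            "spec": {},
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {"name": "coredns-12345"},
            "spec": {},
            "status": {"phase": "Pending"},
        },
    ]
})


def pending_pods(pod_list_json, name_prefix="kube-proxy"):
    """Return name and priority class of Pending pods matching the prefix."""
    doc = json.loads(pod_list_json)
    result = []
    for pod in doc.get("items", []):
        name = pod["metadata"]["name"]
        if not name.startswith(name_prefix):
            continue
        if pod.get("status", {}).get("phase") == "Pending":
            result.append({
                "name": name,
                # None means no priority class is set on the pod.
                "priorityClassName": pod.get("spec", {}).get("priorityClassName"),
            })
    return result


print(pending_pods(SAMPLE))
# prints: [{'name': 'kube-proxy-fghij', 'priorityClassName': None}]
```

A Pending kube-proxy pod with no priority class matches the root cause described above, while a pod that already has system-node-critical set would point to a different scheduling problem.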
Overall, the article emphasizes the importance of understanding failure models, using the right packet‑capture points, and correlating tool output with cluster configuration to pinpoint and fix network issues.
dbaplus Community