How to Diagnose and Resolve Common Pod Network Issues in Kubernetes

This guide outlines a systematic approach to troubleshooting Kubernetes pod network problems, covering failure models, essential tools like tcpdump, nsenter, paping, and mtr, and detailed case studies that illustrate root cause analysis and remediation steps.

dbaplus Community

Sep 6, 2023

The article introduces a structured methodology for diagnosing network anomalies in Kubernetes clusters, beginning with a classification of pod network failures into six categories: network unreachable, port unreachable, DNS resolution errors, large packet loss, CNI plugin issues, and other CNI-related problems. Each category lists typical symptoms (e.g., ping failures, telnet failures, DNS lookup failures) and common causes such as firewall rules, routing misconfigurations, overloaded interfaces, or misconfigured CNI components.

Common Network Troubleshooting Tools

Key utilities are described with installation commands for various Linux distributions and example usage syntax:

tcpdump: captures packets and filters them by host, port, network, size, or TCP flags; saves captures to PCAP files with -w; filters can be combined with logical operators.

nsenter: enters a container’s network namespace from the host using nsenter -t <pid> -n <command>, useful when the container lacks sudo or networking tools.

paping: continuous TCP ping for port reachability and loss rate; install dependencies on RedHat/CentOS (yum install -y libstdc++.i686 glibc.i686) or Debian/Ubuntu (no extra packages).

mtr: combines traceroute and ping, shows loss percentage and latency statistics, and supports options such as -n (numeric IP), -b (both host and IP), -c (count), -r (report), -m (max hops), and -s (packet size).
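To make the TCP-ping idea behind paping concrete, here is a minimal Python sketch that repeatedly opens a TCP connection to a port and reports the loss rate. It is not paping's implementation; the function names tcp_ping and loss_rate are hypothetical, and it measures connection setup time only.

```python
import socket
import time


def tcp_ping(host, port, count=4, timeout=1.0):
    """Try `count` TCP connections; return a latency in ms, or None, per attempt."""
    results = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            # create_connection completes the TCP three-way handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Refused, timed out, or unreachable: count it as a lost probe.
            results.append(None)
    return results


def loss_rate(results):
    """Fraction of failed probes, between 0.0 and 1.0."""
    if not results:
        return 0.0
    return sum(r is None for r in results) / len(results)

# Example call (the port number is an assumption, not taken from the article):
#   probes = tcp_ping("10.233.65.46", 5000)
#   print(probes, loss_rate(probes))
```

Because it needs only the Python standard library, a sketch like this can also be run inside a pod's network namespace via nsenter when paping is not installed on the host.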

Pod Network Troubleshooting Workflow

The workflow diagram (illustrated in the original article) guides investigators from symptom identification through tool selection, packet capture points (node, veth, CNI device, host NIC), and analysis of NAT/DNAT behavior.

Case Study 1: Service Unreachable After Node Expansion

Environment: Kubernetes with Flannel VXLAN, kube-proxy in iptables mode, a registry service (ClusterIP 10.233.0.100, backend pod IP 10.233.65.46). After adding a new node (10.153.204.15), the node could not reach the service’s ClusterIP, while other nodes worked.

Checked kube-proxy pod status and logs – no errors.

Examined iptables NAT rules – appeared correct.

Verified routing to the backend pod’s subnet – present.

Captured packets on the sending node, the flannel interfaces, and the backend node. The captures showed the wrong source IP (10.153.204.228 instead of 10.153.204.15): the node held both a static IP and a DHCP‑assigned IP, which caused an IP conflict.

Resolution: removed the DHCP configuration from /etc/sysconfig/network-scripts/ifcfg-enp26s0f0 (set BOOTPROTO="none") and restarted Docker and kubelet.
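A quick way to spot this kind of problem without a packet capture is to ask the kernel which source address it would choose for a given destination. The Python sketch below does that with a connected UDP socket. It assumes the wrong address comes from ordinary kernel source-address selection; the helper names are hypothetical, and a VXLAN device may be pinned to a specific local address, so treat the result as a hint rather than proof.

```python
import socket


def source_ip_for(dest_ip, dest_port=53):
    """Return the local address the kernel would use to reach dest_ip."""
    # connect() on a UDP socket sends no packets; it only triggers the
    # route lookup and binds the socket to a local source address.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_ip, dest_port))
        return s.getsockname()[0]


def check_source_ip(dest_ip, expected_ip):
    """Return (matches, actual) for the source address chosen toward dest_ip."""
    actual = source_ip_for(dest_ip)
    return actual == expected_ip, actual

# Example call on the new node, passing the backend node's address as dest_ip:
#   ok, actual = check_source_ip(dest_ip, "10.153.204.15")
#   An `actual` of 10.153.204.228 would point at the DHCP-assigned address.
```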

Case Study 2: External Cloud Host Times Out Accessing Cluster Service

Scenario: A cloud VM attempts an HTTP POST to a NodePort service but times out. Telnet to the IP:port succeeds, indicating basic connectivity.

Wireshark capture shows TCP three‑way handshake succeeds, but large packets (≈1514 bytes) are repeatedly retransmitted without ACK.

MTU mismatch identified: the VM’s NIC MTU is 1500, while the Calico tunnel interface MTU is 1440, so full-size packets from the VM exceed what the tunnel can carry and are lost.

Resolution: align MTU values by setting the VM’s NIC MTU to 1440 or adjusting Calico’s MTU to 1500.
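The arithmetic behind this case can be sketched in a few lines of Python. The sketch assumes untagged Ethernet frames and IPv4 and TCP headers without options, so a 1514-byte frame in the capture carries a 1500-byte IP packet; the function names are illustrative only.

```python
ETHERNET_HEADER = 14  # bytes, untagged Ethernet II header
IPV4_HEADER = 20      # bytes, assuming no IP options
TCP_HEADER = 20       # bytes, assuming no TCP options


def ip_packet_size(frame_len):
    """Size of the IP packet inside a captured Ethernet frame."""
    return frame_len - ETHERNET_HEADER


def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one packet at this MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits(frame_len, path_mtu):
    """True if the packet inside the frame can cross a link with path_mtu."""
    return ip_packet_size(frame_len) <= path_mtu


print(ip_packet_size(1514))   # IP packet size sent by the VM
print(fits(1514, 1440))       # can it cross the Calico tunnel interface?
print(max_tcp_payload(1440))  # payload size that would fit the tunnel
```

Under these assumptions the small packets of the handshake fit through the tunnel while the full-size POST segments do not, which matches the symptom of a successful telnet followed by a request that times out.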

Case Study 3: Pods Cannot Reach Object Storage by DNS

Environment: Public‑cloud Kubernetes with Calico IP‑IP mode, services exposed via NodePort, and pods using the cluster DNS which forwards to an upstream DNS.

nsenter into the pod shows IP connectivity to the storage endpoint but DNS lookup fails ("unknown server name").

Telnet to storage ports works; paping shows no loss.

Investigation reveals that newly created nodes have kube‑proxy pods stuck in Pending, preventing service IP routing, including DNS.

Root cause: kube‑proxy pods were not assigned the highest priority class, so under resource pressure they were evicted and could not be rescheduled.

Resolution: assign system-node-critical priority class to kube‑proxy and add readiness probes to application pods.
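The first diagnostic step in this case, separating "the name does not resolve" from "the endpoint is unreachable", can be sketched in Python as follows. The names resolve and classify and the returned labels are hypothetical; the resolver is a parameter so the check can be exercised without a live DNS server.

```python
import socket


def resolve(hostname):
    """Return the sorted IPv4 addresses for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def classify(hostname, known_ip, port, timeout=2.0, resolver=resolve):
    """Label the failure as 'network-failure', 'dns-failure', or 'ok'."""
    try:
        # Step 1: test plain IP connectivity, bypassing DNS entirely.
        with socket.create_connection((known_ip, port), timeout=timeout):
            pass
    except OSError:
        return "network-failure"
    # Step 2: the IP path works, so a failed lookup points at DNS.
    if not resolver(hostname):
        return "dns-failure"
    return "ok"
```

In the scenario above, such a check run from inside the pod's network namespace would be expected to report a DNS failure, which directs attention to the path toward the cluster DNS service rather than to the storage endpoint itself.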

Overall, the article emphasizes the importance of understanding failure models, using the right packet‑capture points, and correlating tool output with cluster configuration to pinpoint and fix network issues.
