
How to Diagnose Kubernetes Pod Network Issues: Tools, Models, and Real‑World Cases

This article introduces a systematic approach for troubleshooting Kubernetes pod network problems, covering common failure models, essential diagnostic tools such as tcpdump, nsenter, paping and mtr, and detailed case studies that illustrate step‑by‑step analysis and resolution techniques.


1. Pod Network Anomalies

Network anomalies can be classified into several categories:

Network unreachable – ping fails, caused by firewall rules, incorrect routing, high system load, or link failures.

Port unreachable – ping works but telnet fails, caused by firewall restrictions, high load, or the application not listening.

DNS resolution failure – domain names cannot be resolved while IP connectivity works, caused by incorrect pod DNS settings, DNS service issues, or communication problems with the DNS service.

Large packet loss – small packets succeed but large packets are dropped, often due to MTU mismatches; you can test with <code>ping -s &lt;size&gt;</code>.
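As a sketch of that large-packet test, the ICMP payload size that exactly fills a given MTU can be computed and then probed with the don't-fragment bit set (the 1500-byte MTU and the target address below are assumptions, not values from a real cluster):

```shell
# Payload that exactly fills an assumed 1500-byte MTU:
# payload = MTU - 20 (IP header) - 8 (ICMP header)
MTU=1500
PAYLOAD=$((MTU - 28))
echo "probe payload: ${PAYLOAD} bytes"
# Probe with don't-fragment set; if this fails while smaller payloads
# succeed, the path MTU is lower than assumed (needs a live network):
# ping -c 3 -M do -s "${PAYLOAD}" 10.0.0.1
```

With a 1500-byte MTU the probe payload works out to 1472 bytes; anything larger must be fragmented or dropped.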

CNI plugin issues – node can communicate but pods cannot reach cluster addresses, often due to kube‑proxy or CIDR exhaustion.

The overall classification is illustrated in the following diagram:

[Diagram: classification of pod network anomalies]

In summary, the most common pod network failures are network unreachable, port unreachable, DNS resolution errors, and large‑packet loss.
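These failure classes can be checked in order from inside an affected pod. The sketch below assumes ping, nc, and nslookup exist in the pod image, and the target address passed in is a placeholder service IP:

```shell
# Layered triage mirroring the failure classes above (a sketch, not a
# complete diagnostic). Usage: triage 10.96.0.10
triage() {
  target=$1
  ping -c 2 "$target"                # network unreachable?
  nc -zv -w 3 "$target" 53           # port unreachable? (53 as an example port)
  nslookup kubernetes.default        # DNS resolution failure?
  ping -c 2 -s 1472 -M do "$target"  # large-packet loss / MTU mismatch?
}
```

Each step maps to one failure class, so the first step that fails tells you which branch of the classification to pursue.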

2. Common Network Diagnostic Tools

tcpdump

tcpdump is a powerful packet capture tool. Installation commands:

Ubuntu/Debian:

apt-get install -y tcpdump

CentOS/Fedora:

yum install -y tcpdump

Alpine:

apk add tcpdump --no-cache

Typical usage examples:

<code>tcpdump -D                      # list available capture interfaces</code>
<code>tcpdump host 1.1.1.1            # traffic to or from 1.1.1.1</code>
<code>tcpdump src 1.1.1.1             # traffic from 1.1.1.1 (use dst for the reverse)</code>
<code>tcpdump net 1.2.3.0/24          # traffic to or from a subnet</code>
<code>tcpdump -c 1 -X icmp            # capture one ICMP packet with a hex/ASCII dump</code>
<code>tcpdump port 3389               # traffic on port 3389</code>
<code>tcpdump portrange 21-23         # traffic on ports 21 through 23</code>
<code>tcpdump less 32                 # packets smaller than 32 bytes</code>
<code>tcpdump greater 64              # packets larger than 64 bytes</code>
<code>tcpdump -w capture_file        # write raw packets to a file</code>

Logical operators can be combined, e.g.:

<code>tcpdump -i eth0 -nn host 220.181.57.216 and 10.0.0.1</code>
<code>tcpdump -i eth0 -nn host 220.181.57.216 or 10.0.0.1</code>
<code>tcpdump -i eth0 -nn 'host 10.0.0.1 and (10.0.0.9 or 10.0.0.3)'</code>

(The parentheses must be quoted so the shell does not interpret them.)

TCP flag filters (RST, SYN, ACK, etc.) are also supported.
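For example, isolating SYN or RST segments uses the standard pcap flag tests; the interface name below is an assumption, and the capture commands themselves need root on the node:

```shell
# pcap filters for specific TCP flags:
syn_only='tcp[tcpflags] & tcp-syn != 0'   # new connection attempts
rst_only='tcp[tcpflags] & tcp-rst != 0'   # reset-terminated/refused connections
# Run as root on the node (interface name is an assumption):
# tcpdump -i eth0 -nn "$syn_only"
# tcpdump -i eth0 -nn "$rst_only"
echo "$syn_only"
```

A burst of RST packets toward a service is a quick tell that something is actively refusing connections rather than silently dropping them.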

nsenter

nsenter allows entering a process’s network namespace. Example syntax:

<code>nsenter -t &lt;pid&gt; -n &lt;command&gt;</code>

To inspect a pod’s network from the host:

<code># Find the PID of the pod's main process (this example container runs tail)</code>
<code>ps -ef | grep tail</code>
<code># Enter that process's network namespace and inspect its interfaces</code>
<code>nsenter -t 30858 -n ifconfig</code>
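Grepping ps is fragile: it matches the grep process itself and assumes you know the container's command line. A sketch of two alternatives follows; the crictl path assumes a containerd runtime and the container ID is a placeholder taken from `crictl ps`, while the pgrep variant assumes a hypothetical pod whose main process is `tail -f /dev/null`:

```shell
# Ask the container runtime for the PID directly (containerd/crictl,
# requires a live node; CONTAINER_ID is a placeholder):
# PID=$(crictl inspect --output go-template --template '{{.info.pid}}' "$CONTAINER_ID")
# nsenter -t "$PID" -n ip addr
# pgrep variant that avoids matching the grep process itself:
PID=$(pgrep -f 'tail -f /dev/null' | head -n1)
echo "target pid: ${PID:-not found}"
```

Either way, once the PID is known, `nsenter -t "$PID" -n` runs any host tool inside the pod's network namespace.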

paping

paping continuously pings a target TCP port, useful for testing connectivity and packet loss.

<code>paping -p 80 -c 10 example.com</code>

Installation dependencies vary by OS (e.g., libstdc++.i686 on RHEL/CentOS).

mtr

mtr combines traceroute and ping, providing loss percentage, latency statistics, and more.

<code>mtr google.com</code>
<code>mtr -n google.com</code>
<code>mtr -b google.com</code>
<code>mtr -c 5 google.com</code>

Key columns: Loss%, Snt, Last, Avg, Best, Wrst, StDev. Loss at an intermediate hop may be nothing more than ICMP rate limiting on that router; loss that persists through to the final hop indicates a real problem, and a high StDev suggests unstable latency.
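A report-mode capture (`mtr -r -n`) can also be parsed after the fact. The sketch below runs awk over a saved report; the hop addresses and loss figures are illustrative sample data, not real measurements:

```shell
# Saved `mtr -r -n` report (illustrative data only):
cat <<'EOF' > /tmp/mtr-report.txt
HOST: node1                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.0.1                   0.0%     5    0.4   0.5   0.4   0.7   0.1
  2.|-- 203.0.113.1               20.0%     5    8.2   9.1   8.0  12.3   1.8
EOF
# Field 2 is the hop address, field 3 is Loss%; print any hop reporting loss:
awk 'NR > 1 && $3 + 0 > 0 {print $2, "loss:", $3}' /tmp/mtr-report.txt
```

On the sample above this prints only the second hop, the one reporting 20% loss.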

Tips: For more network tools, refer to additional resources.

3. Pod Network Troubleshooting Workflow

The troubleshooting process follows the diagram below:

[Diagram: pod network troubleshooting idea]

4. Case Studies

Node Expansion – Service Unreachable

After adding a new worker node, the node could not reach the ClusterIP of a registry service, while other nodes worked fine.

Investigation steps:

Verified CNI plugin (flannel vxlan) and kube‑proxy (iptables) were functioning.

Confirmed the registry pod itself was reachable.

Checked iptables NAT rules – they were correct.

Examined routing tables; the problematic node had two IP addresses on the same NIC (static + DHCP), causing IP conflict.

Resolution: Removed the DHCP configuration (set BOOTPROTO="none"), then restarted Docker and kubelet.
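A quick way to spot this condition is to count IPv4 addresses per interface. The sketch below runs awk over saved `ip -4 -o addr show` output; the addresses are illustrative, not taken from the affected node:

```shell
# Saved one-line-per-address output of `ip -4 -o addr show` (illustrative):
cat <<'EOF' > /tmp/addrs.txt
1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0
2: eth0    inet 10.0.0.77/24 brd 10.0.0.255 scope global dynamic eth0
EOF
# Field 2 is the interface name; flag any NIC carrying more than one address:
awk '{count[$2]++} END {for (d in count) if (count[d] > 1) print d}' /tmp/addrs.txt
```

In the sample, eth0 carries both a static and a dynamic address, the same static-plus-DHCP conflict found in this case.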

External Cloud Host – Timeout

A cloud VM could telnet to a NodePort service but HTTP POST requests timed out.

Analysis revealed large packets (>1400 bytes) were repeatedly retransmitted due to MTU mismatch (host MTU 1500 vs. Calico tunnel MTU 1440).

Fix: Align MTU values by setting the host NIC to 1440 or adjusting Calico’s MTU to 1500.
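The 60-byte gap between 1500 and 1440 is encapsulation overhead. The arithmetic below uses typical per-encapsulation overheads, not values confirmed from the original cluster:

```shell
# Tunnel MTU = host MTU - encapsulation overhead (typical figures):
HOST_MTU=1500
echo "IP-in-IP:  $((HOST_MTU - 20))"   # 20-byte outer IP header
echo "VXLAN:     $((HOST_MTU - 50))"   # outer IP + UDP + VXLAN headers
echo "WireGuard: $((HOST_MTU - 60))"   # matches the 1440 seen in this case
```

Whichever side you adjust, the rule is the same: the tunnel MTU plus its overhead must not exceed the smallest MTU on the physical path.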

Pod Accessing Object Storage – DNS Timeout

Pods could reach the storage IP but failed DNS resolution for the storage domain.

Root cause: kube‑proxy pods on newly added nodes were pending because they lacked the highest priority class; when resources were scarce, kube‑proxy was evicted, breaking service/DNS access.

Solution: Assign the <code>system-node-critical</code> priority class to kube‑proxy and add readiness probes for dependent pods.
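As a sketch, assuming kube-proxy runs as a DaemonSet in kube-system (the layout kubeadm produces), the priority class can be set with a merge patch; actually applying it requires a live cluster and cluster-admin access:

```shell
# Patch body: priorityClassName belongs in the pod template spec.
PATCH='{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
# Apply against a live cluster (requires cluster-admin):
# kubectl -n kube-system patch daemonset kube-proxy -p "$PATCH"
echo "$PATCH"
```

With that class set, the scheduler will evict lower-priority pods rather than leave kube-proxy pending, which is exactly the behavior this incident was missing.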

Source: https://www.cnblogs.com/Cylon/p/16611503.html

Tags: Kubernetes, Network Troubleshooting, iptables, CNI, Pod, tcpdump, nsenter
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.