
Why kubectl get nodes stalls on one master: diagnosing Flannel ARP loss

On one master of a Kubernetes v1.14.8 cluster, kubectl get nodes took about 45 seconds because that master's flannel.1 MAC address was missing from the other nodes' ARP/FDB tables; restarting the canal CNI pods refreshed the entries and restored sub-second responses.


Problem Description

On a Kubernetes v1.14.8 cluster, executing

kubectl get nodes

on master1 takes about 45 seconds, while the same command on other masters returns instantly. Logs from the kube‑apiserver show timeouts when trying to reach the metrics‑server, and master1 cannot ping other nodes' flannel.1 IPs.

Investigation Result

The root cause was that the MAC address for master1’s flannel.1 interface was missing in the ARP/FDB tables of the other nodes, preventing cross‑host pod communication.

Investigation Process

Environment Information

Kubernetes v1.14.8, deployed as static pods (apiserver, controller, scheduler, proxy). CNI: canal (flannel‑vxlan) deployed as a DaemonSet. Masters: 192.168.1.140‑142; Nodes: 192.168.1.143‑145.

Note: the following examples use master1 and master3 for illustration.

1. On master1,

kubectl get nodes

takes ~45 s:

<code>[root@master1 ~]$ time kubectl get nodes
NAME      STATUS   ROLES   AGE   VERSION
master1   Ready    <none>  100d  v1.14.8
master2   Ready    <none>  100d  v1.14.8
master3   Ready    <none>  100d  v1.14.8
node1     Ready    <none>  100d  v1.14.8
node2     Ready    <none>  100d  v1.14.8

real    0m45.0s</code>

On master3 the same command returns in less than half a second:

<code>[root@master3 ~]$ time kubectl get nodes
NAME      STATUS   ROLES   AGE   VERSION
master1   Ready    <none>  100d  v1.14.8
master2   Ready    <none>  100d  v1.14.8
master3   Ready    <none>  100d  v1.14.8
node1     Ready    <none>  100d  v1.14.8
node2     Ready    <none>  100d  v1.14.8

real    0m0.452s</code>

2. Apiserver logs on master1 show repeated timeouts when contacting the metrics‑server:

<code>E1124 11:40:21.145 1 available_controller.go:353] v1beta1.custom.metrics.k8s.io failed with: Get https://10.68.225.236:443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)</code>

Telnet to the metrics server succeeds from master3 but hangs on master1:

<code>[root@master1 ~]$ telnet 10.68.225.236 443
Trying 10.68.225.236...</code>
<code>[root@master3 ~]$ telnet 10.68.225.236 443
Trying 10.68.225.236...
Connected to 10.68.225.236.</code>
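Telnet hangs indefinitely when the path is broken, which makes sweeping every service IP from every node tedious. A bounded TCP probe is easier to script; a minimal sketch (the helper name `can_connect` is my own):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# Run from each master against the metrics-server service IP:
# can_connect("10.68.225.236", 443)  # hangs->False on master1, True on master3
```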

3. The metrics‑server pods run on node1‑3; master1 cannot reach their flannel.1 IPs, indicating a cross‑host communication problem.

4. The ARP and FDB tables reveal the problem: on master3, the neighbor entry for master1's flannel.1 subnet (10.68.0.0) is INCOMPLETE, meaning master1's VTEP MAC is absent from the table:

<code># master1 ARP/FDB (partial)
ip neigh | grep flannel.1
10.68.1.0 dev flannel.1 lladdr 16:23:8e:ab:c6:5c PERMANENT
...
bridge fdb | grep flannel.1
32:a3:2e:8e:fb:2e dev flannel.1 dst 192.168.1.143 self permanent
...</code>
<code># master3 ARP/FDB (partial)
ip neigh | grep flannel.1
10.68.0.0 dev flannel.1 INCOMPLETE
...
bridge fdb | grep flannel.1
36:9e:9c:53:4a:10 dev flannel.1 dst 192.168.1.140 self permanent
...</code>
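Eyeballing these tables host by host does not scale past a handful of nodes. A small parser over `ip neigh show dev flannel.1` output can flag subnets whose VTEP MAC never resolved; a minimal sketch (the sample string mirrors the master3 output above, and the function name is my own):

```python
def missing_vteps(ip_neigh_output: str) -> list:
    """Return flannel subnet gateway IPs whose neighbor entry has no
    resolved MAC (state INCOMPLETE/FAILED, or no lladdr field at all)."""
    missing = []
    for line in ip_neigh_output.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        # A healthy flannel entry looks like:
        #   10.68.1.0 dev flannel.1 lladdr 16:23:8e:ab:c6:5c PERMANENT
        if "lladdr" not in fields or fields[-1] in ("INCOMPLETE", "FAILED"):
            missing.append(fields[0])
    return missing

# Output captured on master3: master1's subnet (10.68.0.0) never resolved.
master3_neigh = """\
10.68.0.0 dev flannel.1 INCOMPLETE
10.68.1.0 dev flannel.1 lladdr 16:23:8e:ab:c6:5c PERMANENT
"""
print(missing_vteps(master3_neigh))  # ['10.68.0.0']
```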

5. Packet captures confirm the one-way path: ICMP echo requests leave master1's flannel.1 and arrive at master3 (seen on eth0 inside the VXLAN tunnel), but master3 never sends a reply because it lacks the neighbor entry for master1's flannel.1:

<code># master1 tcpdump on flannel.1
IP 10.68.0.0 > 10.68.1.0: ICMP echo request, id 28296, seq 1, length 64</code>
<code># master3 tcpdump on eth0
IP 10.68.0.0 > 10.68.1.0: ICMP echo request, id 30689, seq 1, length 64</code>
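Why exactly does the reply die on master3? To encapsulate a packet over VXLAN, flannel.1 needs two lookups in sequence: the neighbor (ARP) table maps the destination subnet's gateway IP to a VTEP MAC, and the bridge FDB maps that MAC to a physical host IP. If the first lookup fails, the kernel never reaches the FDB and the packet is dropped. A toy model of the two-stage lookup (table contents are illustrative, not captured from the cluster):

```python
def vxlan_next_hop(dst_ip, arp_table, fdb_table):
    """Resolve flannel dst IP -> VTEP MAC -> remote host IP.
    Returns None if either lookup fails (packet is dropped)."""
    mac = arp_table.get(dst_ip)   # stage 1: neighbor (ARP) lookup
    if mac is None:
        return None
    return fdb_table.get(mac)     # stage 2: bridge FDB lookup

# master3's view: the FDB knows master1's host IP, but the ARP entry
# for 10.68.0.0 is missing, so the echo reply can never be encapsulated.
arp = {"10.68.1.0": "16:23:8e:ab:c6:5c"}            # 10.68.0.0 absent
fdb = {"36:9e:9c:53:4a:10": "192.168.1.140",
       "16:23:8e:ab:c6:5c": "192.168.1.143"}
print(vxlan_next_hop("10.68.0.0", arp, fdb))  # None -> reply dropped
```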

6. Deleting the canal pods on the affected nodes forces the DaemonSet to recreate them, which repopulates the ARP/FDB entries:

<code>kubectl delete po -n kube-system canal-xxx</code>

After the restart, the missing MAC address appears in the tables and

kubectl get nodes

on master1 completes in under a second.
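Restarting the pods fixes the symptom, but the same drift can recur silently. Flannel's VXLAN backend advertises each node's VTEP MAC in the `flannel.alpha.coreos.com/backend-data` node annotation, so one way to catch stale tables early is to compare those annotations against the local neighbor table. A hedged sketch (the input dicts stand in for `kubectl get node -o json` and `ip neigh` output; the function name is my own):

```python
import json

def stale_nodes(node_annotations, local_neigh_macs):
    """Return node names whose advertised VtepMAC is absent from the
    local flannel.1 neighbor table (i.e. unreachable over VXLAN)."""
    known = {m.lower() for m in local_neigh_macs}
    stale = []
    for node, raw in node_annotations.items():
        # Annotation value is JSON like: {"VtepMAC":"36:9e:9c:53:4a:10"}
        vtep_mac = json.loads(raw)["VtepMAC"]
        if vtep_mac.lower() not in known:
            stale.append(node)
    return stale

# Illustrative data: master1's advertised MAC is missing locally.
annotations = {
    "master1": '{"VtepMAC":"36:9e:9c:53:4a:10"}',
    "master3": '{"VtepMAC":"16:23:8e:ab:c6:5c"}',
}
print(stale_nodes(annotations, ["16:23:8e:ab:c6:5c"]))  # ['master1']
```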

Flannel VXLAN data flow diagram

Tags: kubernetes, Network Troubleshooting, Canal, Flannel, ARP
Written by Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
