
Why Does Kubelet Mark Nodes NotReady After 3 Minutes? A Deep Dive into PLEG and Terway Issues

This article analyzes a recurring Kubernetes node NotReady problem in which restarting kubelet restores readiness for only three minutes. It explains the underlying PLEG timeout mechanism, traces the failure to an unresponsive Terway CNI plugin, and presents a netlink-timeout fix that prevents the issue.

Alibaba Cloud Native

Background

Alibaba Cloud provides a Kubernetes container‑cluster service. As the number of clusters grew, customers occasionally observed nodes entering the NotReady state. The issue appeared roughly once or twice per month, and when a node became NotReady the master could no longer schedule pods or retrieve pod status from that node.

Problem Phenomenon

The symptom is that a node turns NotReady; a manual kubelet restart makes it Ready again, but the readiness lasts only about three minutes before the node flips back to NotReady. The whole cycle then recurs roughly every twenty days.

Core Cluster Components

Kubernetes node readiness depends on four core components: the etcd key‑value store, the API Server, the node controller, and the kubelet running on each node.

PLEG Mechanism

kubelet uses the NodeStatus mechanism to report node health to the API Server. A key part of this is the Pod Lifecycle Event Generator (PLEG), which periodically inspects pod status. By default PLEG relists every second and is considered unhealthy if a relist has not completed within three minutes; once that threshold is crossed, NodeStatus reports the node as NotReady.

Observed Logs

17:08:22.299597 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.000091019s ago; threshold is 3m0s]
17:08:22.399758 kubelet skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.100259802s ago; threshold is 3m0s]
…

The logs show repeated “PLEG is not healthy” messages exactly at the three‑minute mark.

Kubelet Stack Traces

kubelet: k8s.io/kubernetes/vendor/google.golang.org/grpc/transport.(*Stream).Header()
…
 k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist()
 k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).updateCache()

These traces reveal that the PLEG relist function is stuck, causing the timeout.

Terwayd Interaction

Terway is the CNI plugin used by Alibaba Cloud. Terwayd is the daemon that implements the CNI server side, similar to the relationship between flannel and flanneld. When configuring pod networking, kubelet calls Terway, but Terwayd sometimes becomes unresponsive, leaving thousands of Terway processes waiting.

kubelet: net/rpc.(*Client).Call()
…
github.com/AliyunContainerService/terway/vendor/github.com/containernetworking/cni/pkg/skel.PluginMain()

Sending SIGABRT to Terwayd prints a stack trace that shows many threads blocked on a recvfrom system call.

Root‑Cause Analysis

The blocking occurs in the netlink library, which communicates with the kernel via sockets. A specific goroutine holds a lock while waiting on recvfrom for over 1,595 minutes, preventing other pod‑network operations from completing.

goroutine 67570 [syscall, 1595 minutes, locked to thread]:
syscall.Syscall6()
…
github.com/AliyunContainerService/terway/daemon.SetupVethPair()

Because PLEG's relist blocks behind the same stuck pod-network operations, the netlink blockage propagates to the node heartbeat: after a kubelet restart the first relist hangs, and the node drops back to NotReady exactly three minutes later. This explains the three-minute Ready window.

Fix

The proposed fix starts from the assumption that netlink is not 100% reliable: adding explicit timeouts to netlink operations prevents Terwayd from blocking forever, so pod-network setup fails fast with an error instead of hanging, and kubelet can continue reporting node health.
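The mechanism behind such a timeout is a standard socket option. The sketch below is an assumption-laden illustration, not the actual Terway patch: it uses a Unix datagram socket pair as a stand-in for a netlink socket (the timeout machinery is identical) and arms `SO_RCVTIMEO` so that a receive the peer never answers returns an error instead of blocking the thread forever.

```go
package main

import (
	"fmt"
	"syscall"
)

// recvWithTimeout creates a datagram socket pair, sets a receive timeout
// on one end, and tries to read from it even though the peer never sends.
// With SO_RCVTIMEO armed, recvfrom returns EAGAIN after ms milliseconds
// rather than hanging indefinitely.
func recvWithTimeout(ms int64) error {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_DGRAM, 0)
	if err != nil {
		return err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	// The essence of the fix: bound how long recvfrom may block.
	tv := syscall.Timeval{Sec: ms / 1000, Usec: (ms % 1000) * 1000}
	if err := syscall.SetsockoptTimeval(
		fds[0], syscall.SOL_SOCKET, syscall.SO_RCVTIMEO, &tv); err != nil {
		return err
	}

	_, _, err = syscall.Recvfrom(fds[0], make([]byte, 64), 0)
	return err // EAGAIN after the timeout, not a permanent hang
}

func main() {
	fmt.Println(recvWithTimeout(200))
}
```

With a bound like this in place, a lost netlink reply surfaces as a retryable error in Terwayd instead of a goroutine stuck in `syscall.Syscall6` for 1,595 minutes.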

Conclusion

Kubelet implements a heartbeat that periodically syncs node metrics, including readiness, to the control plane. Plugins such as CNI (Terway) and other system components directly affect this heartbeat. Ensuring that plugin calls have proper timeouts avoids the three‑minute NotReady cascade demonstrated in this case.

Tags: cloud-native, Kubernetes, troubleshooting, CNI, Terway, PLEG, NodeReady
Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
