Why Does a Kubernetes Node Stay Ready Only 3 Minutes After Restart?
This article examines a recurring Kubernetes node NotReady issue where nodes become ready for only three minutes after a kubelet restart, detailing the underlying PLEG mechanism, component interactions, and diagnostic steps to resolve the problem.
Anyone who follows cloud computing knows Docker and Kubernetes have risen to prominence, and major public cloud providers now offer managed Kubernetes services. Kubernetes is powerful and highly extensible, often seen as the ultimate cloud‑native solution.
Preface
Alibaba Cloud provides its own Kubernetes container‑cluster product. With the rapid increase of Kubernetes deployments, some users have sporadically observed nodes entering a NotReady state.
Typically, one to two customers encounter this issue each month. After a node becomes NotReady, the cluster master cannot control the node—no new Pods can be scheduled, and real‑time information about running Pods cannot be retrieved.
Although this particular problem has been fixed in systemd, other node‑readiness issues still occur for different reasons.
Problem Phenomenon
The symptom is a node turning NotReady again after about 20 days. Restarting the node temporarily resolves it, but the issue reappears.
Restarting the kubelet makes the node Ready for only three minutes before it reverts to NotReady.
The Big Picture: How Node Readiness Works
Four core components affect node readiness: the etcd database, the API Server, the node controller, and the kubelet running on each node.
The kubelet acts both as a cluster controller—periodically fetching Pod specs from the API Server and managing Pod lifecycles—and as a node‑status monitor, reporting node conditions back to the API Server.
The kubelet reports node health through the NodeStatus mechanism, which depends heavily on the Pod Lifecycle Event Generator (PLEG). PLEG periodically relists the containers on the node and delivers any state changes as events to the kubelet's sync loop. If a relist fails to complete within the health threshold, NodeStatus marks the node NotReady.
Three‑Minute Ready Window
After restarting kubelet, the node remains Ready for exactly three minutes before becoming NotReady again. This aligns with PLEG’s default timeout of three minutes: if a PLEG check does not finish within that period, NodeStatus reports the node as NotReady.
The official PLEG diagram shows two processes: (1) kubelet fetching Pod spec changes from the API Server and creating or terminating Pods, and (2) PLEG periodically checking container status and feeding events back to kubelet.
- The kubelet, acting as a controller, creates or terminates Pods based on spec changes from the API Server.
- PLEG periodically checks container status and reports events back to the kubelet.
PLEG relists every second, and the health check allows three minutes for a relist to complete. In this failure mode, the first relist after a kubelet restart hangs on the container runtime, so the three-minute threshold expires and the node is marked NotReady.