
The Mystery of the 3‑Minute NotReady State in Kubernetes Nodes

This article examines a recurring Kubernetes issue in which a node stays Ready for only about three minutes after a kubelet restart before flipping back to NotReady. It explains the underlying PLEG and node-status mechanisms and points to a free, in-depth Kubernetes handbook for further study.

Linux Cloud Computing Practice

Preface

Alibaba Cloud provides its own Kubernetes container service. With the rapid growth of Kubernetes clusters, some users occasionally encounter nodes that enter a NotReady state.

These incidents occur roughly once or twice per month per customer. While a node is NotReady, the cluster master cannot schedule new Pods onto it or retrieve real-time information from it.

Although a recent systemd fix has reduced the frequency of a previously reported issue, other NotReady problems still exist.

Problem Symptoms

The symptom is a node turning NotReady. Restarting the node temporarily resolves the issue, but it reappears after about 20 days.

Once the problem has reappeared, restarting the kubelet makes the node Ready for only about three minutes before it flips back to NotReady.

Overall Logic

Four core components determine a node's readiness in a Kubernetes cluster: the etcd database, the API Server, the node controller, and the kubelet running on each node.

The kubelet plays two roles. As a cluster controller, it periodically fetches pod specifications from the API Server and manages pod execution. As a node-status monitor, it reports node conditions back to the API Server.
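The node conditions the kubelet reports end up in the Node object's `status.conditions` list. As a minimal sketch of how one might inspect them, the Python below reads the `Ready` condition from a Node object in the shape returned by `kubectl get node <name> -o json`. The sample node, its name, and its message text are invented for illustration, and the helper names `ready_condition` and `is_ready` are hypothetical.

```python
import json

# Hypothetical, trimmed-down Node object; the values are invented for illustration.
SAMPLE_NODE = json.loads("""
{
  "metadata": {"name": "node-1"},
  "status": {
    "conditions": [
      {"type": "MemoryPressure", "status": "False",
       "reason": "KubeletHasSufficientMemory"},
      {"type": "Ready", "status": "False", "reason": "KubeletNotReady",
       "message": "PLEG is not healthy"}
    ]
  }
}
""")


def ready_condition(node):
    """Return the node's Ready condition dict, or None if it is absent."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond
    return None


def is_ready(node):
    """A node counts as Ready only when the Ready condition's status is "True"."""
    cond = ready_condition(node)
    return cond is not None and cond.get("status") == "True"


if __name__ == "__main__":
    cond = ready_condition(SAMPLE_NODE)
    print(SAMPLE_NODE["metadata"]["name"], "Ready =", is_ready(SAMPLE_NODE))
    print("reason:", cond["reason"], "| message:", cond["message"])
```

The condition's `status` is a string (`"True"`, `"False"`, or `"Unknown"`), so the check compares against `"True"` rather than relying on truthiness.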

Node readiness is reported via the NodeStatus mechanism, which relies heavily on the Pod Lifecycle Event Generator (PLEG). PLEG periodically checks container status and generates events for the kubelet. If PLEG fails to complete a check within its timeout, NodeStatus marks the node as NotReady.
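The relationship between the periodic check and the readiness decision can be sketched as a small toy model. This is not the kubelet's actual code: the class `ToyPleg`, its methods, and the function `node_status` are hypothetical names, time is passed in explicitly as seconds so the behaviour is deterministic, and treating "no check started yet" as healthy is an assumption of the sketch. The 180-second default mirrors the three-minute timeout described in this article.

```python
class ToyPleg:
    """Toy model of a periodic container check with a staleness-based health test."""

    def __init__(self, timeout_s=180.0):
        self.timeout_s = timeout_s
        self.last_check_start = None  # time (seconds) at which the latest check began

    def begin_check(self, now):
        """Record that a new check pass has started at time `now`."""
        self.last_check_start = now

    def healthy(self, now):
        """Healthy while the latest check started no more than timeout_s ago."""
        if self.last_check_start is None:
            return True  # assumption: nothing to judge before the first check
        return (now - self.last_check_start) <= self.timeout_s


def node_status(pleg, now):
    """Map the PLEG health test onto the Ready / NotReady label reported for the node."""
    return "Ready" if pleg.healthy(now) else "NotReady"


if __name__ == "__main__":
    pleg = ToyPleg()
    pleg.begin_check(now=0)          # a check starts and then hangs
    for t in (1, 60, 180, 181):
        print(f"t={t:>3}s  {node_status(pleg, t)}")
```

In a healthy node, `begin_check` would be called again roughly every second, so the elapsed time never approaches the timeout. A check that hangs stops those updates, and the status flips once the timeout is exceeded.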

Three‑Minute Ready Window

After the kubelet restarts, the node initially reports Ready. The first PLEG check, however, hangs and never completes. Three minutes is the default PLEG check timeout; once it expires without a completed check, the node is reported as NotReady.

The diagram below shows the normal PLEG flow versus the problematic flow where the initial check hangs, leading to the three‑minute delay before the NotReady status is propagated.

Key PLEG Parameters

The check interval is one second by default.

The check timeout is three minutes; if a check exceeds this duration, NodeStatus treats the node as NotReady.

This explains why the node stays Ready for about three minutes after a kubelet restart: the first check starts right away, never completes, and hits the timeout three minutes later.
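The arithmetic behind the three-minute window can be walked through with a short simulation. It is a sketch under stated assumptions rather than kubelet code: the constant and function names are hypothetical, time advances in whole seconds, and a check is considered timed out once strictly more than the timeout has elapsed since it started.

```python
CHECK_INTERVAL_S = 1     # default check interval described above
CHECK_TIMEOUT_S = 180    # default check timeout described above (three minutes)


def first_not_ready_second(hang_at_s, horizon_s=600):
    """Return the first second at which the node would be reported NotReady.

    Checks start every CHECK_INTERVAL_S seconds until `hang_at_s`; the check
    that starts at `hang_at_s` never finishes, so no later check starts.
    Returns None if the timeout is never exceeded within `horizon_s` seconds.
    """
    last_check_start = 0
    for now in range(horizon_s + 1):
        if now <= hang_at_s and now % CHECK_INTERVAL_S == 0:
            last_check_start = now  # a fresh check begins
        if now - last_check_start > CHECK_TIMEOUT_S:
            return now
    return None


if __name__ == "__main__":
    # The very first check after a kubelet restart hangs.
    print("hang at restart :", first_not_ready_second(0))
    # Checks run normally for 50 seconds, then one hangs.
    print("hang after 50 s :", first_not_ready_second(50))
```

With a hang at restart, the sketch reports NotReady at second 181, the first tick after the 180-second timeout has been exceeded, which matches the roughly three-minute Ready window seen in practice.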

For readers who want a deeper dive, a free "Deep Dive into Kubernetes" handbook is available for download, covering twelve technical articles on cluster control, scaling, image pulling, and more.

Tags: cloud-native, Kubernetes, Troubleshooting, Kubelet, PLEG, Node NotReady
Written by

Linux Cloud Computing Practice
