Why Does PLEG ‘Not Healthy’ Make a Kubernetes Node NotReady?
This article explains the role of the Pod Lifecycle Event Generator (PLEG) in Kubelet, why the “PLEG is not healthy” error causes nodes to become NotReady, common failure scenarios, and a step‑by‑step troubleshooting method that ultimately resolves the issue by upgrading systemd.
Problem Description
Environment: Ubuntu 18.04 with a self‑built Kubernetes 1.18 cluster using Docker as the container runtime. A node repeatedly becomes NotReady, and kubectl describe node shows the error “PLEG is not healthy: pleg was last seen active 3m46.752815514s ago; threshold is 3m0s”, occurring every 5‑10 minutes.
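The two durations in that message are the time since PLEG was last active and the allowed threshold. As a minimal sketch, the message can be parsed to see how far past the threshold the node is. The function names here are hypothetical helpers, not part of kubectl or Kubelet, and the regular expression assumes the message wording shown above.

```python
import re

# One component of a Go-style duration, e.g. "3m", "46.752815514s" or "150ms".
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}

# Assumed wording, copied from the error message quoted in this article.
_PLEG_MESSAGE = re.compile(r"pleg was last seen active (\S+) ago; threshold is (\S+)")


def parse_go_duration(text):
    """Convert a Go duration string such as '3m46.75s' to seconds."""
    parts = _DURATION_PART.findall(text)
    if not parts:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def parse_pleg_message(message):
    """Return (elapsed, threshold, overshoot) in seconds, or None if no match."""
    match = _PLEG_MESSAGE.search(message)
    if match is None:
        return None
    elapsed = parse_go_duration(match.group(1))
    threshold = parse_go_duration(match.group(2))
    return elapsed, threshold, elapsed - threshold
```

Feeding it the message above yields an elapsed time of roughly 226.75 seconds against a 180-second threshold, that is, PLEG was about 47 seconds overdue when the condition was recorded.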
What is PLEG?
PLEG (Pod Lifecycle Event Generator) is a module inside Kubelet. Its main responsibility is to watch pod‑level events, reconcile the container runtime state, and update the pod cache so that the cache reflects the latest pod status.
Kubelet runs on each node and must react promptly to two sources of change:
Changes defined in the pod spec.
Changes in the container runtime state.
Before PLEG, Kubelet handled the second source by polling: it watched the pod spec and had each pod worker periodically query the container runtime on its own (for example, every 10 seconds). As the number of pods grows, this per-pod polling creates significant CPU overhead and can overload the runtime, reducing reliability and scalability. PLEG was introduced to reduce this overhead by limiting work during idle periods and decreasing concurrent runtime queries.
Kubelet acts both as a cluster controller (fetching resources from the API server and driving pod execution) and as a node‑status monitor (periodically reporting node health to the API server). The NodeStatus mechanism relies heavily on PLEG to decide whether a node is Ready.
PLEG Workflow
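PLEG periodically lists the state of all pods and containers on the node (a "relist", by default about once per second) and compares the result with the previous listing. If it detects changes, it packages them into pod lifecycle events and sends them to Kubelet’s main sync loop. Each completed relist also updates a "last active" timestamp. When no relist has completed within the threshold (3 minutes, the "threshold is 3m0s" in the error message), PLEG reports itself unhealthy and NodeStatus marks the node as NotReady.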
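The workflow can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the Kubelet code: the class name, the string states, and the event tuples are invented for illustration, and only the 3-minute threshold is taken from the error message.

```python
from dataclasses import dataclass, field

RELIST_THRESHOLD_SECONDS = 180.0  # the "threshold is 3m0s" from the error message


@dataclass
class ToyPLEG:
    """Toy model of a relist loop; names and states are illustrative only."""
    last_seen: dict = field(default_factory=dict)  # container ID -> state at previous relist
    last_relist_at: float = 0.0                    # time the last relist completed

    def relist(self, current, now):
        """Diff the runtime snapshot against the previous one and return events."""
        events = []
        for cid, state in current.items():
            if self.last_seen.get(cid) != state:
                events.append((cid, state))        # new container or state change
        for cid in sorted(self.last_seen.keys() - current.keys()):
            events.append((cid, "removed"))        # container disappeared
        self.last_seen = dict(current)
        self.last_relist_at = now                  # only updated when relist finishes
        return events

    def healthy(self, now):
        """Unhealthy once no relist has completed within the threshold."""
        return now - self.last_relist_at <= RELIST_THRESHOLD_SECONDS
```

The point of the model is the last line of relist(): the timestamp only advances when a relist completes. If the call that produces the snapshot blocks, as it does when the runtime hangs, the timestamp stops moving and healthy() turns false once the threshold has passed.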
Why “PLEG is not healthy” Happens
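The error usually indicates that the container runtime (here, the Docker daemon) is slow or unresponsive, so PLEG cannot complete its relist in time. Historically Docker was a monolithic daemon, but modern Docker delegates container lifecycle management to containerd and runc. relist() is a function of PLEG itself (in Kubelet), not of runc: it asks the container runtime for the state of every pod and container, which is roughly equivalent to running docker ps and docker inspect on all containers. A stall anywhere along the chain of Docker, containerd, and runc therefore stalls relist().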
Common Scenarios Leading to the Error
Container runtime becomes unresponsive or times out (e.g., Docker daemon hangs).
Too many containers on the node, so that a single relist takes longer than the 3‑minute threshold.
A deadlock bug in relist (fixed in Kubernetes 1.14).
Network issues, for example a CNI plugin problem that blocks relist while it retrieves pod network status.
Investigation Steps
1. On the problematic node, top shows a process named scope consuming 100% CPU. A scope is a systemd unit type (systemd.scope) that manages a group of externally created processes.
2. docker ps hangs, confirming the runtime is stuck.
3. Alibaba’s Kubernetes troubleshooting guide links this symptom to systemd, which points the investigation at the communication between runc and systemd.
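Because a hung docker ps blocks the terminal, it can help to run such checks with a time limit. The following is a minimal sketch: probe_command is a hypothetical helper, and the 10-second default is an arbitrary choice, not a Kubernetes or Docker setting.

```python
import subprocess


def probe_command(argv, timeout_seconds=10.0):
    """Run argv and classify the outcome as 'ok', 'failed' or 'hung'.

    'hung' means the command produced no exit status before the timeout,
    which is the symptom seen with `docker ps` in step 2.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "failed"  # the binary is not installed or not on PATH
    return "ok" if result.returncode == 0 else "failed"
```

On the affected node, probe_command(["docker", "ps"]) would be expected to return "hung" while the runtime is stuck, assuming the docker CLI is on PATH.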
What is D‑Bus?
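D‑Bus is an inter‑process communication mechanism on Linux. systemd exposes its management interface over D‑Bus, so other programs can ask it to create and manage units.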
RunC and D‑Bus Interaction
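RunC (the low‑level container runtime used by Docker) communicates with systemd over D‑Bus. In this case, runc was blocked while writing a message addressed to an org.freedesktop destination on the D‑Bus socket, which in turn blocked Docker and PLEG’s relist.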
Resolution
Re‑executing the systemd manager with systemctl daemon-reexec clears the blockage, and the node returns to Ready. This is only a temporary fix: the root cause is a bug in the installed systemd version. Upgrading systemd to v242‑rc2 and rebooting the host resolves the issue permanently.
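To find nodes that still run an older systemd, the first line of systemctl --version can be compared with the fixed release. This is a rough sketch: the helper names are hypothetical, and treating any major version of 242 or higher as fixed is an assumption, since distributions may backport the fix to older packages.

```python
import re

FIXED_MAJOR_VERSION = 242  # the article reports the fix in systemd v242-rc2


def parse_systemd_major(version_output):
    """Extract the major version from the output of `systemctl --version`."""
    lines = version_output.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"systemd (\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognised version output: {first_line!r}")
    return int(match.group(1))


def needs_upgrade(version_output):
    """True if the reported major version predates the fixed release."""
    return parse_systemd_major(version_output) < FIXED_MAJOR_VERSION
```

The input would typically come from running systemctl --version on each node and capturing its standard output.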
Summary
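The “PLEG is not healthy” error means that PLEG could not finish relisting containers within its 3‑minute threshold, usually because the container runtime is not responding. In this case the runtime was blocked by a malfunctioning systemd. Running systemctl daemon-reexec restored the node temporarily, and upgrading systemd to a fixed version and rebooting resolved the problem permanently.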
References
Kubelet: Pod Lifecycle Event Generator (PLEG)
Kubelet: Runtime Pod Cache
relist() in kubernetes/pkg/kubelet/pleg/generic.go
Past bug about CNI: PLEG is not healthy error, node marked NotReady
https://www.infoq.cn/article/t_ZQeWjJLGWGT8BmmiU4
https://cloud.tencent.com/developer/article/1550038
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.