Why Does PLEG ‘Not Healthy’ Make a Kubernetes Node NotReady?
This article explains the role of the Pod Lifecycle Event Generator (PLEG) in Kubelet, why the “PLEG is not healthy” error causes nodes to become NotReady, common failure scenarios, and a step‑by‑step troubleshooting method that ultimately resolves the issue by upgrading systemd.
Problem Description
Environment: Ubuntu 18.04 with a self-built Kubernetes 1.18 cluster using Docker as the container runtime. A node repeatedly becomes NotReady, and kubectl describe node shows the error “PLEG is not healthy: pleg was last seen active 3m46.752815514s ago; threshold is 3m0s”, recurring every 5-10 minutes.
What is PLEG?
PLEG (Pod Lifecycle Event Generator) is a module inside Kubelet. Its main responsibility is to watch pod‑level events, reconcile the container runtime state, and update the pod cache so that the cache reflects the latest pod status.
Kubelet runs on each node and must react promptly to two sources of change:
Changes defined in the pod spec.
Changes in the container runtime state.
To keep the cache up‑to‑date, Kubelet watches the pod spec and periodically polls the container runtime (default every 10 seconds). As the number of pods grows, this polling creates significant CPU overhead and can overload the runtime, reducing reliability and scalability. PLEG was introduced to reduce this overhead by limiting work during idle periods and decreasing concurrent runtime queries.
Kubelet acts both as a cluster controller (fetching resources from the API server and driving pod execution) and as a node‑status monitor (periodically reporting node health to the API server). The NodeStatus mechanism relies heavily on PLEG to decide whether a node is Ready.
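To make the pod-cache idea concrete, here is a minimal, hypothetical Go sketch of a runtime pod cache: PLEG writes fresh statuses after each runtime check, and sync workers read them without having to query the container runtime themselves. The type and field names are simplifications for illustration, not kubelet's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// PodStatus is a pared-down stand-in for the status kubelet caches per pod.
type PodStatus struct {
	Phase string
}

// cache mirrors the idea of kubelet's runtime pod cache: one writer
// (PLEG, after each relist) and many readers (the pod sync workers).
type cache struct {
	mu   sync.RWMutex
	pods map[string]PodStatus
}

// set is the PLEG-side path: store the freshly observed status.
func (c *cache) set(uid string, s PodStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pods[uid] = s
}

// get is the sync-loop path: read the cached status without touching
// the container runtime.
func (c *cache) get(uid string) (PodStatus, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.pods[uid]
	return s, ok
}

func main() {
	c := &cache{pods: map[string]PodStatus{}}
	c.set("pod-123", PodStatus{Phase: "Running"}) // PLEG writes after relist
	if s, ok := c.get("pod-123"); ok {            // sync worker reads
		fmt.Println("cached phase:", s.Phase)
	}
}
```

Because reads dominate, an RWMutex lets many sync workers read concurrently while PLEG updates the cache only after each runtime check.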
PLEG Workflow
PLEG periodically checks the state of pods on the node (the relist operation). If it detects changes, it packages them into events and sends them to Kubelet’s main sync loop. When a relist does not complete within the health threshold, the NodeStatus update reports the node as NotReady.
Why “PLEG is not healthy” Happens
The error indicates that the container runtime (the Docker daemon) is unhealthy, causing PLEG’s health check to fail. Historically Docker ran as a monolithic daemon, but modern Docker delegates container lifecycle management to containerd and runc. PLEG checks the runtime by invoking its relist() routine, which queries the runtime for the state of all containers, roughly the equivalent of running docker ps and docker inspect on every container.
Common Scenarios Leading to the Error
Container runtime becomes unresponsive or times out (e.g., Docker daemon hangs).
Too many containers on the node, causing the relist process to exceed the 3‑minute timeout.
A deadlock bug in relist (fixed in Kubernetes 1.14).
Network issues on the node (for example, CNI plugin problems).
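The "too many containers" scenario can be illustrated with a toy model. The per-container cost below is an assumed number chosen purely for illustration, not a measured kubelet figure; the point is that a serial inspect pass grows linearly with container count and can cross the 3-minute threshold.

```go
package main

import (
	"fmt"
	"time"
)

// perContainerCost is a purely illustrative assumption: how long one
// inspect call takes when the daemon is under load.
const perContainerCost = 500 * time.Millisecond

// relistThreshold is kubelet's default PLEG health threshold.
const relistThreshold = 3 * time.Minute

// relistDuration estimates one relist pass in this toy model: a fixed
// listing cost plus one inspect per container, done serially.
func relistDuration(containers int) time.Duration {
	const listCost = 2 * time.Second // one listing over all containers
	return listCost + time.Duration(containers)*perContainerCost
}

func main() {
	for _, n := range []int{50, 200, 500} {
		d := relistDuration(n)
		fmt.Printf("%d containers: relist ~%v, exceeds threshold: %v\n",
			n, d, d > relistThreshold)
	}
}
```

Under these assumed costs, 50 containers finish comfortably, while 500 containers push a single pass past the threshold, which is exactly when the PLEG error appears even though nothing is "broken".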
Investigation Steps
1. On the problematic node, top shows a scope process consuming 100% CPU. This is a systemd scope unit, which systemd uses to manage a group of externally created processes.
2. docker ps hangs, confirming the container runtime is stuck.
3. Alibaba’s Kubernetes troubleshooting guide links this symptom to systemd.
What is D‑Bus?
D‑Bus is an inter‑process communication (IPC) mechanism on Linux; systemd exposes its management API, including the creation of scope units, over D‑Bus.
RunC and D‑Bus Interaction
runc (the low‑level container runtime) writes to a D‑Bus socket using an org.freedesktop interface, and when systemd misbehaves this call can block indefinitely.
Resolution
Restarting systemd with systemctl daemon-reexec clears the blockage, and the node returns to Ready. The root cause is a bug in the systemd version in use; upgrading systemd to v242‑rc2 and rebooting the host resolves the issue permanently.
Summary
The “PLEG is not healthy” error is often triggered by a malfunctioning systemd, which prevents the container runtime from responding. Upgrading systemd to a newer version and restarting it fixes the problem, restoring node health.
References
Kubelet: Pod Lifecycle Event Generator (PLEG)
Kubelet: Runtime Pod Cache
relist() in kubernetes/pkg/kubelet/pleg/generic.go
Past bug about CNI: PLEG is not healthy error, node marked NotReady
https://www.infoq.cn/article/t_ZQeWjJLGWGT8BmmiU4
https://cloud.tencent.com/developer/article/1550038
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.