Why Did My Kubernetes Node Stay NotReady? OOM Killer, PLEG, and Fixes
A high‑load Kubernetes node entered NotReady due to repeated OOM‑killer activity, daemonset restarts, and PLEG health failures, and the article walks through diagnosis, log analysis, root‑cause explanation, and practical remediation steps to restore node readiness.
Problem Overview
The alert indicated a host in NotReady state with high CPU, memory, and load. Investigation showed several pods stuck in "Terminating" or restart loops, and the node could not be accessed via SSH.
Investigation Steps
1. Logged into the cluster and confirmed the node status was NotReady.
2. Observed pods stuck in deletion and restart loops.
3. Checked Huawei Cloud monitoring and saw CPU, memory, and load metrics all high.
4. Collected kernel OOM logs (see code block below).
5. Verified that the kubelet memory reservation was only 100 MiB and that the sum of pod memory limits exceeded the node's capacity.
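The kernel log collected in step 4 can be sifted with ordinary text tools. A minimal sketch that pulls the victim process, PID, and resident memory out of one of the sample lines (the sed pattern is illustrative, not part of the original runbook):

```shell
# Pick out process name, PID, and resident memory from a kernel OOM "Killed process" line.
# The sample line comes from the log excerpt in this article.
line='Jun 15 16:52:31 online-1-19-10-30740-b7pqf kernel: Killed process 18271 (java), UID 1000, total-vm:10065552kB, anon-rss:5265332kB'
parsed=$(echo "$line" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\2 pid=\1 rss_kb=\3/p')
echo "$parsed"   # java pid=18271 rss_kb=5265332
```

Run over the whole log (e.g., `grep 'Killed process' /var/log/messages`), this quickly shows which processes the OOM killer targeted and how large they were when killed.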
<code>Jun 15 16:50:50 online-1-19-10-30740-b7pqf kernel: Out of memory: Kill process 3241 (filebeat) score 1004 or sacrifice child
Jun 15 16:50:50 online-1-19-10-30740-b7pqf kernel: Killed process 3241 (filebeat), UID 0, total-vm:1702220kB, anon-rss:75196kB
Jun 15 16:52:31 online-1-19-10-30740-b7pqf kernel: Out of memory: Kill process 18271 (java) score 1002 or sacrifice child
Jun 15 16:52:31 online-1-19-10-30740-b7pqf kernel: Killed process 18271 (java), UID 1000, total-vm:10065552kB, anon-rss:5265332kB
... (additional OOM entries omitted for brevity) ...</code>
Root Cause Analysis
The kubelet memory reservation of only 100 MiB combined with pod limits far exceeding the node’s physical memory caused the node to run out of memory when a pod was being terminated during a deployment, triggering an OOM event.
The Linux OOM killer selects victims based on oom_score. It first killed the daemonset-managed filebeat process, but the daemonset immediately recreated it, leading to another OOM cycle.
Continuous OOM‑killer activity kept the node’s load high, causing the kubelet PLEG (Pod Lifecycle Event Generator) to become unhealthy; after >3 minutes the node was marked NotReady by the control plane.
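The overcommit behind this failure is easy to see with back-of-the-envelope arithmetic. In the sketch below, only the 100 MiB reservation comes from the incident; the node capacity and the sum of pod limits are assumed numbers for illustration:

```shell
# Illustrative only: assumed node capacity and pod limits, not figures from the incident.
node_capacity_mib=16384          # assumed 16 GiB node
kube_reserved_mib=100            # the reservation actually configured (from the article)
sum_pod_limits_mib=24576         # assumed sum of pod memory limits scheduled on the node
allocatable_mib=$((node_capacity_mib - kube_reserved_mib))
overcommit_pct=$((100 * sum_pod_limits_mib / allocatable_mib))
echo "allocatable=${allocatable_mib}MiB limits=${sum_pod_limits_mib}MiB overcommit=${overcommit_pct}%"
```

With limits summing to well over allocatable memory, any transient spike, such as a pod needing extra memory during termination, can push the node past its physical capacity.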
Solution
Manually migrate business pods to other nodes, then restart the affected node. After reboot, CPU, memory, and load returned to normal and the node status became Ready.
Restart kubelet to clear the "PLEG is not healthy" error caused by container‑runtime communication failure.
Optimizations
Increase kubelet memory reservation to 1 GiB to prevent kubelet crashes from OOM.
Set explicit memory requests and limits for daemonsets (e.g., filebeat) to avoid uncontrolled memory consumption.
Adjust pod limits so they stay well within the node’s physical limits, preventing the node from operating near the OOM threshold.
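For the first optimization, raising the reservation is typically done through the kubelet's file-based configuration (`kubelet.config.k8s.io/v1beta1`). The exact values below are illustrative and should be sized to the node:

```yaml
# Fragment of a KubeletConfiguration; values are illustrative, not from the incident.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: "1Gi"     # raised from the 100 MiB that let the node overcommit
  cpu: "500m"       # illustrative; size to the node
evictionHard:
  memory.available: "200Mi"   # illustrative eviction threshold
```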
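For the daemonset optimization, an explicit resources stanza on the filebeat container might look like the following; the sizes are placeholders to be tuned against observed usage:

```yaml
# Container resources for the filebeat daemonset; sizes are illustrative.
resources:
  requests:
    memory: "100Mi"
    cpu: "100m"
  limits:
    memory: "200Mi"   # an OOM kill now affects only this container, not the node
    cpu: "500m"
```

With a limit set, a runaway filebeat is killed and restarted within its own cgroup instead of pushing the whole node into the kernel OOM killer's path.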
Key Takeaways
When a pod is in "Terminating" it may need extra resources to finish cleanup; insufficient node memory can cause OOM during this phase.
Cluster resource utilization should leave headroom on each node, not just overall cluster usage, to handle transient spikes.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.