Why Did My Kubernetes Node Stay NotReady? OOM Killer, PLEG, and Fixes
A high‑load Kubernetes node entered NotReady due to repeated OOM‑killer activity, daemonset restarts, and PLEG health failures, and the article walks through diagnosis, log analysis, root‑cause explanation, and practical remediation steps to restore node readiness.
Problem Overview
The alert indicated a host in NotReady state with high CPU, memory, and load. Investigation showed several pods stuck in "Terminating" or restart loops, and the node could not be accessed via SSH.
Investigation Steps
1. Logged into the cluster and confirmed the node status was NotReady.
2. Observed pods stuck in deletion and restart loops.
3. Checked Huawei Cloud monitoring and saw CPU, memory, and load metrics all high.
4. Collected kernel OOM logs (see code block below).
5. Verified that the kubelet memory reservation was only 100 MiB and that the sum of pod memory limits exceeded the node's capacity.
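The kernel log collected in step 4 can be sifted with ordinary text tools. A minimal sketch that pulls the victim process, PID, and resident memory out of one of the sample lines (the sed pattern is illustrative, not part of the original runbook):

```shell
# Pick out process name, PID, and resident memory from a kernel OOM "Killed process" line.
# The sample line comes from the log excerpt in this article.
line='Jun 15 16:52:31 online-1-19-10-30740-b7pqf kernel: Killed process 18271 (java), UID 1000, total-vm:10065552kB, anon-rss:5265332kB'
parsed=$(echo "$line" | sed -n 's/.*Killed process \([0-9]*\) (\([^)]*\)).*anon-rss:\([0-9]*\)kB.*/\2 pid=\1 rss_kb=\3/p')
echo "$parsed"   # java pid=18271 rss_kb=5265332
```

Run over the whole log (e.g., `grep 'Killed process' /var/log/messages`), this quickly shows which processes the OOM killer targeted and how large they were when killed.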
<code>Jun 15 16:50:50 online-1-19-10-30740-b7pqf kernel: Out of memory: Kill process 3241 (filebeat) score 1004 or sacrifice child
Jun 15 16:50:50 online-1-19-10-30740-b7pqf kernel: Killed process 3241 (filebeat), UID 0, total-vm:1702220kB, anon-rss:75196kB
Jun 15 16:52:31 online-1-19-10-30740-b7pqf kernel: Out of memory: Kill process 18271 (java) score 1002 or sacrifice child
Jun 15 16:52:31 online-1-19-10-30740-b7pqf kernel: Killed process 18271 (java), UID 1000, total-vm:10065552kB, anon-rss:5265332kB
... (additional OOM entries omitted for brevity) ...</code>
Root Cause Analysis
The kubelet memory reservation of only 100 MiB combined with pod limits far exceeding the node’s physical memory caused the node to run out of memory when a pod was being terminated during a deployment, triggering an OOM event.
The Linux OOM killer selects victims based on oom_score. It first killed the daemonset-managed filebeat process, but the daemonset immediately recreated it, leading to another OOM cycle.
Continuous OOM‑killer activity kept the node’s load high, causing the kubelet PLEG (Pod Lifecycle Event Generator) to become unhealthy; after >3 minutes the node was marked NotReady by the control plane.
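The overcommit behind this failure is easy to see with back-of-the-envelope arithmetic. In the sketch below, only the 100 MiB reservation comes from the incident; the node capacity and the sum of pod limits are assumed numbers for illustration:

```shell
# Illustrative only: assumed node capacity and pod limits, not figures from the incident.
node_capacity_mib=16384          # assumed 16 GiB node
kube_reserved_mib=100            # the reservation actually configured (from the article)
sum_pod_limits_mib=24576         # assumed sum of pod memory limits scheduled on the node
allocatable_mib=$((node_capacity_mib - kube_reserved_mib))
overcommit_pct=$((100 * sum_pod_limits_mib / allocatable_mib))
echo "allocatable=${allocatable_mib}MiB limits=${sum_pod_limits_mib}MiB overcommit=${overcommit_pct}%"
```

With limits summing to well over allocatable memory, any transient spike, such as a pod needing extra memory during termination, can push the node past its physical capacity.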
Solution
Manually migrate business pods to other nodes, then restart the affected node. After reboot, CPU, memory, and load returned to normal and the node status became Ready.
Restart kubelet to clear the "PLEG is not healthy" error caused by container‑runtime communication failure.
Optimizations
Increase kubelet memory reservation to 1 GiB to prevent kubelet crashes from OOM.
Set explicit memory requests and limits for daemonsets (e.g., filebeat) to avoid uncontrolled memory consumption.
Adjust pod limits so they stay well within the node’s physical limits, preventing the node from operating near the OOM threshold.
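For the first optimization, raising the reservation is typically done through the kubelet's file-based configuration (`kubelet.config.k8s.io/v1beta1`). The exact values below are illustrative and should be sized to the node:

```yaml
# Fragment of a KubeletConfiguration; values are illustrative, not from the incident.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  memory: "1Gi"     # raised from the 100 MiB that let the node overcommit
  cpu: "500m"       # illustrative; size to the node
evictionHard:
  memory.available: "200Mi"   # illustrative eviction threshold
```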
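For the daemonset optimization, an explicit resources stanza on the filebeat container might look like the following; the sizes are placeholders to be tuned against observed usage:

```yaml
# Container resources for the filebeat daemonset; sizes are illustrative.
resources:
  requests:
    memory: "100Mi"
    cpu: "100m"
  limits:
    memory: "200Mi"   # an OOM kill now affects only this container, not the node
    cpu: "500m"
```

With a limit set, a runaway filebeat is killed and restarted within its own cgroup instead of pushing the whole node into the kernel OOM killer's path.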
Key Takeaways
When a pod is in "Terminating" it may need extra resources to finish cleanup; insufficient node memory can cause OOM during this phase.
Cluster resource utilization should leave headroom on each node, not just overall cluster usage, to handle transient spikes.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.