How to Troubleshoot Kubernetes NotReady Nodes: A Complete Step‑by‑Step Guide
This article walks Kubernetes operators through a systematic investigation of NotReady node symptoms, explaining the kubelet status mechanism, detailing each diagnostic step—from verifying node conditions with kubectl to checking kubelet, container runtime, resources, network, and certificates—and providing concrete remediation and preventive measures.
Background
In production clusters a node can suddenly become NotReady, causing scheduled Pods to stay Pending or be evicted. NotReady is only a symptom; underlying causes include kubelet crashes, container‑runtime failures, network outages, etc. The workflow targets intermediate Kubernetes operators and demonstrates a complete end‑to‑end troubleshooting process verified on Kubernetes 1.24+.
Core Knowledge: Kubelet Node‑Status Reporting
The kubelet heartbeats to the API server in two ways: it renews its node Lease every 10 s, and it reports full node status every nodeStatusReportFrequency (default 5 m) or immediately when a condition changes. If the controller manager sees no heartbeat within node-monitor-grace-period (default 40 s), the node controller sets the node's Ready condition to Unknown. Alongside Ready, the kubelet publishes several independent sub‑conditions:
Conditions:
Type Status
MemoryPressure False
DiskPressure False
PIDPressure False
NetworkUnavailable False
Ready True <-- calculated by kubelet
The Ready condition is computed by the kubelet itself from internal health checks (container runtime, network plugin, sync loops); the pressure conditions are independent signals. A node can therefore be NotReady even while every pressure condition is still False.
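A quick way to read these fields without scrolling through kubectl describe is a jsonpath query (a minimal sketch; NODE_NAME is a placeholder for the node under investigation):
#!/bin/bash
NODE_NAME=worker-1
# Print each condition's type, status, and last heartbeat time.
kubectl get node "$NODE_NAME" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastHeartbeatTime}{"\n"}{end}'
# The node Lease is the lightweight heartbeat; a stale renewTime means
# the kubelet has stopped reporting entirely.
kubectl get lease "$NODE_NAME" -n kube-node-lease -o jsonpath='{.spec.renewTime}{"\n"}'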
Overall Troubleshooting Flow
1. Confirm NotReady symptom and scope
└─ kubectl get nodes
└─ kubectl describe node NODE_NAME
2. Verify kubelet service
└─ systemctl status kubelet
└─ journalctl -u kubelet -n 200
3. Check container runtime (containerd/docker)
└─ systemctl status containerd
└─ crictl info
4. Inspect node resources (disk, memory, CPU)
└─ df -h, free -m, uptime
5. Test network connectivity
└─ ping the API server, curl /healthz, ip route
6. Examine certificate validity
└─ kubeadm certs check-expiration
└─ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
7. Locate root cause and apply fix
8. Validate recovery
└─ kubectl get node, run a test pod
9. Implement preventive measures (health‑check scripts, eviction thresholds, monitoring, cert renewal)
Step‑by‑Step Details
1. Confirm NotReady Phenomenon
List node statuses with kubectl get nodes -o wide. On a large cluster, filter for the problem nodes with kubectl get nodes | grep NotReady (note that grep -v Ready would also hide NotReady lines, since that string contains "Ready"). Then describe the affected node:
kubectl describe node NODE_NAME
In the Conditions section, identify which condition triggered NotReady (e.g., DiskPressure=True with message "low disk space"). Note that DiskPressure and Ready are independent; Ready=False may equally be caused by other failures such as runtime or network issues.
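A jsonpath filter keeps the output to just the problem nodes (a sketch; the Ready status can be False or Unknown, so match anything that is not True):
# List nodes whose Ready condition is not True (covers False and Unknown).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | awk '$2 != "True"'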
2. Check kubelet Process
Inspect the systemd unit:
# systemctl status kubelet
# journalctl -u kubelet -n 300 --no-pager
Common error categories (a triage sketch follows this list):
Container runtime connection failure – mismatched --container-runtime-endpoint or outdated /etc/containerd/config.toml.
Certificate issues – missing or expired client certificates.
API server connection timeout – the kubelet cannot reach the control plane (the kubelet talks to the API server, not to etcd directly; etcd failures surface as API server errors).
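A quick scan of the journal can sort the failure into one of these buckets (a minimal sketch; the grep patterns are illustrative, not exhaustive):
#!/bin/bash
# Look for common failure signatures in recent kubelet logs.
journalctl -u kubelet -n 500 --no-pager \
    | grep -iE 'connection refused|certificate has expired|x509|timed out|runtime' \
    | tail -20
# Confirm which container runtime endpoint the kubelet actually uses;
# newer kubelets may set this in config.yaml instead of a flag.
grep 'containerRuntimeEndpoint' /var/lib/kubelet/config.yaml 2>/dev/null
ps -ef | grep -o -- '--container-runtime-endpoint=[^ ]*' | head -1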
3. Verify Container Runtime
For containerd:
# systemctl status containerd
# journalctl -u containerd -n 200
# crictl info
# sudo crictl ps -a
For Docker, check docker info and ensure the cgroup driver matches the kubelet's (both systemd or both cgroupfs).
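To verify the driver match on a containerd node (a sketch assuming kubeadm default paths; adjust for your distribution):
# containerd side: should show SystemdCgroup = true for the systemd driver.
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# kubelet side: should show cgroupDriver: systemd to match.
grep -n 'cgroupDriver' /var/lib/kubelet/config.yaml
# After changing either file, restart both services in order.
sudo systemctl restart containerd && sudo systemctl restart kubelet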
4. Inspect Node Resources
Disk pressure is the most frequent cause. Example output:
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 95G 5G 95% / <-- critical
Kubelet's default DiskPressure eviction threshold is nodefs.available < 10%. Clean up logs, prune unused images (e.g., crictl rmi --prune), or expand storage.
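A typical cleanup pass looks like this (a sketch; review each command before running it on a production node):
#!/bin/bash
# Find the biggest consumers under the usual Kubernetes directories.
du -sh /var/log/pods /var/lib/containerd /var/lib/kubelet 2>/dev/null | sort -rh
# Shrink the systemd journal to reclaim space immediately.
journalctl --vacuum-size=500M
# Remove container images not referenced by any container.
crictl rmi --prune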
Memory pressure is reported when memory.available falls below the eviction threshold (100 Mi by default, not a percentage). Use free -m and ps aux --sort=-%mem | head to locate heavy processes. OOM kills of kubelet appear as:
[12345.678901] kubelet invoked oom‑killer: ...
[12345.678902] Memory cgroup out of memory: Killed process 12345 (kubelet)
CPU overload rarely triggers NotReady directly but can cause kubelet heartbeat timeouts.
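To confirm whether the kernel killed the kubelet, and how well it is protected (a sketch; most packaged kubelet units already set a negative OOM score):
# Look for recent OOM-killer activity in the kernel log.
dmesg -T | grep -iE 'out of memory|oom-killer|killed process' | tail -10
# The kubelet should carry a negative OOM score so it is killed last.
systemctl show kubelet -p OOMScoreAdjust
[ -n "$(pidof kubelet)" ] && cat /proc/$(pidof kubelet)/oom_score_adj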
5. Test Network Connectivity
Validate reachability to the API server:
# APISERVER=$(kubectl config view -o jsonpath='{.clusters[0].cluster.server}')
# curl -sk --max-time 5 ${APISERVER}/healthz
Check DNS resolution from inside a test pod (e.g., nslookup kubernetes.default.svc.cluster.local; Service names only resolve in-cluster, and ClusterIPs often do not answer ICMP) and the CNI plugin status (Flannel, Calico, etc.) using kubectl get pods -n kube-system and the relevant logs.
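A compact connectivity check run from the affected node (a sketch; it assumes the API server address is in the local kubeconfig and that a netcat supporting -z is installed):
#!/bin/bash
# Derive host:port of the API server from the local kubeconfig.
APISERVER=$(kubectl config view -o jsonpath='{.clusters[0].cluster.server}')
HOSTPORT=${APISERVER#https://}
# Raw TCP reachability, independent of TLS and authentication.
nc -zvw3 "${HOSTPORT%%:*}" "${HOSTPORT##*:}"
# Verify the default route and that the CNI interfaces exist.
ip route show default
ip -brief addr show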
6. Check Certificate Expiration
Kubelet certificates are valid for one year by default. Verify with:
# kubeadm certs check-expiration --cert-dir /etc/kubernetes/pki
# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
If expired, renew via kubeadm certs renew (the kubeadm alpha certs subcommand was removed in recent releases) or delete the old kubelet certificates and restart the kubelet so it re‑requests them.
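A quick expiry check of the rotated client certificate (a sketch assuming the default kubeadm path):
#!/bin/bash
CERT=/var/lib/kubelet/pki/kubelet-client-current.pem
# Show the validity window.
openssl x509 -in "$CERT" -noout -dates
# Exit non-zero if the certificate expires within 30 days (2592000 s).
if ! openssl x509 -in "$CERT" -noout -checkend 2592000; then
    echo "WARNING: kubelet client certificate expires within 30 days"
fi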
Common Root‑Cause Scenarios & Fixes
Disk space exhausted – clean logs, prune unused images, expand storage, and adjust evictionHard thresholds.
kubelet OOM killed – free memory, lower pod density, set a negative oom_score_adj for kubelet.
Runtime configuration mismatch – ensure SystemdCgroup=true in /etc/containerd/config.toml matches kubelet’s cgroupDriver: systemd, then restart both services.
etcd connectivity issues – verify etcd health with etcdctl endpoint health (see the sketch after this list) and restore the etcd cluster if needed.
Kubelet config errors – validate /var/lib/kubelet/config.yaml syntax (e.g., proper quoting of evictionHard values).
Kernel bugs or outdated version – upgrade the OS kernel or reinstall the node.
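For the etcd check, a minimal sketch assuming a kubeadm stacked-etcd control plane (adjust the endpoints and certificate paths for external etcd):
#!/bin/bash
# Query etcd health using the kubeadm-generated healthcheck client cert.
ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    endpoint health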
7. Verify Recovery
After remediation, confirm the node returns to Ready:
# kubectl get nodes -o wide
# kubectl describe node NODE_NAME | grep -A 20 "Conditions"
Check that previously evicted Pods are rescheduled and that business services (Deployment replicas, Services, Ingress health checks, DNS inside a test pod) are functional.
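To prove the node schedules and runs workloads again, pin a throwaway pod to it (a sketch; NODE_NAME is a placeholder):
#!/bin/bash
NODE_NAME=worker-1
# Run a short-lived pod pinned to the recovered node.
kubectl run ready-check --image=busybox:1.36 --restart=Never \
    --overrides="{\"spec\":{\"nodeName\":\"$NODE_NAME\"}}" \
    -- sh -c 'echo node OK && nslookup kubernetes.default'
# Wait for it to finish, inspect the output, then clean up.
kubectl wait pod/ready-check --for=jsonpath='{.status.phase}'=Succeeded --timeout=60s
kubectl logs ready-check
kubectl delete pod ready-check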
8. Preventive Measures & Daily Checks
Example health‑check script (run via cron) that reports non‑Ready nodes, disk pressure, memory pressure, and kubelet service status; it assumes passwordless SSH from the admin host to each node and a working kubectl context:
#!/bin/bash
# Nightly node health sweep: NotReady status, disk, memory, kubelet service.
NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
ALERT=""
for NODE in $NODES; do
    STATUS=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    if [ "$STATUS" != "True" ]; then
        ALERT+="\n[node NotReady] $NODE"
        kubectl describe node "$NODE" | grep -A 5 "Conditions" >> /var/log/k8s-health.log
    fi
    DISK=$(ssh "$NODE" "df -h / | tail -1 | awk '{print \$5}' | tr -d %")
    if [ "${DISK:-0}" -gt 85 ]; then
        ALERT+="\n[disk pressure] $NODE: $DISK%"
    fi
    MEM_AVAIL=$(ssh "$NODE" "free -m | awk 'NR==2{print \$7}'")
    MEM_TOTAL=$(ssh "$NODE" "free -m | awk 'NR==2{print \$2}'")
    if [ "${MEM_TOTAL:-0}" -gt 0 ]; then
        MEM_USAGE=$(( (MEM_TOTAL - MEM_AVAIL) * 100 / MEM_TOTAL ))
        if [ "$MEM_USAGE" -gt 90 ]; then
            ALERT+="\n[memory pressure] $NODE: $MEM_USAGE%"
        fi
    fi
    KUBELET=$(ssh "$NODE" "systemctl is-active kubelet")
    if [ "$KUBELET" != "active" ]; then
        ALERT+="\n[kubelet down] $NODE"
    fi
done
if [ -n "$ALERT" ]; then
    echo -e "K8s Node Health Alert:$ALERT" | tee -a /var/log/k8s-health.log
else
    echo "$(date): All nodes healthy" >> /var/log/k8s-health.log
fi
Additional long‑term safeguards:
Configure stricter evictionHard and evictionSoft thresholds in /var/lib/kubelet/config.yaml (see the sketch after this list).
Deploy Prometheus Node Exporter with Alertmanager rules for disk, memory, and kubelet health.
Schedule monthly certificate renewal via kubeadm certs renew all.
Set per‑node pod limits (maxPods in the kubelet config, or the legacy --max-pods flag) and enforce resource requests/limits on workloads.
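The eviction thresholds live in the kubelet config; the fragment below is illustrative only (the values are examples, not recommendations, and every evictionSoft entry needs a matching grace period):
#!/bin/bash
# Print an example eviction-threshold fragment; merge it into
# /var/lib/kubelet/config.yaml by hand, then restart the kubelet.
cat <<'EOF'
evictionHard:
  nodefs.available: "10%"
  memory.available: "300Mi"
evictionSoft:
  nodefs.available: "15%"
  memory.available: "500Mi"
evictionSoftGracePeriod:
  nodefs.available: "2m"
  memory.available: "2m"
EOF
# sudo systemctl restart kubelet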
By following this structured approach—observing the symptom, walking through layered checks, pinpointing the exact cause, applying the targeted fix, and finally establishing preventive automation—operators can resolve NotReady incidents efficiently and reduce recurrence.