
How to Troubleshoot Kubernetes NotReady Nodes: A Complete Step‑by‑Step Guide

This article walks Kubernetes operators through a systematic investigation of NotReady node symptoms, explaining the kubelet status mechanism, detailing each diagnostic step—from verifying node conditions with kubectl to checking kubelet, container runtime, resources, network, and certificates—and providing concrete remediation and preventive measures.


Background

In production clusters a node can suddenly turn NotReady, leaving new Pods stuck in Pending and eventually evicting the Pods already running on it. NotReady is only a symptom; underlying causes include kubelet crashes, container‑runtime failures, network outages, resource exhaustion, and expired certificates. This workflow targets intermediate Kubernetes operators and demonstrates a complete end‑to‑end troubleshooting process verified on Kubernetes 1.24+.

Core Knowledge: Kubelet Node‑Status Reporting

Kubelet updates its node status on the API server every nodeStatusUpdateFrequency (default 10 s). If the node controller receives no update for longer than its grace period (node-monitor-grace-period, default 40 s), it sets the node's Ready condition to Unknown; the kubelet itself sets Ready to False when its internal checks fail. Either way the node shows up as NotReady. kubectl describe node lists Ready alongside several other conditions:

Conditions:
  Type                 Status
  MemoryPressure       False
  DiskPressure         False
  PIDPressure          False
  NetworkUnavailable   False
  Ready                True   <-- set by the kubelet

The pressure conditions are independent signals: when one of them turns True the kubelet starts evicting Pods, but the node is reported NotReady only when the Ready condition is False or Unknown, for example because the kubelet has stopped posting status or its runtime and network checks are failing.
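A quick way to see the Ready condition across every node at once is a jsonpath query; a minimal sketch:

# print each node's Ready status and reason, one per line
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}{end}'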

Overall Troubleshooting Flow

1. Confirm NotReady symptom and scope
   └─ kubectl get nodes
   └─ kubectl describe node NODE_NAME
2. Verify kubelet service
   └─ systemctl status kubelet
   └─ journalctl -u kubelet -n 200
3. Check container runtime (containerd/docker)
   └─ systemctl status containerd
   └─ crictl info
4. Inspect node resources (disk, memory, CPU)
   └─ df -h, free -m, uptime
5. Test network connectivity
   └─ ping the API server, curl <apiserver>/healthz, ip route
6. Examine certificate validity
   └─ kubeadm certs check-expiration
   └─ openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
7. Locate root cause and apply fix
8. Validate recovery
   └─ kubectl get node, run a test pod
9. Implement preventive measures (health‑check scripts, eviction thresholds, monitoring, cert renewal)

Step‑by‑Step Details

1. Confirm NotReady Phenomenon

List node statuses with kubectl get nodes -o wide. If the cluster is large, filter for the affected nodes with kubectl get nodes | grep NotReady. Then describe the affected node:

kubectl describe node NODE_NAME

In the Conditions section, identify which condition points at the problem (e.g., DiskPressure=True with a message about low disk space). Note that the pressure conditions and Ready are independent signals; Ready=False may be caused by other failures such as runtime or network issues.
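Recent events recorded against the node often name the trigger directly; a minimal sketch (NODE_NAME is a placeholder):

# events involving the node, oldest first
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Node,involvedObject.name=NODE_NAME \
  --sort-by=.lastTimestamp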

2. Check kubelet Process

Inspect the systemd unit:

# systemctl status kubelet
# journalctl -u kubelet -n 300 --no-pager

Common error categories:

Container runtime connection failure – mismatched --container-runtime-endpoint or outdated /etc/containerd/config.toml.

Certificate issues – missing or expired client certificates.

etcd connection timeout – kubelet cannot reach the datastore.
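Filtering the kubelet journal for these signatures narrows the search quickly; a minimal sketch:

# surface runtime, certificate, and connection errors from the last half hour
journalctl -u kubelet --since "30 min ago" --no-pager \
  | grep -iE 'certificate|x509|cri|rpc error|connection refused|timeout'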

3. Verify Container Runtime

For containerd:

# systemctl status containerd
# journalctl -u containerd -n 200
# crictl info
# sudo crictl ps -a

For Docker (via cri-dockerd on 1.24+), check docker info and ensure the cgroup driver matches the kubelet's (both systemd or both cgroupfs).
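To compare the two cgroup drivers directly, the following sketch assumes a kubeadm-style layout (containerd config at /etc/containerd/config.toml, kubelet config at /var/lib/kubelet/config.yaml):

# containerd side: SystemdCgroup should be true when kubelet uses the systemd driver
grep -n 'SystemdCgroup' /etc/containerd/config.toml
# kubelet side: cgroupDriver should read systemd (or both sides cgroupfs)
grep -n 'cgroupDriver' /var/lib/kubelet/config.yaml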

4. Inspect Node Resources

Disk pressure is the most frequent cause. Example output:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   95G  5.0G  95% /            <-- critical

Kubelet’s default DiskPressure eviction threshold is nodefs.available < 10%. Clean up logs and unused images or expand storage; a few commands that usually help are sketched below.
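A minimal cleanup sketch (sizes and paths are illustrative; verify what is safe to delete in your environment):

# cap the systemd journal size
journalctl --vacuum-size=500M
# remove container images not referenced by any container (crictl >= 1.23)
crictl rmi --prune
# find the largest directories under the container and kubelet state paths
du -xh --max-depth=2 /var/lib/containerd /var/lib/kubelet 2>/dev/null | sort -rh | head -20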

Memory pressure is reported when available memory drops below the eviction threshold (memory.available < 100Mi by default). Use free -m and ps aux --sort=-%mem | head to locate heavy processes. OOM kills of the kubelet appear in the kernel log as:

[12345.678901] kubelet invoked oom‑killer: ...
[12345.678902] Memory cgroup out of memory: Killed process 12345 (kubelet)
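To confirm whether the kubelet or another critical process was OOM‑killed, search the kernel log; a minimal sketch:

# kernel messages with human-readable timestamps
dmesg -T | grep -iE 'out of memory|oom-killer' | tail -20
# the same information from the systemd journal
journalctl -k --since "2 hours ago" | grep -i 'killed process'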

CPU overload rarely triggers NotReady directly but can cause kubelet timeouts.

5. Test Network Connectivity

Validate reachability to the API server:

# APISERVER=$(kubectl config view -o jsonpath='{.clusters[0].cluster.server}')
# curl -sk --max-time 5 ${APISERVER}/healthz

Check cluster DNS from inside a test Pod (e.g., nslookup kubernetes.default) and the CNI plugin status (Flannel, Calico, etc.) using kubectl get pods -n kube-system and the relevant logs.
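A minimal sketch of both checks (the busybox tag and the Calico label are assumptions; adjust to your cluster):

# DNS resolution from inside the cluster
kubectl run dnscheck -it --rm --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default
# CNI DaemonSet pods and their recent logs (example label for Calico)
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
kubectl logs -n kube-system -l k8s-app=calico-node --tail=50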

6. Check Certificate Expiration

Kubelet certificates are valid for one year by default. Verify with:

# kubeadm certs check-expiration --cert-dir /etc/kubernetes/pki
# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

If expired, renew the kubeadm-managed certificates with kubeadm certs renew (the old kubeadm alpha certs subcommand has been removed), or delete the expired files under /var/lib/kubelet/pki and restart the kubelet so it re-requests them; keeping rotateCertificates: true in the kubelet configuration prevents the problem from recurring.
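A minimal sketch of the re-request path on a kubeadm node (it assumes a still-usable bootstrap-kubelet.conf or kubelet.conf; move rather than delete, so the step is reversible):

# confirm the client certificate dates
openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# set the expired certificates aside and let kubelet request new ones
mv /var/lib/kubelet/pki /var/lib/kubelet/pki.bak
systemctl restart kubelet
# a fresh kubelet-client-current.pem should appear shortly afterwards
ls -l /var/lib/kubelet/pki/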

Common Root‑Cause Scenarios & Fixes

Disk space exhausted – clean logs, prune unused images, expand storage, and adjust evictionHard thresholds.

kubelet OOM killed – free memory, lower pod density, set a negative oom_score_adj for kubelet.

Runtime configuration mismatch – ensure SystemdCgroup=true in /etc/containerd/config.toml matches kubelet’s cgroupDriver: systemd, then restart both services.

etcd connectivity issues – verify etcd health with etcdctl endpoint health and restore the etcd cluster if needed.

Kubelet config errors – validate /var/lib/kubelet/config.yaml syntax (e.g., proper quoting of evictionHard values).

Kernel bugs or outdated version – upgrade the OS kernel or reinstall the node.
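For the runtime-mismatch scenario above, a minimal fix sketch on a containerd node (back up config.toml first; the sed pattern assumes the default generated config):

# switch containerd to the systemd cgroup driver
sed -i.bak 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
# restart the runtime first, then the kubelet
systemctl restart containerd && systemctl restart kubelet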

7. Verify Recovery

After remediation, confirm the node returns to Ready:

# kubectl get nodes -o wide
# kubectl describe node NODE_NAME | grep -A 20 "Conditions"

Check that previously evicted Pods are rescheduled and that business services (Deployment replicas, Services, Ingress health checks, DNS inside a test pod) are functional.
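To confirm the node accepts new workloads, schedule a short-lived test Pod onto it; a minimal sketch (NODE_NAME and the image are placeholders):

# pin a test pod to the recovered node and watch it reach Running
kubectl run node-smoke-test --image=busybox:1.36 --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"NODE_NAME"}}' --command -- sleep 60
kubectl get pod node-smoke-test -o wide
kubectl delete pod node-smoke-test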

8. Preventive Measures & Daily Checks

Example health‑check script (run via cron) that reports non‑Ready nodes, disk pressure, memory pressure, and service status:

#!/bin/bash
# Node health check: reports NotReady nodes, disk pressure, memory pressure, and kubelet status.
# Assumes the node names returned by kubectl are SSH-reachable from this machine.
LOG=/var/log/k8s-health.log
ALERT=""
NODES=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
for NODE in $NODES; do
  # Ready condition as reported by the API server
  STATUS=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  if [ "$STATUS" != "True" ]; then
    ALERT+=$'\n'"[node NotReady] $NODE"
    kubectl describe node "$NODE" | grep -A 5 "Conditions" >> "$LOG"
  fi
  # Root filesystem usage in percent
  DISK=$(ssh "$NODE" "df -h / | tail -1 | awk '{print \$5}' | tr -d %")
  if [ "${DISK:-0}" -gt 85 ]; then
    ALERT+=$'\n'"[disk pressure] $NODE: ${DISK}%"
  fi
  # Memory usage derived from the 'available' column of free
  MEM_AVAIL=$(ssh "$NODE" "free -m | awk 'NR==2{print \$7}'")
  MEM_TOTAL=$(ssh "$NODE" "free -m | awk 'NR==2{print \$2}'")
  if [ "${MEM_TOTAL:-0}" -gt 0 ]; then
    MEM_USAGE=$(( (MEM_TOTAL - MEM_AVAIL) * 100 / MEM_TOTAL ))
    if [ "$MEM_USAGE" -gt 90 ]; then
      ALERT+=$'\n'"[memory pressure] $NODE: ${MEM_USAGE}%"
    fi
  fi
  # kubelet service state on the node
  KUBELET=$(ssh "$NODE" "systemctl is-active kubelet")
  if [ "$KUBELET" != "active" ]; then
    ALERT+=$'\n'"[kubelet down] $NODE"
  fi
done
if [ -n "$ALERT" ]; then
  echo "K8s Node Health Alert:$ALERT" | tee -a "$LOG"
else
  echo "$(date): All nodes healthy" >> "$LOG"
fi
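To run the script on a schedule, a crontab entry along these lines works (the path is an example):

# check node health every 5 minutes
*/5 * * * * /usr/local/bin/k8s-node-health.sh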

Additional long‑term safeguards:

Configure stricter evictionHard and evictionSoft thresholds in /var/lib/kubelet/config.yaml.

Deploy Prometheus Node Exporter with Alertmanager rules for disk, memory, and kubelet health.

Schedule monthly certificate renewal via kubeadm certs renew all.

Set per‑node pod limits (--max-pods, or maxPods in the kubelet config) and enforce resource requests/limits on workloads.
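For the eviction thresholds mentioned above, a minimal /var/lib/kubelet/config.yaml sketch (the values are illustrative; tune them to your node sizes and restart kubelet after editing):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "500Mi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m"
  nodefs.available: "2m"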

By following this structured approach—observing the symptom, walking through layered checks, pinpointing the exact cause, applying the targeted fix, and finally establishing preventive automation—operators can resolve NotReady incidents efficiently and reduce recurrence.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: monitoring, Kubernetes, Troubleshooting, containerd, etcd, kubelet, node health, NotReady
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
