Kubernetes Node Failures: One‑Stop Guide to Diagnose and Fix Common Issues
This comprehensive guide walks Kubernetes operators through a step‑by‑step process for diagnosing node health problems—such as NotReady, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable—by examining node conditions, reviewing events, checking system resources, inspecting component logs, applying targeted fixes, and verifying recovery, all illustrated with real‑world commands and examples.
Introduction
For operations engineers, the most dreaded event is a node suddenly becoming "NotReady," causing pod eviction, scheduling failures, and a flood of alerts. In large clusters, hardware aging, configuration drift, and resource leaks increasingly expose such issues. This article uses a 50‑node production Kubernetes 1.28 cluster as a backdrop and presents a systematic troubleshooting workflow that covers NotReady, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable scenarios.
1. Basic Knowledge: Node States and Component Roles
1.1 Node State Machine
Kubernetes nodes move through a lifecycle of NodeCreated → Registered → Ready, and can degrade to NotReady. After registration and a passing health check, the node is Ready and can receive pods. When a problem occurs, kubelet sets the Ready condition to False (displayed as NotReady); if heartbeats stop entirely, the node controller marks it Unknown. Draining a node does not change its Ready status; it only cordons the node (SchedulingDisabled) and evicts its pods.
Check the state with:
kubectl get node
1.2 Node Condition Types
Each node reports a set of conditions. Example output of kubectl describe node shows the status of MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, and Ready (plus ConfigOK on clusters that used the now-removed dynamic kubelet configuration). The meanings are:
MemoryPressure : True when available memory falls below the eviction threshold (default memory.available < 100Mi).
DiskPressure : True when available disk space or inodes fall below the eviction threshold (defaults nodefs.available < 10%, imagefs.available < 15%).
PIDPressure : True when the number of processes approaches the kernel limit (kernel.pid_max, commonly 32768).
NetworkUnavailable : True when the CNI plugin is not configured correctly.
Ready : True when the node is healthy and can receive pods.
ConfigOK : kubelet configuration is valid (reported only by the now-removed dynamic kubelet configuration feature).
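Acting on these conditions amounts to flagging any node whose Ready condition is not True, or whose pressure conditions are True. A minimal shell sketch, run here against hypothetical sample lines rather than a live cluster (in practice the input would come from a kubectl get node -o jsonpath query like the ones shown later):

```shell
# Flag unhealthy conditions from "node condition status" lines.
# The sample data piped in below is hypothetical.
flag_unhealthy() {
  awk '$2 == "Ready" && $3 != "True" { print $1 " unhealthy: " $2 "=" $3 }
       $2 != "Ready" && $3 == "True" { print $1 " unhealthy: " $2 "=" $3 }'
}

printf 'node-1 Ready True\nnode-2 MemoryPressure True\nnode-2 Ready False\n' \
  | flag_unhealthy
```

Only node-2's two problem conditions are printed; the healthy node-1 is silent.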
1.3 Key Component Responsibilities
kubelet : Core agent that registers the node, reports status, manages pod lifecycles, and triggers eviction when resources are low.
container runtime (containerd or dockerd): Pulls images and runs containers; if the runtime goes down, pods cannot start, and a prolonged outage causes kubelet to report the node NotReady.
kube-proxy : Maintains iptables/ipvs rules for Service load‑balancing; failures affect network connectivity but not node health.
CNI plugins (Flannel, Calico, Cilium, etc.): Provide pod networking; misconfiguration leads to NetworkUnavailable.
1.4 Node Troubleshooting Flow
List nodes and identify the unhealthy ones: kubectl get node -o wide.
Describe the problematic node to view Conditions and Events: kubectl describe node <node-name>.
If needed, SSH into the node for system‑level checks.
Perform resource, component, and log analysis based on the observed symptoms.
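The four steps above can be wrapped in a small checklist printer; the node name and the command list here are illustrative only, not an official tool:

```shell
# Print the ordered triage commands for a given node; a checklist helper
# only, it does not execute anything against the cluster.
node_triage() {
  node=$1
  printf 'Step 1: kubectl get node -o wide\n'
  printf 'Step 2: kubectl describe node %s\n' "$node"
  printf 'Step 3: ssh %s   # system-level checks\n' "$node"
  printf 'Step 4: journalctl -u kubelet -n 200 --no-pager   # on the node\n'
}

node_triage worker-01
```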
2. NotReady Diagnosis
2.1 Observation
List nodes with their latest condition status:
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[-1].type} {.status.conditions[-1].status}{"\n"}{end}'
2.2 Check Node Events
kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
2.3 Common Root Cause 1 – kubelet Crash
Verify kubelet service status:
systemctl status kubelet
If it is inactive or failed, view recent logs:
journalctl -u kubelet -n 200 --no-pager
Typical reasons:
Out‑of‑Memory (OOM) : Look for OOM killer messages in dmesg. Fix by increasing memory limits or reducing pod count.
Configuration errors : kubelet does not ship a standalone validate flag; check /var/lib/kubelet/config.yaml for syntax mistakes and watch journalctl -u kubelet for parse errors after a restart.
Certificate problems : Expired or corrupted TLS certs prevent API server communication.
After fixing, restart kubelet:
systemctl restart kubelet
2.4 Common Root Cause 2 – kubelet Certificate Expiry
Check certificate dates:
openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates
If expired, choose one of the following:
kubeadm clusters: renew control-plane certificates with kubeadm certs renew (kubeadm alpha certs renew on versions before 1.20), or regenerate the kubelet kubeconfig with
kubeadm init phase kubeconfig kubelet --node-name=<node>.
Manual renewal: backup certs, delete them, and restart kubelet to trigger re‑issuance.
Auto‑rotation: simply restart kubelet if the cluster is configured for auto‑renewal.
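Whichever path applies, it helps to know how many days a certificate has left. A helper can be sketched as follows; it assumes GNU date for parsing openssl's enddate string, and the demo generates a throwaway self-signed certificate instead of touching /var/lib/kubelet/pki:

```shell
# Days until the certificate in $1 expires (negative once expired).
# Assumes GNU date for parsing openssl's enddate output.
cert_days_left() {
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Demo on a throwaway 30-day self-signed cert, not a real kubelet cert.
tmpd=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 30 \
  -keyout "$tmpd/key.pem" -out "$tmpd/cert.pem" 2>/dev/null
cert_days_left "$tmpd/cert.pem"
```

Pointing the function at /var/lib/kubelet/pki/kubelet.crt on a node gives the real figure for alerting.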
2.5 Common Root Cause 3 – etcd Connectivity
Check component health: kubectl get componentstatuses (deprecated since 1.19; on newer clusters query the API server's /readyz endpoint or run etcdctl endpoint health instead). Inspect etcd logs for latency or I/O bottlenecks and use iostat to verify disk performance.
2.6 Common Root Cause 4 – Network Misconfiguration
Validate network interfaces, routes, and CNI plugin status (e.g., systemctl status flannel). Ensure CNI interfaces (flannel.1, cali*) are up and routes to the pod network exist.
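A first-pass sanity check of the CNI configuration can be scripted; the JSON layout follows the common conflist format, and the demo writes a temporary file rather than reading the node's real /etc/cni/net.d:

```shell
# Check that a CNI conflist exists, is non-empty, and declares a plugin
# type. A rough sanity check only, not full schema validation.
check_cni_conf() {
  f=$1
  [ -s "$f" ] || { echo "missing or empty: $f"; return 1; }
  grep -q '"type"' "$f" && echo "ok: plugin type declared" \
                        || { echo "no plugin type in $f"; return 1; }
}

# Demo against a temporary flannel-style conflist, not the real node config.
t=$(mktemp)
printf '{ "name": "cbr0", "plugins": [ { "type": "flannel" } ] }\n' > "$t"
check_cni_conf "$t"
```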
3. Resource‑Pressure Issues
3.1 MemoryPressure
When free memory falls below the threshold, kubelet sets MemoryPressure=True. Observe with:
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'
On the node, run free -h and ps aux --sort=-%mem | head -20 to locate heavy processes. Cleanup steps include stopping unnecessary services, draining the node (kubectl drain), or adding memory.
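The eviction decision itself is simple arithmetic. The sketch below mirrors the default memory.available < 100Mi rule against /proc/meminfo-style input; note that kubelet's real calculation uses the cgroup working set rather than MemAvailable:

```shell
# "yes" if available memory is below the default 100Mi eviction threshold.
# Reads /proc/meminfo-style lines from stdin; a simplification of kubelet's
# actual working-set based computation.
below_memory_threshold() {
  threshold_kib=$((100 * 1024))
  avail_kib=$(awk '/^MemAvailable:/ {print $2}')
  if [ "$avail_kib" -lt "$threshold_kib" ]; then echo yes; else echo no; fi
}

printf 'MemAvailable: 51200 kB\n' | below_memory_threshold   # 50Mi available
```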
3.2 DiskPressure
Check disk usage with df -h and inode usage with df -i. Identify large directories using du -sh /var/* | sort -rh | head -10. Clean up with docker system prune -af (Docker) or crictl rmi --prune (containerd; removes only unused images), plus log rotation and removal of old kernels.
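The nodefs.available < 10% rule can be checked the same way, with numbers taken from df -k; this is a sketch of the threshold logic, not kubelet's exact filesystem stats source:

```shell
# Report DiskPressure when available space drops below 10% of capacity.
# Arguments are KiB figures as printed by `df -k`.
disk_pressure() {
  avail_kib=$1; total_kib=$2
  avail_pct=$(( avail_kib * 100 / total_kib ))
  if [ "$avail_pct" -lt 10 ]; then
    echo "DiskPressure (${avail_pct}% free)"
  else
    echo "ok (${avail_pct}% free)"
  fi
}

disk_pressure 5000000 100000000   # 5% free on a hypothetical 100 GiB volume
```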
3.3 PIDPressure
Inspect PID usage:
cat /proc/sys/kernel/pid_max
ps aux | wc -l
If near the limit, find the processes with the most threads (ps -eo pid,ppid,comm,num_threads --sort=-num_threads | head -20) and either raise pid_max temporarily (echo 65536 > /proc/sys/kernel/pid_max, or sysctl -w kernel.pid_max=65536) or address the offending workload.
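The headroom check reduces to comparing the two numbers above; the 90% warning margin here is an illustrative choice, not a kubelet default:

```shell
# Warn when the process count uses 90% or more of kernel.pid_max.
# The 90% margin is an illustrative choice, not a kubelet default.
pid_headroom() {
  count=$1; pid_max=$2
  used_pct=$(( count * 100 / pid_max ))
  if [ "$used_pct" -ge 90 ]; then
    echo "PIDPressure risk (${used_pct}% of pid_max)"
  else
    echo "ok (${used_pct}% of pid_max)"
  fi
}

pid_headroom 30000 32768
```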
4. NetworkUnavailable Diagnosis
Query the condition:
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}'
Inspect CNI service status (e.g., systemctl status flannel) and configuration files under /etc/cni/net.d/. Restart the CNI daemon, verify the CNI interface (flannel.1, cali*), and test connectivity from a debug pod:
kubectl run -it --rm debug-pod --image=busybox --restart=Never -- sh
ping -c 3 8.8.8.8
ping -c 3 kubernetes.default.svc.cluster.local
5. Component‑Level Faults
5.1 kubelet Issues
Check service status, recent logs ( journalctl -u kubelet -n 200), and configuration ( /var/lib/kubelet/config.yaml). Common problems include API‑server connectivity, cgroup driver mismatches, and hitting maxPods. Resolve by fixing the API endpoint, aligning cgroup drivers, or increasing maxPods.
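The maxPods check is a straightforward comparison of the node's running pod count (from kubectl get pods --field-selector spec.nodeName=<node>) against the kubelet setting; a sketch using the common default of 110:

```shell
# Compare a node's running pod count against kubelet's maxPods setting
# (default 110). Inputs are plain integers gathered beforehand.
maxpods_check() {
  running=$1; max=$2
  if [ "$running" -ge "$max" ]; then
    echo "at maxPods limit"
  else
    echo "$(( max - running )) slots free"
  fi
}

maxpods_check 110 110
```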
5.2 containerd Runtime Problems
Verify systemctl status containerd and its socket /run/containerd/containerd.sock. Ensure the config file syntax is valid ( containerd config dump) and that the storage directory ( /var/lib/containerd) has enough space.
5.3 Docker Runtime Problems (if used)
Check Docker service, storage driver ( docker info | grep "Storage Driver"), and logs. Common issues are driver incompatibility and image‑pull failures.
5.4 kube-proxy Problems
Inspect the daemonset pods, logs, and iptables/ipvs rules. Restart the daemonset if rules are corrupted.
6. Quick‑Reference Command Cheat Sheet
6.1 Node Status
# List all nodes
kubectl get node -o wide
# Show all conditions
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[*]}{"\t"}{.type}:{.status} ({.message}){"\n"}{end}{end}'
# Node events sorted by time
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'
6.2 System Resource Checks
# CPU/Memory
top
# Memory details
free -h
# Disk space and inode
df -h
df -i
# Disk I/O
iostat -x 1 5
# Network
ss -s
netstat -tulnp
6.3 kubelet Diagnosis
# Service status
systemctl status kubelet
# Recent logs
journalctl -u kubelet -n 200 --no-pager
# Error lines only
journalctl -u kubelet -n 200 --no-pager | grep -iE "error|failed|warning"
# Config file
cat /var/lib/kubelet/config.yaml
# Certificate dates
openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates
6.4 Container Runtime Diagnosis
# containerd
crictl ps -a
crictl images
# Docker (if present)
docker ps -a
docker images
docker logs <container-id>
6.5 Network Diagnosis
# CNI config
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/10-flannel.conflist
# Interfaces
ip addr
# Routes
ip route
# Test connectivity
ping -c 3 8.8.8.8
curl -k https://<api-server>/healthz
6.6 Pod Troubleshooting
# Pods on a node
kubectl get pods -o wide --field-selector spec.nodeName=<node>
# Pod events and logs
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns>
# Exec into pod
kubectl exec -it <pod> -n <ns> -- sh
7. Risk Reminders
7.1 Drain Operations
Running kubectl drain evicts all pods and can cause service interruption. Ensure sufficient replicas, capacity on other nodes, and that the operation occurs during a low‑traffic window. Recommended flow:
# Cordon the node
kubectl cordon <node>
# Verify pods
kubectl get pods -o wide --field-selector spec.nodeName=<node>
# Drain when ready
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force
# After fix, uncordon
kubectl uncordon <node>
7.2 Restarting kubelet
Restarting kubelet briefly marks the node NotReady. Verify other nodes have capacity and that pods can tolerate a short outage before proceeding.
7.3 Deleting kubelet Certificates
Only delete certificates if the cluster’s CA can re‑issue them. Always back up /var/lib/kubelet/pki first.
cp -r /var/lib/kubelet/pki /var/lib/kubelet/pki.bak
7.4 Disk Cleanup
Never delete files required for node operation. Backup logs and verify mount points before removing data.
7.5 Kernel Parameter Changes
Changing /proc/sys or /etc/sysctl.conf can affect stability. Test changes in a staging environment and know how to revert.
8. Validation and Rollback
8.1 Node Recovery Verification
kubectl get node <node>
# Expect STATUS=Ready
kubectl describe node <node> | grep -A 20 "Conditions:"
8.2 Pod Recovery Verification
kubectl get pods -o wide --field-selector spec.nodeName=<node>
# All pods should be Running
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns>
8.3 Application Health Check
# Example health endpoint
curl -k https://<app-endpoint>/healthz
# Ensure no error logs
kubectl logs <app-pod> -n <ns> --tail=100 | grep -i error
8.4 Rollback Scenarios
Failed drain: kubectl uncordon <node>.
Wrong config change: restore backup and restart kubelet.
Accidental certificate deletion: restore from backup and restart kubelet.
9. Preventive Measures
9.1 Resource Reservations
Add system and kube reserves to /var/lib/kubelet/config.yaml to protect kubelet and system daemons:
systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
Restart kubelet afterwards.
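These reserves directly shrink the node's allocatable resources: roughly, allocatable = capacity − systemReserved − kubeReserved − eviction threshold. A sketch in MiB, using the default 100Mi hard eviction threshold:

```shell
# Approximate allocatable memory: capacity minus the two reserves minus
# the default 100 MiB hard eviction threshold. All values in MiB.
allocatable_mib() {
  capacity=$1; system_reserved=$2; kube_reserved=$3
  echo $(( capacity - system_reserved - kube_reserved - 100 ))
}

allocatable_mib 8192 1024 1024   # 8 GiB node with the reserves above
```

An 8 GiB node configured as above leaves roughly 6044 MiB allocatable for pods.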
9.2 Monitoring & Alerts
Key Prometheus alerts include node readiness, MemoryPressure, DiskPressure, PIDPressure, kubelet uptime, and certificate expiry. Example rule snippets are provided in the original article.
9.3 Log Management
Configure logrotate for kubelet, container runtime, and kube‑proxy logs to prevent uncontrolled growth.
/var/log/kubelet.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0644 root root
    postrotate
        systemctl reload kubelet > /dev/null 2>&1 || true
    endscript
}
9.4 Certificate Management
Enable automatic renewal (kubeadm) and regularly audit certificate expiry dates with openssl x509 -in … -noout -enddate.
9.5 Capacity Planning
Maintain ~20 % node capacity buffer, ensure critical workloads have at least three replicas on separate nodes, and avoid over‑committing a single node.
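The 20% buffer rule is easy to enforce in review scripts. The sketch below compares summed pod requests against node allocatable, both in millicores, with the 80% cutoff matching the guideline above:

```shell
# "buffer ok" while committed requests stay at or below 80% of allocatable.
# Inputs are millicore totals gathered beforehand (e.g., from kubectl).
capacity_ok() {
  requested_m=$1; allocatable_m=$2
  used_pct=$(( requested_m * 100 / allocatable_m ))
  if [ "$used_pct" -le 80 ]; then
    echo "buffer ok (${used_pct}% committed)"
  else
    echo "over-committed (${used_pct}% committed)"
  fi
}

capacity_ok 3000 4000
```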
9.6 Regular Inspections
Weekly checks: node health, resource usage, certificate dates, log sizes, long‑running pod memory trends, and alert rule functionality.
10. Summary
The core of node failure handling is a clear, ordered troubleshooting path: observe the condition, review events, examine resources, inspect component logs, pinpoint the root cause, apply the appropriate fix, and finally verify recovery. By following the “One‑look‑status, two‑look‑events, three‑look‑resources, four‑look‑components” mantra and documenting each step, operators can minimize downtime and continuously improve cluster reliability.