
Kubernetes Node Failures: One‑Stop Guide to Diagnose and Fix Common Issues

This comprehensive guide walks Kubernetes operators through a step‑by‑step process for diagnosing node health problems—such as NotReady, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable—by examining node conditions, reviewing events, checking system resources, inspecting component logs, applying targeted fixes, and verifying recovery, all illustrated with real‑world commands and examples.


Introduction

For operations engineers, the most dreaded event is a node suddenly becoming "NotReady," causing pod eviction, scheduling failures, and a flood of alerts. In large clusters, hardware aging, configuration drift, and resource leaks increasingly expose such issues. This article uses a 50‑node production Kubernetes 1.28 cluster as a backdrop and presents a systematic troubleshooting workflow that covers NotReady, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable scenarios.

1. Basic Knowledge: Node States and Component Roles

1.1 Node State Machine

Kubernetes nodes roughly follow the lifecycle Created → Registered → Ready → NotReady/Unknown. After registration the node is Ready and can receive pods. When a problem occurs, kubelet flips the Ready condition to False; if kubelet stops reporting altogether, the node controller marks it Unknown. A drained node is not shut down by the drain itself: it is cordoned (SchedulingDisabled) so no new pods are scheduled onto it.

Check the state with:

kubectl get node

1.2 Node Condition Types

Each node reports a set of conditions. The Conditions block of kubectl describe node shows the status of MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, and Ready (older kubelets could also surface a ConfigOK/KubeletConfigOk condition). The meanings are:

MemoryPressure : Available memory has dropped below the eviction threshold (default memory.available < 100Mi).

DiskPressure : Available disk space or inodes have dropped below the eviction threshold (defaults: nodefs.available < 10%, imagefs.available < 15%, nodefs.inodesFree < 5%); the values in effect can be read from the kubelet config, as shown in the sketch after this list.

PIDPressure : The number of PIDs in use is approaching the kernel pid_max (commonly 32768 by default).

NetworkUnavailable : CNI plugin not configured correctly.

Ready : Node can receive pods; True means healthy.

ConfigOK : kubelet configuration is valid.
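
The MemoryPressure and DiskPressure thresholds come from the kubelet's eviction configuration. A minimal sketch for reading the values actually in effect, assuming jq is available and your credentials can reach the node proxy subresource:

# Dump the running kubelet configuration for a node via the API server proxy
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq '.kubeletconfig.evictionHard'
# Or read the on-disk config directly on the node
grep -A 6 "evictionHard" /var/lib/kubelet/config.yaml   # prints nothing if the defaults are in use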

1.3 Key Component Responsibilities

kubelet : Core agent that registers the node, reports status, manages pod lifecycles, and triggers eviction when resources are low.

container runtime (containerd or dockerd): Pulls images and runs containers; failures break pod startup, and if the runtime stops responding entirely the kubelet also reports the node NotReady (its runtime/PLEG health checks fail).

kube-proxy : Maintains iptables/ipvs rules for Service load‑balancing; failures affect network connectivity but not node health.

CNI plugins (Flannel, Calico, Cilium, etc.): Provide pod networking; misconfiguration leads to NetworkUnavailable.
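
A quick per-node pass over these components, assuming containerd as the runtime and crictl installed:

# Quick per-node health check of the core components
systemctl is-active kubelet containerd          # both should print "active"
crictl info > /dev/null && echo "CRI runtime reachable"
ls /etc/cni/net.d/                              # at least one CNI config should be present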

1.4 Node Troubleshooting Flow

List nodes and identify the unhealthy ones: kubectl get node -o wide.

Describe the problematic node to view Conditions and Events: kubectl describe node <node-name>.

If needed, SSH into the node for system‑level checks.

Perform resource, component, and log analysis based on the observed symptoms.

2. NotReady Diagnosis

2.1 Observation

List nodes with their latest condition status:

kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[-1].type}{"\t"}{.status.conditions[-1].status}{"\n"}{end}'

2.2 Check Node Events

kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'
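
To cut the noise, the same query can be narrowed to Warning events only:

# Only Warning events for the node, newest last
kubectl get events \
  --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>,type=Warning \
  --sort-by='.lastTimestamp'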

2.3 Common Root Cause 1 – kubelet Crash

Verify the kubelet service status: systemctl status kubelet. If it is inactive or failed, view recent logs: journalctl -u kubelet -n 200 --no-pager. Typical reasons include:

Out‑of‑Memory (OOM) : Look for OOM killer messages in dmesg. Fix by increasing memory limits or reducing pod count.

Configuration errors : Check /var/lib/kubelet/config.yaml for YAML syntax errors or invalid fields; a misconfigured kubelet exits on startup and logs the offending option in journalctl.

Certificate problems : Expired or corrupted TLS certs prevent API server communication.

After fixing, restart kubelet:

systemctl restart kubelet
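
For the OOM case above, the kernel log is the authoritative source; a quick check:

# Look for OOM-killer activity in the kernel log
dmesg -T | grep -iE "out of memory|oom-killer" | tail -20
journalctl -k --since "2 hours ago" | grep -iE "out of memory|killed process" | tail -20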

2.4 Common Root Cause 2 – kubelet Certificate Expiry

Check certificate dates:

openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates

If expired, choose one of the following:

kubeadm clusters: renew control-plane certificates with kubeadm certs renew (the older kubeadm alpha certs renew form has been removed), or regenerate the kubelet kubeconfig with kubeadm init phase kubeconfig kubelet --node-name=<node>.

Manual renewal: back up the certs, delete them, and restart kubelet to trigger re‑issuance (see the sketch after this list).

Auto‑rotation: simply restart kubelet if the cluster is configured for auto‑renewal.
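
A minimal sketch of that manual renewal, assuming a kubeadm node that still has a valid /etc/kubernetes/bootstrap-kubelet.conf (or an otherwise valid kubelet.conf) so the kubelet can re-bootstrap its client certificate:

# Back up, remove the rotated client certs, and let kubelet re-issue them
cp -r /var/lib/kubelet/pki /var/lib/kubelet/pki.bak
rm /var/lib/kubelet/pki/kubelet-client*       # removes kubelet-client-current.pem and friends
systemctl restart kubelet
# Confirm a fresh certificate appeared
ls -l /var/lib/kubelet/pki/ | grep kubelet-client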

2.5 Common Root Cause 3 – etcd Connectivity

Check component health: kubectl get componentstatuses (deprecated since 1.19 but still readable). Inspect etcd logs for latency or I/O bottlenecks and use iostat to verify disk performance; the sketch below probes etcd directly.
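
A minimal sketch, assuming a kubeadm layout with etcd certificates under /etc/kubernetes/pki/etcd and etcd listening on 127.0.0.1:2379:

# Health of the local etcd member (kubeadm default paths)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
# Disk latency on the etcd volume; sustained high await points at an I/O bottleneck
iostat -x 1 5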

2.6 Common Root Cause 4 – Network Misconfiguration

Validate network interfaces, routes, and CNI plugin status (e.g., systemctl status flanneld when flannel runs as a systemd service, or the CNI DaemonSet pods otherwise). Ensure CNI interfaces (flannel.1, cali*) are up and routes to the pod network exist.
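
A few node-level checks; the 10.244.0.0/16 pod CIDR below is the flannel default and only an assumption, so substitute your cluster's range.

# CNI interface present and up? (flannel.1 for flannel, cali* for Calico)
ip -br addr | grep -E "flannel|cali|cni0"
# Routes toward the pod network exist?
ip route | grep 10.244.        # assumption: 10.244.0.0/16 pod CIDR
# CNI configuration actually rendered?
ls -la /etc/cni/net.d/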

3. Resource‑Pressure Issues

3.1 MemoryPressure

When free memory falls below the threshold, kubelet sets MemoryPressure=True. Observe with:

kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

On the node, run free -h and ps aux --sort=-%mem | head -20 to locate memory-heavy processes. Remediation includes stopping unnecessary services, draining the node (kubectl drain) so its pods reschedule elsewhere, or adding memory; the sketch below shows how to spot pods that have already been evicted.
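
A minimal sketch for spotting those evictions:

# Evicted pods end up in phase Failed with reason "Evicted"
kubectl get pods -A --field-selector=status.phase=Failed -o wide | grep -i evicted
# Eviction details for a specific pod
kubectl describe pod <pod> -n <ns> | grep -A 3 "Reason"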

3.2 DiskPressure

Check disk usage with df -h and inode usage with df -i. Identify large directories using du -sh /var/* | sort -rh | head -10. Clean up with docker system prune -af (Docker) or crictl rmi --prune (containerd, removes only unused images), log rotation, and removal of old kernels; a consolidated cleanup sketch follows.
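
A consolidated sketch for a containerd node; the 500M journal budget is only an illustrative value.

# Find the biggest offenders under /var first
du -xh /var --max-depth=2 2>/dev/null | sort -rh | head -15
# Remove only unused container images
crictl rmi --prune
# Shrink the systemd journal to a fixed budget
journalctl --vacuum-size=500M
# Re-check free space and inodes
df -h /var && df -i /var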

3.3 PIDPressure

Inspect PID usage:

cat /proc/sys/kernel/pid_max
ps aux | wc -l

If the count is near the limit, find the process spawning the most children or threads (ps -eo pid,ppid,comm,num_threads --sort=-num_threads | head -20) and either increase pid_max temporarily (echo 65536 > /proc/sys/kernel/pid_max) or address the offending workload. A persistent fix is shown in the sketch below.
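
A sketch of the persistent version, plus an optional per-pod PID cap via the kubelet's podPidsLimit field; the file name and the 4096 value are assumptions, not defaults.

# Persist a larger kernel PID limit across reboots
echo 'kernel.pid_max = 65536' > /etc/sysctl.d/99-pid-max.conf
sysctl --system
# Optionally limit PIDs per pod so one workload cannot exhaust the node
grep -n "podPidsLimit" /var/lib/kubelet/config.yaml || echo "podPidsLimit: 4096" >> /var/lib/kubelet/config.yaml
systemctl restart kubelet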

4. NetworkUnavailable Diagnosis

Query the condition:

kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="NetworkUnavailable")].status}{"\n"}{end}'

Inspect CNI service status (e.g., systemctl status flannel) and configuration files under /etc/cni/net.d/. Restart the CNI daemon, verify the CNI interface (flannel.1, cali*), and test connectivity from a debug pod:

kubectl run -it --rm debug-pod --image=busybox --restart=Never -- sh
# inside the pod's shell:
ping -c 3 8.8.8.8                                # external connectivity
nslookup kubernetes.default.svc.cluster.local    # cluster DNS; pinging a Service ClusterIP usually fails even on a healthy cluster
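
If the CNI plugin runs as a DaemonSet rather than a systemd unit, restart it through the API instead. The namespaces and DaemonSet names below follow the upstream Calico and flannel manifests and may differ in your installation.

# Calico (upstream manifest)
kubectl -n kube-system rollout restart daemonset calico-node
# flannel (upstream manifest; older installs place it in kube-system)
kubectl -n kube-flannel rollout restart daemonset kube-flannel-ds
# Watch the pods come back on the affected node
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=<node-name> -w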

5. Component‑Level Faults

5.1 kubelet Issues

Check service status, recent logs (journalctl -u kubelet -n 200), and configuration (/var/lib/kubelet/config.yaml). Common problems include API-server connectivity, cgroup driver mismatches, and hitting the maxPods limit. Resolve by fixing the API endpoint, aligning the cgroup drivers (a quick check is sketched below), or increasing maxPods.
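
For the cgroup-driver mismatch in particular, both sides are easy to compare; this sketch assumes containerd with the runc shim.

# kubelet's view
grep -i "cgroupDriver" /var/lib/kubelet/config.yaml
# containerd's view: SystemdCgroup = true corresponds to cgroupDriver: systemd
containerd config dump | grep -i "SystemdCgroup"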

5.2 containerd Runtime Problems

Verify systemctl status containerd and its socket /run/containerd/containerd.sock. Ensure the config file syntax is valid (containerd config dump) and that the storage directory (/var/lib/containerd) has enough space.

5.3 Docker Runtime Problems (if used)

Check the Docker service, storage driver (docker info | grep "Storage Driver"), and logs. Common issues are storage-driver incompatibility and image-pull failures.

5.4 kube-proxy Problems

Inspect the DaemonSet pods, their logs, and the iptables/ipvs rules. If the rules are corrupted, restart the kube-proxy DaemonSet (example below).
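
A sketch of that restart for a kubeadm-style cluster, where kube-proxy runs as a DaemonSet labelled k8s-app=kube-proxy in kube-system:

# Inspect the kube-proxy pods and their recent logs
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50
# Rebuild the rules by restarting the DaemonSet
kubectl -n kube-system rollout restart daemonset kube-proxy
# Sanity-check that Service rules were reprogrammed
iptables-save | grep -c KUBE-SERVICES    # or: ipvsadm -Ln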

6. Quick‑Reference Command Cheat Sheet

6.1 Node Status

# List all nodes
kubectl get node -o wide
# Show all conditions
kubectl get node -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .status.conditions[*]}{"\t"}{.type}: {.status} ({.message}){"\n"}{end}{end}'
# Node events sorted by time
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'

6.2 System Resource Checks

# CPU/Memory
top
# Memory details
free -h
# Disk space and inode
df -h
df -i
# Disk I/O
iostat -x 1 5
# Network
ss -s
netstat -tulnp

6.3 kubelet Diagnosis

# Service status
systemctl status kubelet
# Recent logs
journalctl -u kubelet -n 200 --no-pager
# Error lines only
journalctl -u kubelet -n 200 --no-pager | grep -iE "error|failed|warning"
# Config file
cat /var/lib/kubelet/config.yaml
# Certificate dates
openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -noout -dates

6.4 Container Runtime Diagnosis

# containerd
crictl ps -a
crictl images
# Docker (if present)
docker ps -a
docker images
docker logs <container-id>

6.5 Network Diagnosis

# CNI config
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/10-flannel.conflist
# Interfaces
ip addr
# Routes
ip route
# Test connectivity
ping -c 3 8.8.8.8
curl -k https://<api-server>/healthz

6.6 Pod Troubleshooting

# Pods on a node
kubectl get pods -o wide --field-selector spec.nodeName=<node>
# Pod events and logs
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns>
# Exec into pod
kubectl exec -it <pod> -n <ns> -- sh

7. Risk Reminders

7.1 Drain Operations

Running kubectl drain evicts all pods and can cause service interruption. Ensure sufficient replicas, capacity on other nodes, and that the operation occurs during a low‑traffic window. Recommended flow:

# Cordon the node
kubectl cordon <node>
# Verify pods
kubectl get pods -o wide --field-selector spec.nodeName=<node>
# Drain when ready
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force
# After fix, uncordon
kubectl uncordon <node>

7.2 Restarting kubelet

Restarting kubelet can briefly mark the node NotReady if the restart takes longer than the node-monitor grace period (40 s by default). Verify other nodes have capacity and that pods can tolerate a short status blip before proceeding.

7.3 Deleting kubelet Certificates

Only delete certificates if the cluster’s CA can re‑issue them. Always back up /var/lib/kubelet/pki first.

cp -r /var/lib/kubelet/pki /var/lib/kubelet/pki.bak

7.4 Disk Cleanup

Never delete files required for node operation. Backup logs and verify mount points before removing data.

7.5 Kernel Parameter Changes

Changing /proc/sys or /etc/sysctl.conf can affect stability. Test changes in a staging environment and know how to revert.

8. Validation and Rollback

8.1 Node Recovery Verification

kubectl get node <node>
# Expect STATUS=Ready
kubectl describe node <node> | grep -A 20 "Conditions:"

8.2 Pod Recovery Verification

kubectl get pods -o wide --field-selector spec.nodeName=<node>
# All pods should be Running
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns>

8.3 Application Health Check

# Example health endpoint
curl -k https://<app-endpoint>/healthz
# Ensure no error logs
kubectl logs <app-pod> -n <ns> --tail=100 | grep -i error

8.4 Rollback Scenarios

Failed drain: kubectl uncordon <node>.

Wrong config change: restore backup and restart kubelet.

Accidental certificate deletion: restore from backup and restart kubelet.

9. Preventive Measures

9.1 Resource Reservations

Add system and kube reserves to /var/lib/kubelet/config.yaml to protect kubelet and system daemons:

systemReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi
kubeReserved:
  cpu: 500m
  memory: 1Gi
  ephemeral-storage: 1Gi

Restart kubelet afterwards.

9.2 Monitoring & Alerts

Key Prometheus alerts include node readiness, MemoryPressure, DiskPressure, PIDPressure, kubelet uptime, and certificate expiry; a minimal rules sketch follows.
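
A sketch of the first two alerts, assuming kube-state-metrics is scraped (it exports kube_node_status_condition) and that your Prometheus loads rule files from /etc/prometheus/rules/; both the metric source and the path are assumptions to adapt.

# Write a small rule file (adjust the path to your Prometheus rule directory)
cat <<'EOF' > /etc/prometheus/rules/node-health.rules.yml
groups:
- name: node-health
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
  - alert: NodeMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} is reporting MemoryPressure"
EOF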

9.3 Log Management

Configure logrotate for kubelet, container runtime, and kube-proxy logs to prevent uncontrolled growth. Note that on systemd hosts kubelet normally logs to journald; a file such as /var/log/kubelet.log only exists if kubelet (or a wrapper) is configured to write one, so adjust the path below to match your setup.

/var/log/kubelet.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    create 0644 root root
    postrotate
        systemctl reload kubelet > /dev/null 2>&1 || true
    endscript
}

9.4 Certificate Management

Enable automatic renewal (kubeadm) and regularly audit certificate expiry dates with openssl x509 -in … -noout -enddate.

9.5 Capacity Planning

Maintain ~20 % node capacity buffer, ensure critical workloads have at least three replicas on separate nodes, and avoid over‑committing a single node.

9.6 Regular Inspections

Weekly checks: node health, resource usage, certificate dates, log sizes, long‑running pod memory trends, and alert rule functionality.

10. Summary

The core of node failure handling is a clear, ordered troubleshooting path: observe the condition, review events, examine resources, inspect component logs, pinpoint the root cause, apply the appropriate fix, and verify recovery. Following the "status first, events second, resources third, components fourth" habit and documenting each step lets operators minimize downtime and continuously improve cluster reliability.
