Kubernetes Outage? Essential Troubleshooting Guide for Production Clusters

A comprehensive, step‑by‑step guide that explains the most common Kubernetes failure scenarios—from pod crashes and image pull errors to node NotReady and API server timeouts—provides concrete kubectl commands, diagnostic scripts, real‑world case studies, best‑practice recommendations, monitoring metrics, and backup‑restore procedures to keep production clusters healthy.

Raymond Ops

Overview

Kubernetes clusters frequently encounter failures such as pods that cannot start, services that are unreachable, nodes that become NotReady, and control‑plane components that stop responding. This guide organizes the troubleshooting workflow into a clear "phenomenon → location → solution" pattern and focuses on practical commands and scripts rather than theory.

Layered troubleshooting methodology

Node layer – check kubelet, disk, memory, PID pressure.

Control‑plane layer – API server, etcd, scheduler, controller‑manager.

Workload layer – pod status, container logs, readiness/liveness probes.

Network layer – Service, Endpoint, DNS, CNI plugins.

Common failure scenarios

Pod failures

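CrashLoopBackOff : the container exits repeatedly and the kubelet waits longer before each restart (the back-off doubles from 10 seconds up to a 5-minute cap). Use kubectl describe pod <pod> and kubectl logs <pod> --previous to view the last crash log, then check the exit code: 137 usually means the container was killed (often OOMKilled), while 1 points to an application error.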

ImagePullBackOff : the image name or tag is wrong, registry credentials are missing, or the registry is unreachable from the node. Diagnose with kubectl describe pod <pod> and kubectl get events, then run crictl pull <image> on the affected node to reproduce the pull. Fix by correcting the image reference, creating the proper imagePullSecrets, or restoring network connectivity.

Pending : the pod cannot be scheduled, so it has no node yet. Common reasons include insufficient CPU or memory, node taints without matching tolerations, or unsatisfied node affinity. The FailedScheduling events in kubectl describe pod <pod> state the reason; compare the pod's requests with what each node can offer using kubectl get nodes -o jsonpath='{.items[*].status.allocatable}'.

# Example commands for pod troubleshooting
kubectl get pod my-pod -o wide
kubectl describe pod my-pod
kubectl logs my-pod --previous
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
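
For the "check exit codes" step in the CrashLoopBackOff entry, a small lookup helper can save a trip to the documentation. The following is a minimal sketch, not part of the original guide: the function name explain_exit_code is hypothetical, and the hints reflect common shell and signal conventions (128 + signal number), which an application is free to ignore.

```shell
#!/usr/bin/env bash
# explain_exit_code CODE
# Prints a short hint for a container exit code. The hints are common
# conventions, not guarantees; always confirm against the container logs.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "generic application error; read the logs from the previous run" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check the image and the command/args fields" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled, compare memory limits" ;;
    139) echo "segmentation fault (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15); usually a normal shutdown request" ;;
    *)
      # Codes above 128 conventionally mean "terminated by signal (code - 128)"
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi
      ;;
  esac
}

# Example (requires cluster access, so it is left commented out):
# code=$(kubectl get pod my-pod \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
# explain_exit_code "$code"
```

The jsonpath in the commented example reads the first container only; a multi-container pod would need the index adjusted.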

Network failures

Service unreachable : verify the Service and its Endpoints exist, ensure selector matches pod labels, and confirm targetPort matches the container port. Test from inside the cluster with a temporary curl pod.

DNS resolution failure : check CoreDNS pod health, inspect the CoreDNS ConfigMap (named coredns in kube-system on kubeadm clusters), and verify each pod's /etc/resolv.conf. High query volume can overload CoreDNS; scaling the deployment or enabling NodeLocal DNSCache mitigates the issue.

# Service check
kubectl get svc my-svc -n my-ns
kubectl get endpoints my-svc -n my-ns
kubectl run test-curl --rm -it --restart=Never --image=curlimages/curl -- curl -v http://my-svc.my-ns.svc:80/healthz

# DNS check
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
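
An empty Endpoints list most often comes down to the selector check mentioned under "Service unreachable". The sketch below illustrates that comparison only; the function name selector_matches and the comma-separated key=value input format are assumptions made for this example. In practice kubectl get pods -l <selector> lets the API server do the same matching.

```shell
#!/usr/bin/env bash
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Returns 0 when every selector pair is present in LABELS, 1 otherwise,
# and prints each missing pair to stderr.
selector_matches() {
  local selector="$1"
  local labels=",$2,"   # pad with commas so whole pairs can be matched
  local pair
  local rc=0
  if [ -z "$selector" ]; then
    echo "empty selector" >&2
    return 1
  fi
  local IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                                  # pair is present
      *) echo "missing label: $pair" >&2; rc=1 ;;      # pair is absent
    esac
  done
  return "$rc"
}

# Example with literal values:
# selector_matches "app=web,tier=frontend" "app=web,tier=backend"
```

kubectl get pods --show-labels prints pod labels in the comma-separated form this helper expects, so its LABELS column can be pasted in directly.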

Node failures

NotReady : inspect node conditions with kubectl describe node <node>. Common causes are a stopped kubelet, memory pressure, disk pressure, PID pressure, or a failed network plugin.

DiskPressure : run df -h on the node, locate large directories (e.g., /var/log), and clean up or expand storage.

# Node status
kubectl describe node node-1 | grep -A 20 'Conditions:'
# Disk usage
ssh node-1 df -h
# Quote the remote command so the glob and the pipeline run on the node, not locally
ssh node-1 'du -sh /var/log/* | sort -rh | head -5'
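
Reading df output by eye does not scale past a few nodes. As a rough sketch, the filter below flags filesystems above a usage threshold. The function name flag_full_filesystems and the default of 85 percent are assumptions for illustration, not kubelet settings, and the parser expects the POSIX layout produced by df -P.

```shell
#!/usr/bin/env bash
# flag_full_filesystems [THRESHOLD]
# Reads `df -P` output on stdin and prints "<mount> <used>%" for every
# filesystem whose usage is at or above THRESHOLD percent (default 85,
# an illustrative value).
flag_full_filesystems() {
  local threshold="${1:-85}"
  awk -v t="$threshold" '
    NR > 1 {                      # skip the header line
      used = $5
      sub(/%$/, "", used)         # "91%" -> "91"
      if (used + 0 >= t + 0) print $6, used "%"
    }'
}

# Example (requires SSH access to the node, so it is left commented out):
# ssh node-1 df -P | flag_full_filesystems 85
```

Mount points that contain spaces would be truncated by this simple field split; that is an accepted limitation of the sketch.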

Control‑plane failures

API Server unresponsive : check the kube-apiserver pod status (on the control-plane node with crictl if kubectl itself cannot connect), verify that port 6443 is listening, and test etcd health.

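Certificate expiration : run kubeadm certs check-expiration to list expiry dates, then renew with kubeadm certs renew all and restart the control-plane static pods (kube-apiserver, kube-controller-manager, kube-scheduler, etcd) so they load the new certificates.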

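# API Server check (works only while the API server still answers)
kubectl get pods -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system -l component=kube-apiserver --tail=50
# If kubectl cannot connect, inspect the container on the control-plane node
crictl ps -a | grep kube-apiserver
ss -tlnp | grep 6443

# Certificate check
kubeadm certs check-expiration
# Renew only after reviewing the expiry list, then restart the control-plane static pods
kubeadm certs renew all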
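
On nodes where kubeadm is not available, the expiry date can be read straight from a certificate file. The helper below is a sketch under two assumptions: the name cert_days_left is hypothetical, and it needs GNU date, whose -d option parses the timestamp printed by openssl x509 -noout -enddate.

```shell
#!/usr/bin/env bash
# cert_days_left NOT_AFTER [NOW_EPOCH]
# NOT_AFTER is the line printed by `openssl x509 -noout -enddate`,
# for example "notAfter=Jan 11 00:00:00 2030 GMT".
# Prints the remaining whole days, truncated toward zero.
# NOW_EPOCH defaults to the current time; passing it makes runs repeatable.
cert_days_left() {
  local not_after="${1#notAfter=}"   # strip the "notAfter=" prefix if present
  local now="${2:-$(date +%s)}"
  local expiry
  expiry=$(date -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Example (requires access to the control-plane node's PKI directory):
# for crt in /etc/kubernetes/pki/*.crt; do
#   echo "$crt: $(cert_days_left "$(openssl x509 -noout -enddate -in "$crt")") days"
# done
```

A weekly cron job could compare the printed value against a warning threshold such as 30 days; that threshold is a suggestion, not something the original guide specifies.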

Diagnostic scripts

The guide provides ready‑to‑use Bash scripts for cluster health checks, pod‑level diagnosis, and node‑resource inspection. They output colored summaries, list abnormal pods, highlight high‑restart pods, and report etcd size.

#!/bin/bash
# k8s-health-check.sh – full cluster health report
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m'
echo "=== Node status ==="
kubectl get nodes
# ... additional sections omitted for brevity ...
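
The omitted sections are not reproduced here. As an independent sketch of what "list abnormal pods, highlight high-restart pods" could look like, the filter below works on the plain-text output of kubectl get pods -A --no-headers. The function name list_unhealthy_pods and the default restart threshold of 5 are assumptions for illustration.

```shell
#!/usr/bin/env bash
# list_unhealthy_pods [MAX_RESTARTS]
# Reads `kubectl get pods -A --no-headers` output on stdin. Prints
# "<namespace>/<name> <status> restarts=<n>" for pods that are not
# Running or Completed, or whose restart count exceeds MAX_RESTARTS
# (default 5, an illustrative value).
list_unhealthy_pods() {
  local max_restarts="${1:-5}"
  awk -v max="$max_restarts" '
    {
      # Columns: NAMESPACE NAME READY STATUS RESTARTS [(last restart)] AGE
      ns = $1; name = $2; status = $4; restarts = $5 + 0
      if ((status != "Running" && status != "Completed") || restarts > max)
        printf "%s/%s %s restarts=%d\n", ns, name, status, restarts
    }'
}

# Example (requires cluster access, so it is left commented out):
# kubectl get pods -A --no-headers | list_unhealthy_pods 5
```

A pod that is Running but not Ready (for example READY 0/1) passes this filter unnoticed, so a fuller check would also compare the two numbers in the READY column.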

Case studies

Case 1 – Massive eviction : a production namespace experienced 30% pod eviction because a single container wrote a 78 GB log file, filling the node’s disk. The solution was to delete evicted pods, enable kubelet log rotation, and set ephemeral-storage requests/limits.

Case 2 – Intermittent DNS failures : 5 % of requests returned "Name or service not known". CoreDNS CPU usage was near its limit, so the deployment was scaled from 2 to 5 replicas and CPU limits were increased. Enabling NodeLocal DNSCache further reduced latency.

Best practices and precautions

Set realistic resources.requests and resources.limits (requests ≈ 80 % of observed usage, limits 1.5‑2× requests).

Configure PodDisruptionBudget to protect services during node maintenance.

Adjust terminationGracePeriodSeconds for applications that need longer shutdown windows.

Regularly audit certificates and automate weekly expiration checks.

Back up etcd daily and retain at least seven snapshots; test restore procedures regularly.

Restrict kubectl RBAC permissions to the minimum required for each team.
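
To make the sizing rule of thumb in the first item concrete, here is the arithmetic for a container observed at 500 millicores. This is a sketch: the function name suggest_cpu is hypothetical, and the observed figure would come from your own metrics (for example kubectl top pod).

```shell
#!/usr/bin/env bash
# suggest_cpu OBSERVED_MILLICORES
# Applies the rule of thumb with integer arithmetic:
# request = 80% of observed usage, limit between 1.5x and 2x the request.
suggest_cpu() {
  local observed="$1"
  local request=$(( observed * 80 / 100 ))
  local limit_low=$(( request * 3 / 2 ))
  local limit_high=$(( request * 2 ))
  echo "request=${request}m limit=${limit_low}m-${limit_high}m"
}

suggest_cpu 500   # prints: request=400m limit=600m-800m
```

The same proportions can be applied to memory, although memory limits are usually set more conservatively because exceeding them kills the container rather than throttling it.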

Monitoring and alerting

Key metrics include node CPU/memory usage, etcd database size, API server request latency, and CoreDNS query latency. Example Prometheus rules are provided for NodeNotReady, high CPU/memory, disk pressure, pod crash loops, OOM kills, and etcd disk latency.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-cluster-alerts
  namespace: monitoring
spec:
  groups:
  - name: k8s-node.rules
    rules:
    - alert: NodeNotReady
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.node }} NotReady for >5m"
    - alert: NodeHighCPU
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.instance }} CPU >85%"
# ... additional rules omitted ...

Backup and recovery

Daily etcd snapshots are created with etcdctl snapshot save. The script stores snapshots under /opt/etcd-backup, logs success, and purges files older than seven days. To restore, stop the control-plane static pods by moving their manifests out of the kubelet's manifest directory, move the current etcd data directory aside, restore the snapshot into a fresh data directory, and then put the manifests back so the control-plane pods come back online.

#!/bin/bash
# etcd-daily-backup.sh
export ETCDCTL_API=3   # needed by older etcdctl releases; harmless on newer ones
ETCD_ENDPOINTS="https://127.0.0.1:2379"
BACKUP_DIR="/opt/etcd-backup"
SNAPSHOT="$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"

# Make sure the target directory exists before writing to it
mkdir -p "$BACKUP_DIR" || exit 1

if etcdctl --endpoints="$ETCD_ENDPOINTS" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "$SNAPSHOT"; then
  echo "Backup succeeded: $SNAPSHOT"
  # Delete snapshots last modified more than seven days ago
  find "$BACKUP_DIR" -name 'etcd-*.db' -mtime +7 -delete
else
  echo "Backup failed!" >&2
  exit 1
fi
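
Age-based purging has one weakness: if backups silently stop, it eventually deletes every snapshot. A count-based alternative keeps the newest N files regardless of age, which matches the "retain at least seven snapshots" recommendation more directly. The sketch below is not part of the original script: the function name prune_snapshots is hypothetical, and it relies on the etcd-YYYYmmdd-HHMMSS.db naming so that sorting by name equals sorting by time.

```shell
#!/usr/bin/env bash
# prune_snapshots DIR [KEEP]
# Deletes etcd-*.db files in DIR except the KEEP newest (default 7).
# Relies on timestamped file names, which the shell glob returns in
# ascending (oldest first) order.
prune_snapshots() {
  local dir="$1"
  local keep="${2:-7}"
  local files=()
  local f
  for f in "$dir"/etcd-*.db; do
    # Skip the unexpanded pattern when no snapshot exists
    if [ -e "$f" ]; then
      files+=("$f")
    fi
  done
  local excess=$(( ${#files[@]} - keep ))
  local i
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${files[i]}"
    echo "removed ${files[i]}"
  done
}

# Example (left commented out because it deletes files):
# prune_snapshots /opt/etcd-backup 7
```

Calling this in place of the find ... -delete line keeps the rest of the backup script unchanged.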

Conclusion

The layered approach—starting from node health, moving up through the control plane, then workloads and network—provides the fastest path to root cause identification. Regular health‑check scripts, proactive monitoring, and disciplined backup/restore practices are essential to keep Kubernetes clusters reliable and to reduce MTTR during incidents.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
