
Kubernetes Troubleshooting Handbook: Diagnose Pods, Nodes & Clusters Fast

This handbook provides Kubernetes operators with a comprehensive, step‑by‑step troubleshooting framework covering common Pod issues, Node problems, and cluster‑wide failures, offering practical commands, diagnostic tips, and explanations of error states to quickly identify and resolve stability challenges in K8s environments.

Linux Cloud Computing Practice

This Kubernetes operations troubleshooting handbook is a practical guide that helps operators quickly locate and resolve issues in Kubernetes clusters.

1. Pod‑related Issues

Cannot start: Use kubectl describe pod, kubectl logs, and kubectl get events to inspect the pod's status, container logs, and related events.

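Cannot connect to other services: Inside the pod, use ping or telnet to test network connectivity, and check NetworkPolicy and target service configurations. A Service ClusterIP may not answer ping even when the service is healthy, so a TCP test against the service port is the more reliable check.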

Slow or abnormal operation: Run kubectl top pod to view CPU and memory usage (this requires the metrics-server add-on), and inspect processes or logs inside the container.

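Unschedulable on a node: Review the scheduling events shown by kubectl describe pod, and verify the node's available resources, labels, and taints against the pod's requests, node selectors, and tolerations.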

Persistent Pending state: Check pod status and events, confirm manifest correctness, and verify resource quotas, volumes, and scheduling policies.

Cannot access external services: Examine pod DNS settings, Service definitions, network permissions, and network policies.

Pod exits immediately or fails to run the application: Review events and logs, and verify image, environment variables, entry scripts, configuration files, and dependencies.

Service unreachable: Inspect CoreDNS, DNS config, service ports, service‑pod bindings, pod health, CNI components, kube‑proxy rules, or IPVS routing.
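
As an illustration of the "Cannot start" checks above, the sketch below collects the first-look kubectl commands for a pod into one place. The function name pod_triage_commands and the example pod and namespace are hypothetical; the function only builds argument lists and does not contact a cluster.

```python
from typing import List


def pod_triage_commands(pod: str, namespace: str = "default") -> List[List[str]]:
    """Return kubectl argument lists for a first look at a misbehaving pod.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        # Status, conditions, and recent events for the pod.
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # Logs from the current container instance.
        ["kubectl", "logs", pod, "-n", namespace],
        # Logs from the previous instance, useful after a restart.
        ["kubectl", "logs", pod, "-n", namespace, "--previous"],
        # Namespace events filtered to this pod, sorted by timestamp.
        ["kubectl", "get", "events", "-n", namespace,
         "--field-selector", f"involvedObject.name={pod}",
         "--sort-by=.lastTimestamp"],
    ]
```

Each list can be passed to subprocess.run on a machine with cluster access, for example subprocess.run(pod_triage_commands("web-0", "prod")[0]). Returning lists rather than shell strings avoids quoting problems with unusual pod names.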

2. Node‑related Issues

Abnormal node status: Use kubectl get nodes, kubectl describe node, and kubectl get pods -o wide --all-namespaces to view node health and pod distribution.

Pod cannot reach network: Check the node's status and conditions, the CNI components running on that node, and the container logs.

Pod cannot access storage: Verify pod volume configuration, PVC status, and mounted filesystem.

Volume mount failure: Confirm PVC status, network storage connectivity, and storage server health.

Node joins cluster but pods are not scheduled: Examine taints, tolerations, selectors, resource usage, and API server connectivity.

PersistentVolume mount failure: Ensure PV‑PVC matching, storage class consistency, and node storage configuration.
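
When a healthy node receives no pods, taints without matching tolerations are a frequent cause. The following is a simplified sketch of the matching rule, assuming taints and tolerations are plain dictionaries shaped like the corresponding fields in a Node and Pod spec. The function names are hypothetical, and the scheduler's real logic covers more cases (for example tolerationSeconds).

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Simplified check of whether one toleration matches one taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return key == "" or key == taint.get("key")
    # Equal: key and value must both match.
    return (key == taint.get("key")
            and toleration.get("value", "") == taint.get("value", ""))


def untolerated_taints(taints: List[Dict[str, str]],
                       tolerations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the NoSchedule/NoExecute taints that no toleration matches."""
    blocking = [t for t in taints
                if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in blocking
            if not any(tolerates(tol, t) for tol in tolerations)]
```

A non-empty result means the pod cannot be scheduled onto that node until a toleration is added or the taint is removed. PreferNoSchedule taints are ignored here because they only influence scoring and do not block scheduling.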

3. Cluster‑level Issues

Many pods run slowly: Review resource usage of all pods and nodes, and check container logs.

Service unavailable: Inspect related pod status, network connectivity, storage access, and service configuration.

Node‑pod imbalance: Examine node and pod statuses, resource consumption, and affinity/anti‑affinity rules.

Node crash: Use kubectl drain to cordon the node and evict its pods, then perform maintenance or replace hardware. Remove the node from the cluster only if it will not be brought back.

API Server down: Verify cluster health, API Server and kubelet version compatibility, and check server logs.

Kubectl command failures: Check API Server availability, user permissions, and kubeconfig credentials.

Master node unavailable: Inspect kube‑apiserver, scheduler, controller‑manager, etcd health, and restart kubelet and container runtime on the master.

Bypass LoadBalancer to access pod directly: Ensure Service type is ClusterIP and selector matches the target pod.

Deployment auto‑update fails: Validate update strategy, API Server/kubelet connectivity, and pod definitions.

Health check errors: Review node logs, confirm compatibility with kubelet version, and consider component upgrades.

Authorization misconfiguration: Verify RoleBinding/ClusterRoleBinding definitions and user/service‑account permissions.

Cannot connect to etcd storage: Ensure etcd is running and API Server etcd connection settings are correct; test manual etcd access.
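
Several items above depend on a Service selector matching the target pod's labels. The sketch below shows that equality-based rule in isolation, assuming the selector and labels have already been read from the Service and Pod manifests. The name selector_matches is hypothetical.

```python
from typing import Dict


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """True if every selector key/value pair is present in the pod's labels."""
    if not selector:
        # A Service without a selector does not select pods automatically.
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())
```

For example, selector_matches({"app": "web"}, {"app": "web", "tier": "frontend"}) is true, while a selector that also requires a label the pod lacks is false. In the second case the Service ends up with no endpoints for that pod even though the pod itself is healthy.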

4. Common Pod Status Troubleshooting

Pending: Examine pod events to determine unschedulable reasons such as insufficient resources or occupied HostPort.

Waiting or ContainerCreating: Check events for image pull failures, CNI network errors, or container startup issues.

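ImagePullBackOff: Verify the image name, tag, and private registry credentials. Test the pull manually (docker pull, or crictl pull on nodes that run containerd), and create a registry Secret referenced by the pod's imagePullSecrets where needed.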

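CrashLoopBackOff: The container started and then exited, and kubelet keeps restarting it with an increasing delay. Review the logs, including those of the previous instance (kubectl logs --previous), and exec into the container if it stays up long enough to find the cause (e.g., process crash or failed liveness probe).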

Error: May stem from missing dependencies, resource limits, security policy violations, or insufficient permissions.

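Terminating or Unknown: These states usually mean the node hosting the pod has stopped responding. Restore node health, or delete the node object if the node is gone for good. Force-delete the pod (kubectl delete pod --grace-period=0 --force) only with caution, because the container may still be running on the unreachable node.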
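
The waiting reasons described above can also be read programmatically from the output of kubectl get pod -o json. The sketch below assumes that JSON has already been parsed into a dictionary. The HINTS table and the function name waiting_reasons are hypothetical, and the hints only paraphrase the advice in this section.

```python
from typing import Any, Dict, List

# Hints paraphrased from the status descriptions above; not an exhaustive list.
HINTS = {
    "ImagePullBackOff": "check image name and registry credentials",
    "ErrImagePull": "check image name and registry credentials",
    "InvalidImageName": "check the image reference in the manifest",
    "CrashLoopBackOff": "check container logs and liveness probes",
    "ContainerCreating": "check events for image pull or CNI errors",
}


def waiting_reasons(pod: Dict[str, Any]) -> List[str]:
    """List each waiting container with its reason and a first thing to check."""
    status = pod.get("status") or {}
    statuses = ((status.get("initContainerStatuses") or [])
                + (status.get("containerStatuses") or []))
    findings = []
    for container in statuses:
        waiting = (container.get("state") or {}).get("waiting")
        if waiting:
            reason = waiting.get("reason", "Unknown")
            hint = HINTS.get(reason, "see the events in kubectl describe pod")
            findings.append(f"{container.get('name', '?')}: {reason} ({hint})")
    return findings
```

Init containers are listed first because a pod cannot start its main containers until they complete, so a stuck init container is usually the earlier cause.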

5. Analyzing Container Exit Codes

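Pod states such as CrashLoopBackOff and InvalidImageName describe why a container is not running, while the container's exit code describes how it ended. Exit code 0 indicates normal termination. Codes above 128 conventionally mean the process was terminated by a signal, with the signal number added to 128: exit code 137 (128 + 9) shows the container received SIGKILL, which is commonly the result of an out-of-memory kill.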
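
The exit code convention can be sketched as a small decoder. This is a minimal illustration that assumes the common shell convention of 128 plus the signal number for signal-terminated processes. The function name describe_exit_code is hypothetical, and signal names are resolved from the platform running the script (a POSIX system is assumed).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a one-line reading of a container exit code."""
    if code == 0:
        return "0: normal termination"
    if 128 < code < 256:
        # By convention, 128 + N means the process was ended by signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not known on this platform.
            name = f"signal {signum}"
        return f"{code}: terminated by {name} (128 + {signum})"
    return f"{code}: application exited with an error"
```

On Linux, describe_exit_code(137) reports SIGKILL and describe_exit_code(143) reports SIGTERM. Codes from 1 to 128 are left to the application's own documentation, since their meaning is application-specific.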

This handbook equips Kubernetes operators with a systematic troubleshooting methodology covering pod, node, and cluster‑level issues, enabling rapid problem identification and resolution to maintain stable K8s environments.
