Mastering Kubernetes: 30+ Essential Pod, Node, and Cluster Troubleshooting Techniques
This guide compiles over thirty practical Kubernetes troubleshooting steps, covering pod startup failures, networking issues, resource bottlenecks, node abnormalities, cluster‑wide service problems, and detailed explanations of common container exit codes to help operators quickly diagnose and resolve issues.
Pod‑related problems and diagnostics
Pod cannot start : use kubectl describe pod <pod_name> -n <namespace> to view status, events, and container conditions.
Pod cannot connect to other services : exec into the pod with kubectl exec -it <pod_name> -n <namespace> -- /bin/bash and test connectivity (ping, telnet); list the NetworkPolicies that apply with kubectl get networkpolicy -n <namespace> and check the service configuration with kubectl describe service <service_name>.
Pod runs slowly or abnormally : inspect resource usage with kubectl top pod <pod_name> -n <namespace>, examine processes inside the container, and review logs.
Pod cannot be scheduled : describe the pod to see scheduling failures, check node resources with kubectl get nodes and kubectl describe node <node_name>, and verify label/selector matching.
Pod stays in Pending : confirm pod spec correctness, ensure required node types (e.g., GPU) exist, verify resource quotas, and check storage volume availability.
Pod cannot access external services : verify DNS configuration, service existence in the namespace, network permissions, node egress rules, and network policies.
Pod exits immediately after start : check events (kubectl describe pod), logs (kubectl logs), container image, environment variables, and entrypoint scripts; reproduce locally with docker run <image>.
Pod runs but the application fails : review application logs, pod events, configuration files, dependencies, resource limits, and test the image locally.
Service unreachable in the cluster : verify CoreDNS, DNS config (/etc/resolv.conf), service ports, pod‑service binding, node health, CNI plugins (flannel, calico), kube‑proxy, and iptables/IPVS rules; list the services with kubectl get svc -n <namespace>.
Pod in CrashLoopBackOff : view logs (kubectl logs <pod> and kubectl logs --previous <pod>), check recent events, and exec into the container for deeper inspection.
Pod internal service or network issues : ensure the associated Service exists, ports match, DNS resolves, and test connectivity with kubectl exec.
Pod storage problems : describe the pod and PVC, verify volume binding, exec into the pod to list files, and confirm storage class and access mode compatibility.
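The pod checks above mostly begin with the same few commands, so they can be wrapped in a small helper. This is a minimal sketch, not a definitive tool: the function name pod_triage is hypothetical, and it assumes kubectl is already configured for the target cluster and that metrics-server is installed for kubectl top.

```shell
# Hypothetical helper: first-pass triage for a single pod.
# Usage: pod_triage <pod_name> <namespace>
pod_triage() {
  pod="$1"
  ns="$2"
  # Status, container conditions, and recent events.
  kubectl describe pod "$pod" -n "$ns"
  # Logs from the current container instance.
  kubectl logs "$pod" -n "$ns"
  # Logs from the previous instance; the error is ignored when the
  # container has never restarted.
  kubectl logs "$pod" -n "$ns" --previous || true
  # CPU and memory usage (assumes metrics-server is available).
  kubectl top pod "$pod" -n "$ns" || true
}
```

Calling pod_triage <pod_name> <namespace> prints the four outputs in order; connectivity and storage checks still need a manual kubectl exec session.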
Node‑related problems and diagnostics
Node abnormal state : kubectl get nodes and kubectl describe node <node_name> reveal hardware usage and possible bottlenecks; list pods per node with kubectl get pods -o wide --all-namespaces.
Pods on a node cannot reach the network : describe the node, check pod‑node association, and review pod logs for errors.
Node‑pod storage failures : inspect pod volume definitions, exec into the pod to test file access, and describe the PVC for status.
Node cannot schedule new pods : verify taints/tolerations, resource availability, and ensure the node can communicate with the API server.
PersistentVolume mount failures : confirm PV‑PVC matching, storage class consistency, and underlying storage system health.
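The node checks above can be gathered the same way. The sketch below uses a hypothetical function name, node_triage, and assumes a configured kubectl; the field selector spec.nodeName limits the pod listing to the node being inspected.

```shell
# Hypothetical helper: summarize one node and the pods scheduled on it.
# Usage: node_triage <node_name>
node_triage() {
  node="$1"
  # Ready/NotReady status of every node, for comparison.
  kubectl get nodes -o wide
  # Conditions, taints, allocatable resources, and recent events.
  kubectl describe node "$node"
  # Pods currently bound to this node, across all namespaces.
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$node"
}
```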
Cluster‑level issues and diagnostics
Many pods run slowly : use kubectl top pod -n <ns> and kubectl top node to spot resource bottlenecks; check pod logs for errors.
Specific service unavailable : list pods of the service, describe the pod and service, and verify network and storage configurations.
Node‑pod imbalance : compare node and pod distribution with kubectl get nodes and kubectl get pods -o wide --all-namespaces; examine affinity/anti‑affinity rules.
Node failure handling : cordon the node (kubectl cordon <node>), drain it (kubectl drain <node> --ignore-daemonsets), then remove it with kubectl delete node <node> if necessary.
API server outage : run kubectl cluster-info, check version compatibility, and inspect the kube‑apiserver service status.
Kubernetes command failures : verify API server reachability, user permissions (kubectl auth can-i), and kubeconfig correctness.
Master node down : ensure kube‑apiserver, scheduler, controller‑manager, and etcd are running.
LoadBalancer bypassed : confirm the Service type (LoadBalancer rather than ClusterIP) and that the selector matches the intended pods.
Deployment rollout failures : check rolling update strategy, API server connectivity, and pod definitions.
Cluster health checks failing : review node logs, kubelet version compatibility, and upgrade components if needed.
RBAC misconfiguration : audit RoleBinding/ClusterRoleBinding and service‑account permissions.
etcd connectivity issues : verify etcd health, API server etcd endpoints, and test with etcdctl endpoint health (etcdctl cluster-health on the legacy v2 API).
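Two of the cluster‑level items above, node failure handling and RBAC auditing, translate directly into short command sequences. The following is a sketch under assumptions: the function names evacuate_node and can_i are hypothetical, kubectl is assumed to be configured with sufficient rights, and the drain uses only the --ignore-daemonsets flag named in this guide (pods with local data may need additional flags).

```shell
# Hypothetical helper: take a failing node out of service.
# Usage: evacuate_node <node_name>
evacuate_node() {
  node="$1"
  # Stop new pods from being scheduled onto the node.
  kubectl cordon "$node" || return 1
  # Evict running pods; DaemonSet pods are left in place.
  kubectl drain "$node" --ignore-daemonsets || return 1
  echo "Node $node drained. If it will not return, run: kubectl delete node $node"
}

# Hypothetical helper: ask whether a service account may perform an action.
# Usage: can_i <verb> <resource> <namespace> <service_account>
can_i() {
  kubectl auth can-i "$1" "$2" -n "$3" --as="system:serviceaccount:$3:$4"
}
```

The node deletion is printed rather than executed, since removing a node is only appropriate once recovery has been ruled out.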
Common pod status troubleshooting
Typical commands to inspect any pod:
kubectl get pod <pod-name> -o yaml
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> [-c <container-name>]
These commands reveal configuration errors, events, and container logs that are essential for root‑cause analysis.
Pending : pod not scheduled; causes include insufficient resources, occupied HostPort, or node taints.
Waiting / ContainerCreating : image pull failures, CNI network errors, or misconfigured container parameters.
ImagePullBackOff : incorrect image name or missing secret for private registries; test with docker pull <image> and create a docker-registry secret if needed.
CrashLoopBackOff : container exits repeatedly; check logs, previous logs, and exec into the container for deeper clues.
Error : missing ConfigMap/Secret/PV, resource limits exceeded, or security policy violations.
Terminating / Unknown : node lost; options are node deletion, node recovery, or forced pod deletion with kubectl delete pod <pod> --grace-period=0 --force.
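The status list above can be turned into a quick lookup. In this sketch the function names waiting_reason and status_hint are hypothetical, the hints only restate the causes listed above, and ErrImagePull is included as an assumption because it commonly appears alongside ImagePullBackOff.

```shell
# Print the waiting reason of each container in a pod, if any.
# Usage: waiting_reason <pod_name> <namespace>
waiting_reason() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}'
}

# Map a pod status or waiting reason to a first check.
# Usage: status_hint <status>
status_hint() {
  case "$1" in
    Pending)
      echo "check node resources, HostPort conflicts, and node taints" ;;
    Waiting|ContainerCreating)
      echo "check image pulls, CNI errors, and container parameters" ;;
    ImagePullBackOff|ErrImagePull)
      echo "check the image name and the registry pull secret" ;;
    CrashLoopBackOff)
      echo "check current and previous container logs" ;;
    Error)
      echo "check ConfigMaps, Secrets, PVs, resource limits, and security policy" ;;
    Terminating|Unknown)
      echo "check whether the node is lost" ;;
    *)
      echo "no hint; start with kubectl describe pod" ;;
  esac
}
```

For example, status_hint "$(waiting_reason <pod_name> <namespace>)" gives a starting point for a single-container pod that is stuck waiting.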
Analyzing container exit codes
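Exit codes range from 0 to 255. Code 0 means normal termination; 1‑128 indicate errors raised by the application or shell; 129‑255 mean the process was terminated by a signal and equal 128 plus the signal number (e.g., 137 = 128 + 9, SIGKILL; 139 = 128 + 11, SIGSEGV; 143 = 128 + 15, SIGTERM). Codes outside the range are wrapped into it: negative codes are converted using 256 - (|code| % 256), while positive codes use code % 256.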
0 : normal exit, often used by jobs.
1 : generic program error (e.g., division by zero, missing file).
137 : container killed (SIGKILL), frequently because it exceeded its memory limit (OOMKilled).
139 : segmentation fault (SIGSEGV).
143 : termination signal (SIGTERM), usually from docker stop or a pod deletion.
126 : permission or non‑executable command.
127 : command not found.
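The exit‑code rules above are simple arithmetic, so they can be checked with a short script. This is a sketch with hypothetical function names (last_exit_code, normalize_exit_code, explain_exit_code); the jsonpath expression reads the exit code of the last terminated container instance and assumes the container has restarted at least once.

```shell
# Read the exit code of the previous container instance(s) in a pod.
# Usage: last_exit_code <pod_name> <namespace>
last_exit_code() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
}

# Wrap any integer into the 0-255 range.
normalize_exit_code() {
  echo $(( (($1 % 256) + 256) % 256 ))
}

# Classify an exit code that is already in the 0-255 range.
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "normal exit"
  elif [ "$code" -eq 126 ]; then
    echo "permission denied or command not executable"
  elif [ "$code" -eq 127 ]; then
    echo "command not found"
  elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    # Signal number is the exit code minus 128.
    echo "killed by signal $((code - 128))"
  else
    echo "application error"
  fi
}
```

For instance, explain_exit_code 137 reports signal 9 (SIGKILL), which matches the OOMKilled case described above.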
dbaplus Community
