Essential Kubernetes Troubleshooting Checklist for Ops Engineers
This guide gives Kubernetes operators a step‑by‑step troubleshooting checklist for pod, node, and cluster‑level issues. It covers common pod states, container exit codes, and practical commands such as kubectl describe, logs, top, and drain, so that problems can be diagnosed and resolved quickly.
1. Pod‑related Issues
Cannot start: Use kubectl describe pod, kubectl logs, and kubectl get events to inspect pod status, container logs, and events.
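Cannot connect to other services: From inside the pod, test connectivity with ping or telnet (a Service ClusterIP may not answer ping, so a TCP test against the service port is more reliable), then check the NetworkPolicy configuration and the status of the target service.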
Slow or abnormal operation: Run kubectl top pod to view resource usage, then inspect processes and logs inside the container.
Unschedulable: Examine pod scheduling details, node resource usage, and label/annotation matching.
Persistent Pending state: Review pod status and events, verify manifest correctness, and check resource quotas, volumes, and scheduling policies.
Cannot access external services: Verify pod DNS settings, Service objects in the namespace, network permissions, and network policies.
Container exits immediately or fails to run the application: Check pod events and logs, examine container image, environment variables, entry scripts, application configuration, and dependencies.
Service unreachable: Inspect CoreDNS, DNS config files, service ports, service‑pod association, CNI components, kube‑proxy, and iptables/ipvs rules.
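The pod checks above can be collected into a couple of shell helpers. This is a minimal sketch rather than a definitive runbook: the function names pod_triage and service_check are hypothetical, the namespace falls back to default, kubectl top assumes metrics-server is installed, and the k8s-app=kube-dns label is an assumption about how CoreDNS is labelled in the cluster.

```shell
# Hypothetical helper: first-pass evidence for a pod that will not start or misbehaves.
# Usage: pod_triage <pod-name> [namespace]
pod_triage() {
  pod="$1"
  ns="${2:-default}"
  # Status, conditions, and the events recorded on the pod object
  kubectl describe pod "$pod" -n "$ns"
  # Recent logs from the current container instance
  kubectl logs "$pod" -n "$ns" --tail=100
  # Logs from the previous instance; this fails harmlessly if the container never restarted
  kubectl logs "$pod" -n "$ns" --previous --tail=100
  # Events that reference this pod, oldest first
  kubectl get events -n "$ns" --field-selector involvedObject.name="$pod" \
    --sort-by=.metadata.creationTimestamp
  # CPU and memory usage (assumes metrics-server is installed)
  kubectl top pod "$pod" -n "$ns"
}

# Hypothetical helper: basic checks for a Service that cannot be reached.
# Usage: service_check <service-name> [namespace]
service_check() {
  svc="$1"
  ns="${2:-default}"
  # Service type, ports, and selector
  kubectl get svc "$svc" -n "$ns" -o wide
  # Empty endpoints usually mean the selector matches no ready pod
  kubectl get endpoints "$svc" -n "$ns"
  # Policies that may block traffic in this namespace
  kubectl get networkpolicy -n "$ns"
  # CoreDNS pods (label is an assumption; adjust to the cluster)
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
}
```

For example, pod_triage web prod gathers the description, logs, events, and resource usage of a pod named web in namespace prod (both names are placeholders), and service_check api inspects a Service named api in the default namespace.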
2. Node‑related Issues
Abnormal node status: Use kubectl get nodes, kubectl describe node, and kubectl get pods -o wide --all-namespaces to view node health, details, and pod distribution.
Pod cannot access network: Check node information, the node on which the pod runs, and container logs.
Pod cannot access storage: Review pod volume configuration, enter the container to access the mounted filesystem, and examine PVC status.
Volume mount failure: Verify pod volume settings, PVC status, network storage connectivity, and storage server health.
Node joins cluster but pods are not scheduled: Check that pods tolerate the node's taints and that node labels satisfy the pods' nodeSelector or affinity rules, check node resource usage, and confirm the node can reach the API server.
PersistentVolume mount failure: Check PV‑PVC matching, storageClassName consistency, node storage configuration, and automatic provisioning permissions.
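The node and storage checks above might look like the following in practice. This is a sketch under stated assumptions: node_check and storage_check are hypothetical helper names, and the df call assumes the container image ships that binary.

```shell
# Hypothetical helper: health, taints, and pod placement for one node.
# Usage: node_check <node-name>
node_check() {
  node="$1"
  # Ready/NotReady status, versions, and addresses of all nodes
  kubectl get nodes -o wide
  # Conditions, capacity, allocated resources, and recent node events
  kubectl describe node "$node"
  # Taints that pods must tolerate to be scheduled on this node
  kubectl get node "$node" -o jsonpath='{.spec.taints}'
  echo
  # Pods currently bound to this node, across all namespaces
  kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName="$node"
}

# Hypothetical helper: volume and claim status for one pod.
# Usage: storage_check <pod-name> [namespace]
storage_check() {
  pod="$1"
  ns="${2:-default}"
  # Volumes declared in the pod spec
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.volumes}'
  echo
  # Claims should be Bound; Pending points at provisioning or matching problems
  kubectl get pvc -n "$ns"
  # Compare capacity, access modes, and storage class with the claim
  kubectl get pv
  kubectl get storageclass
  # Mounted filesystems as seen from inside the container (assumes df exists in the image)
  kubectl exec "$pod" -n "$ns" -- df -h
}
```

A PVC that stays Pending while no PV shares its storageClassName is a typical sign of the PV‑PVC mismatch described above.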
3. Cluster‑level Issues
Many pods run slowly: Check resource usage across all pods and nodes, and review container logs.
Service unavailable: Inspect the status of related pods, network connectivity, storage access, and service configuration.
Node‑pod imbalance: Review node and pod statuses, pod resource consumption, and affinity/anti‑affinity rules.
Node crash: Examine node health, use kubectl drain to evict pods, and redeploy them to other nodes.
Kubernetes API Server down: Verify cluster health, ensure kubelet versions are within the supported version skew of the API Server (a kubelet must not be newer than the API Server), and check API Server logs.
Kubernetes command execution failure: Check API Server availability, user permissions, and kubeconfig credentials.
Master node unavailable: Inspect kube‑apiserver, kube‑scheduler, kube‑controller‑manager, and etcd status; restart kubelet and container runtime if needed.
Accessing a pod directly without a LoadBalancer: Ensure the Service is of type ClusterIP and that its selector matches the labels of the target pod.
Deployment auto‑update failure: Verify update strategy, API Server‑kubelet connectivity, and pod definitions.
Status check errors: Review node logs and events, confirm compatibility with kubelet version, and consider component upgrades.
Incorrect RBAC configuration: Validate RoleBinding and ClusterRoleBinding definitions, and ensure user/service‑account roles are correct.
Unable to connect to etcd storage: Confirm etcd is running, check API Server etcd connection settings, and test manual etcd connectivity.
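Several of the cluster‑level checks above can be expressed as short helpers. The sketch below is illustrative only: the function names are hypothetical, and control_plane_check assumes the control‑plane components run as pods in kube-system (as in kubeadm‑style clusters) and that the API Server is still answering requests. If it is not, the same investigation has to continue on the control‑plane host itself.

```shell
# Hypothetical helper: move workloads off a failing node.
# Usage: evacuate_node <node-name>
evacuate_node() {
  # drain cordons the node first, then evicts its pods;
  # DaemonSet pods are skipped and emptyDir data on the node is discarded
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Hypothetical helper: API Server readiness and control-plane pods.
control_plane_check() {
  # Per-check readiness report from the API Server
  kubectl get --raw='/readyz?verbose'
  # kube-apiserver, kube-scheduler, kube-controller-manager, etcd, CoreDNS, kube-proxy
  kubectl get pods -n kube-system -o wide
}

# Hypothetical helper: test whether a service account may perform an action.
# Usage: rbac_check <verb> <resource> <namespace> <service-account>
rbac_check() {
  kubectl auth can-i "$1" "$2" -n "$3" --as="system:serviceaccount:$3:$4"
}

# Hypothetical helper: progress and history of a Deployment update.
# Usage: rollout_check <deployment-name> [namespace]
rollout_check() {
  ns="${2:-default}"
  kubectl rollout status "deployment/$1" -n "$ns" --timeout=60s
  kubectl rollout history "deployment/$1" -n "$ns"
}
```

For example, rbac_check list pods prod builder answers yes or no for a service account named builder in namespace prod (placeholder names). After a drained node has been repaired, kubectl uncordon makes it schedulable again.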
4. Common Pod State Troubleshooting
Pending: Review pod events to identify unschedulable reasons such as insufficient resources or occupied HostPort.
Waiting / ContainerCreating: Check events for image pull failures, CNI network errors, or container start issues.
ImagePullBackOff: Verify the image name and tag and the private registry credentials; test the pull manually on the node (for example with docker pull) and create the appropriate image pull Secret.
CrashLoopBackOff: The container starts and then exits repeatedly; examine the logs and the container command to find the cause (e.g., a process crash or a failed liveness probe).
Error: May stem from missing dependencies, resource limits, security policy violations, or insufficient permissions.
Terminating / Unknown: Usually caused by an unreachable node. Restore node health, delete the affected node from the cluster, or, as a last resort, force‑delete the pod with caution.
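The state checks above can be sketched as a few helpers. All function names are hypothetical. Passing a registry password on the command line exposes it to the shell history and process list, so this form of create_pull_secret is for illustration only. Force deletion removes the pod object without waiting for the kubelet to confirm that the containers have stopped, which is why the list above advises caution.

```shell
# Hypothetical helper: pods that are not in the Running phase, across all namespaces.
# Completed Job pods (phase Succeeded) also appear in this list.
list_unhealthy_pods() {
  kubectl get pods --all-namespaces --field-selector='status.phase!=Running' -o wide
}

# Hypothetical helper: why each container of a pod is waiting
# (for example ImagePullBackOff or CrashLoopBackOff).
# Usage: waiting_reason <pod-name> [namespace]
waiting_reason() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.waiting.reason}{"\n"}{end}'
}

# Hypothetical helper: registry credentials for private images.
# Usage: create_pull_secret <secret-name> <registry-server> <user> <password> [namespace]
create_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server="$2" --docker-username="$3" --docker-password="$4" \
    -n "${5:-default}"
}

# Hypothetical helper: last resort for a pod stuck in Terminating.
# Usage: force_delete_pod <pod-name> [namespace]
force_delete_pod() {
  kubectl delete pod "$1" -n "${2:-default}" --grace-period=0 --force
}
```

The Secret created by create_pull_secret only takes effect once the pod spec references it under imagePullSecrets or it is attached to the pod's service account.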
5. Container Exit Code Analysis
Pod statuses such as CrashLoopBackOff and InvalidImageName should be read together with the container exit code. Exit code 0 indicates normal termination, while exit code 137 (128 + 9) indicates that the container received a SIGKILL signal, commonly because it was OOM‑killed or forcibly stopped; other codes above 128 follow the same pattern of 128 plus the signal number.
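The mapping from exit code to likely cause can be sketched as a small decoder. It relies on the common convention that a process terminated by signal N reports exit code 128 + N, and it uses the Linux signal numbers. The function names explain_exit_code and last_exit_code are hypothetical, only a handful of signals are named, and last_exit_code reads only the first container of the pod.

```shell
# Hypothetical helper: translate a container exit code into a short explanation.
# Usage: explain_exit_code <numeric-code>
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "0: normal termination"
  elif [ "$code" -eq 126 ]; then
    echo "126: command found but not executable"
  elif [ "$code" -eq 127 ]; then
    echo "127: command not found"
  elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    # Codes above 128 encode the terminating signal as 128 + signal number
    sig=$((code - 128))
    case "$sig" in
      6)  name="SIGABRT" ;;
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal $sig" ;;
    esac
    echo "$code: terminated by $name (128 + $sig)"
  else
    echo "$code: application-defined error"
  fi
}

# Hypothetical helper: exit code of the last terminated instance of the first container.
# Usage: last_exit_code <pod-name> [namespace]
last_exit_code() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

Running explain_exit_code 137 prints "137: terminated by SIGKILL (128 + 9)". When that code appears, the Last State section of kubectl describe pod shows whether the reason was OOMKilled.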
This manual offers Kubernetes operators a systematic troubleshooting framework covering pod, node, and cluster‑level problems, helping quickly locate and resolve issues to maintain stable cluster operation.
MaGe Linux Operations
