
Essential Kubernetes Ops Checklist: Diagnose and Fix Common Cluster Issues

This guide provides a comprehensive, step‑by‑step troubleshooting checklist for Kubernetes operations, covering Pod, Node, and cluster‑level problems, common pod status anomalies, and container exit‑code analysis to help operators quickly locate and resolve issues.

Linux Cloud Computing Practice

The "K8S Operations Essential Troubleshooting Manual" is a practical guide designed to help operators quickly locate and resolve various problems in Kubernetes (K8S) clusters. Below is a summary of its core content:

1. Pod‑Related Issue Troubleshooting

Cannot start: Use kubectl describe pod, kubectl logs, and kubectl get events to examine pod status, container logs, and events.
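A minimal triage sequence for a pod that will not start might look like the following (the pod name my-pod and namespace default are placeholders):

```shell
# Inspect pod status, container logs, and recent events (names are placeholders)
kubectl describe pod my-pod -n default
kubectl logs my-pod -n default --all-containers
kubectl get events -n default \
  --field-selector involvedObject.name=my-pod \
  --sort-by=.lastTimestamp
```

The Events section at the bottom of the describe output usually names the immediate cause (failed pull, failed mount, scheduling failure).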

Cannot connect to other services: Enter the pod and test network connectivity with ping or telnet, and check NetworkPolicy and the target service configuration.
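A sketch of probing connectivity from inside the pod, assuming the image ships a shell and basic network tools (service name and port are placeholders):

```shell
# Open a shell in the pod and test DNS resolution plus TCP reachability
kubectl exec -it my-pod -- sh -c 'nslookup my-service && nc -zv my-service 8080'
# List NetworkPolicies that may be blocking traffic in the namespace
kubectl get networkpolicy -n default
```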

Slow or abnormal operation: Run kubectl top pod to view resource usage, inspect processes inside the container, or check container logs.

Cannot be scheduled to a node: Review pod scheduling details, node resource usage, and label/annotation matching.

Stays in Pending: Check pod status and events, verify manifest correctness, and examine resource quotas, volumes, and scheduling policies.
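For a Pending pod, the events in the describe output usually name the blocker; quotas and node capacity can be checked alongside (names below are placeholders):

```shell
# The Events section explains why the scheduler could not place the pod
kubectl describe pod my-pod | sed -n '/Events:/,$p'
# Check namespace quotas and per-node allocated resources
kubectl get resourcequota -n default
kubectl describe nodes | grep -A5 'Allocated resources'
```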

Cannot access external services: Verify pod DNS configuration, Service objects in the namespace, network access permissions, and network policies.

Exits immediately or fails to run the application: Review pod events and logs, check container image, environment variables, entry scripts, application config files, and dependencies.

Service unreachable: Inspect CoreDNS, DNS config files, service ports, service‑to‑pod association, pod health, CNI components, kube‑proxy, and iptables/ipvs rules.
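One way to walk the service path from the Service object down to DNS (service and namespace names are placeholders):

```shell
# Does the Service exist, and does its selector match any ready pod?
kubectl get svc my-service -n default
kubectl get endpoints my-service -n default   # empty = no ready pod matches the selector
# Is CoreDNS healthy?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=20
```

An empty Endpoints list is the most common culprit: either the selector is wrong or the backing pods fail their readiness probes.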

2. Node‑Related Issue Troubleshooting

Abnormal status: Use kubectl get nodes, kubectl describe node, and kubectl get pods -o wide --all-namespaces to view node status, details, and pod distribution.

Pod cannot access network: Check node information, the node on which the pod runs, and container logs.

Pod cannot access storage: Examine pod volume configuration, enter the container to view mounted filesystem, and verify PVC status.

Volume mount failure: Verify pod volume settings, PVC status, network storage connectivity, and storage server health.

Node joins cluster but cannot schedule pods: Check node taints and tolerations, pod selectors, node resource usage, and API server connectivity.
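Taints are listed in the node description; removing one (or adding a matching toleration to the pod spec) lets pods schedule again. Node and taint names below are illustrative:

```shell
# Show taints on the node
kubectl describe node my-node | grep -i taint
# Remove a taint (note the trailing '-'); the key "dedicated" is a placeholder
kubectl taint nodes my-node dedicated:NoSchedule-
```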

PersistentVolume mount failure: Confirm PV‑PVC matching, storageClassName alignment, node storage configuration, and provisioning permissions.
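PV-PVC matching can be verified by comparing status, capacity, access modes, and storageClassName (claim name is a placeholder):

```shell
# A PVC stuck in Pending usually means no PV or StorageClass satisfies it
kubectl get pv,pvc -A
kubectl describe pvc my-claim -n default
kubectl get storageclass
```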

3. Cluster‑Level Issue Troubleshooting

Many pods run slowly: Review resource usage of all pods and nodes, and check container logs.

Service unavailable: Examine related pod status, network connectivity, storage access, and service configuration.

Node‑Pod imbalance: Inspect node and pod status, pod resource consumption, and affinity/anti‑affinity policies.

Node failure: Use kubectl drain to evict pods, then perform maintenance or replacement.
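A typical drain-and-restore cycle might look like this (node name is a placeholder; the flags shown are commonly needed when DaemonSet pods or emptyDir volumes are present):

```shell
# Safely evict pods before maintenance
kubectl drain my-node --ignore-daemonsets --delete-emptydir-data
# ...perform maintenance or replacement, then allow scheduling again
kubectl uncordon my-node
```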

Kubernetes API Server unavailable: Verify cluster status, API Server and kubelet version compatibility, and system logs.

Kubernetes command execution fails: Check API server availability, user permissions, and kubeconfig credentials.

Master node down: Inspect kube‑apiserver, kube‑scheduler, kube‑controller‑manager, etcd health, and restart kubelet and container runtime on the master.

Bypass LoadBalancer to access pod directly: Ensure Service type is ClusterIP and selector matches the target pod.

Deployment auto‑update failure: Verify update strategy, API server‑kubelet connectivity, and pod definition correctness.

Health‑check errors: Review node logs and events, confirm compatibility with kubelet version, and upgrade components if needed.

Authorization misconfiguration: Validate RoleBinding and ClusterRoleBinding definitions, bound roles, and kubeconfig user permissions.

Cannot connect to etcd storage: Ensure etcd is running and API server etcd connection settings are correct; test manual etcd connection.
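A manual etcd health probe might look like the following; the endpoint and certificate paths are assumptions (typical for kubeadm-provisioned clusters) and should be adjusted to your layout:

```shell
# etcd v3 API health check; certificate paths are kubeadm defaults
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```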

4. Common Pod Status Anomaly Troubleshooting

Pending: Review pod events to determine unscheduled reasons such as insufficient resources or occupied HostPort.

Waiting or ContainerCreating: Check events for image pull failures, CNI network errors, or container startup issues.

ImagePullBackOff: Often caused by incorrect image name or private registry credentials; verify with docker pull and create appropriate Secret.
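For a private registry, the usual fix is a docker-registry Secret referenced via imagePullSecrets; all registry values below are placeholders:

```shell
# Create registry credentials and attach them to the default ServiceAccount
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=myuser \
  --docker-password=mypassword
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets":[{"name":"regcred"}]}'
```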

CrashLoopBackOff: Container started then exited unexpectedly; inspect logs and exec commands to find exit reasons like process failure or failed health checks.
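The logs of the previous (crashed) container instance, together with its recorded exit code, are often the key evidence (pod name is a placeholder):

```shell
# Logs from the last terminated instance, plus the recorded exit code
kubectl logs my-pod --previous
kubectl get pod my-pod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```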

Error: May stem from missing dependencies, resource quota exceedance, security policy violations, or insufficient container permissions.

Terminating or Unknown: May require node deletion, node recovery, or forced pod deletion with caution.

5. Kubernetes Fault‑Tracing Guide – Analyzing Container Exit Codes

This section explains the meanings of various pod states such as CrashLoopBackOff and InvalidImageName, and provides a range of container exit codes with typical interpretations (e.g., EXIT CODE 0 = normal exit, EXIT CODE 137 = container killed by SIGKILL).
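The convention behind codes above 128 can be decoded with plain shell arithmetic: such a code means the process was terminated by signal (code - 128), so 137 = 128 + 9 = SIGKILL (often the OOM killer). A minimal sketch, not an exhaustive mapping:

```shell
# Decode a container exit code; 137 is used here as the example value
code=137
if [ "$code" -gt 128 ]; then
  sig=$((code - 128))
  reason="killed by signal $sig"
else
  reason="exited with status $code"
fi
echo "$reason"
```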

This manual offers a fairly complete and systematic troubleshooting methodology for K8S operators, covering pod, node, and cluster-level issues, helping them quickly locate and resolve problems to keep clusters running stably. Feel free to use it!

Written by

Linux Cloud Computing Practice

Welcome to Linux Cloud Computing Practice. We offer high-quality articles on Linux, cloud computing, DevOps, networking and related topics. Dive in and start your Linux cloud computing journey!
