Master Kubernetes Troubleshooting: Common Issues and How to Fix Them
This guide walks you through the most frequent Kubernetes problems—from image pull failures and CrashLoopBackOff to DNS, storage, node readiness, and RBAC errors—providing clear diagnosis steps, essential kubectl commands, and concrete solutions to keep your clusters healthy.
Introduction
Two fundamental commands are the backbone of Kubernetes troubleshooting: kubectl describe (inspects resources and events) and kubectl logs (shows container output). Apply an "outside‑in, big‑to‑small" approach: Node → Pod → Container → Application.
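The outside-in order can be sketched as a small shell function. This is only an illustration: `triage_pod` is a hypothetical helper name, not a kubectl feature, and it assumes kubectl is already configured for the target cluster.

```shell
# Hypothetical helper: walk Node -> Pod -> Container for a single pod.
# Usage: triage_pod <namespace> <pod-name>
triage_pod() {
  tp_ns="$1"
  tp_pod="$2"

  # Node level: find the node that runs the pod and show its readiness.
  tp_node=$(kubectl get pod "$tp_pod" -n "$tp_ns" -o jsonpath='{.spec.nodeName}')
  if [ -n "$tp_node" ]; then
    kubectl get node "$tp_node"
  else
    echo "pod is not scheduled to a node yet"
  fi

  # Pod level: status, conditions, and recent events.
  kubectl describe pod "$tp_pod" -n "$tp_ns"

  # Container level: current output, then the previous (crashed) instance.
  kubectl logs "$tp_pod" -n "$tp_ns" --all-containers --tail=50
  kubectl logs "$tp_pod" -n "$tp_ns" --all-containers --tail=50 --previous \
    || echo "no previous container instance to show"
}
```

Calling `triage_pod default my-pod` prints the three levels in order, so the first abnormal output usually points at the layer to investigate further.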
1. Deployment & Configuration Issues
ImagePullBackOff / ErrImagePull
Symptoms : Pod status shows ImagePullBackOff or ErrImagePull.
Causes :
Incorrect image name or tag.
Missing imagePullSecrets for private registries.
Network cannot reach the registry.
Image architecture mismatch.
Resolution :
Verify the image reference: kubectl describe pod <pod-name>
Create a Docker registry secret (if needed) and attach it to the pod spec:
kubectl create secret docker-registry my-registry-key \
--docker-server=<registry> \
--docker-username=<user> \
--docker-password=<pass> \
--docker-email=<email>
imagePullSecrets:
- name: my-registry-key
Test pulling the image directly on the node:
docker pull <image>
# or
crictl pull <image>
CrashLoopBackOff
Symptoms : Pod repeatedly crashes, alternating between CrashLoopBackOff and Error.
Investigation & Fix :
View the previous container logs: kubectl logs <pod-name> --previous
Common causes: misconfiguration, missing dependencies, insufficient permissions, or an incorrect start command.
To inspect the environment, temporarily override the container's command so that it stays alive, then open a shell with kubectl exec:
command: ["/bin/sh"]
args: ["-c", "sleep 3600"]
Pending
Symptoms : Pod remains in Pending state.
Root cause : Scheduler cannot find a suitable node.
Inspect events for the pod: kubectl describe pod <pod-name>
Typical reasons:
Insufficient CPU or memory – increase node capacity or lower resource requests.
Node selector, affinity or taints that do not match any node – adjust labels, affinity rules, or add tolerations.
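The two reasons above can be told apart from the FailedScheduling event message. The sketch below assumes typical scheduler wording ("Insufficient cpu", "taint", "affinity/selector"); exact messages differ between Kubernetes versions, so the patterns are deliberately loose. Both function names are hypothetical.

```shell
# Hypothetical helper: map a FailedScheduling message to the matching fix.
# A message can list several reasons; the first matching pattern wins.
classify_scheduling_failure() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "resources: lower the requests or add node capacity" ;;
    *taint*)
      echo "taints: add a matching toleration or remove the taint" ;;
    *affinity*|*selector*)
      echo "placement: fix nodeSelector/affinity rules or node labels" ;;
    *)
      echo "unknown: read the full event message" ;;
  esac
}

# Hypothetical helper: print only the FailedScheduling events of one pod.
# Usage: pending_events <namespace> <pod-name>
pending_events() {
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=FailedScheduling"
}
```

For example, `classify_scheduling_failure "0/3 nodes are available: 3 Insufficient cpu."` points at the resource requests rather than at labels or taints.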
2. Runtime Issues
Pod Running but Service Unreachable
Confirm the application is listening on the expected port (check container spec and logs).
Verify the Service selector matches the Pod labels and that targetPort is correct.
Check that Endpoints exist for the Service: kubectl get endpoints <svc>
If the list is empty, no Ready Pods match the selector.
Inspect any NetworkPolicy that might block traffic.
Debug from inside the cluster:
kubectl run debug --rm -it --image=busybox -- sh
wget -qO- http://<svc>.<ns>.svc.cluster.local:<port>
DNS Resolution Failures
Symptoms : Pods cannot resolve service names.
Check CoreDNS pods are healthy:
kubectl get pods -n kube-system -l k8s-app=kube-dns
Ensure /etc/resolv.conf inside the pod contains the cluster DNS server (e.g., nameserver 10.96.0.10).
Run a DNS query from a pod:
nslookup kubernetes.default.svc.cluster.local
3. Storage Issues
PVC Pending
List available StorageClasses: kubectl get storageclass
Make sure the PVC's accessModes, size, and storage class match an existing PersistentVolume.
If using dynamic provisioning, verify that the provisioner pod is running and has no errors.
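The PVC checks above can be combined into one function. This is a minimal sketch, assuming kubectl access to the namespace; `pvc_status` is a hypothetical name, and the messages it prints are illustrative rather than kubectl output.

```shell
# Hypothetical helper: show why a PVC is still Pending.
# Usage: pvc_status <namespace> <pvc-name>
pvc_status() {
  # Read the StorageClass the claim asks for (empty if none is set).
  ps_class=$(kubectl get pvc "$2" -n "$1" -o jsonpath='{.spec.storageClassName}')
  if [ -z "$ps_class" ]; then
    echo "PVC names no StorageClass; it relies on a default class or a matching pre-created PV"
  elif kubectl get storageclass "$ps_class" >/dev/null 2>&1; then
    echo "StorageClass '$ps_class' exists; check the provisioner and the events below"
  else
    echo "StorageClass '$ps_class' does not exist"
  fi
  # The events at the end of this output usually state the provisioning error.
  kubectl describe pvc "$2" -n "$1"
}
```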
FailedMount
Inspect pod events for mount errors: kubectl describe pod <pod-name>
Confirm the backend storage (NFS, Ceph, etc.) is reachable, the export path is correct, and required mount utilities are installed on the node.
Check permissions on the storage target.
4. Node & Cluster Issues
Node NotReady
Check kubelet status: systemctl status kubelet
Verify the container runtime (docker or containerd) is active:
systemctl status docker # or systemctl status containerd
Inspect node resources – disk space and memory pressure:
df -h
free -h
Review kubelet logs for clues:
journalctl -u kubelet -f
5. Network & LoadBalancer Problems
NodePort Unreachable
Open the NodePort range in the host firewall (e.g., iptables or cloud security groups).
Ensure the Service has active Endpoints.
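Confirm the node is listening on the port (kube-proxy in iptables or IPVS mode may forward the port without a listening socket, so an empty result is not conclusive):
netstat -tunlp | grep <nodeport>
LoadBalancer Not Working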
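Verify that something in the cluster implements Services of type LoadBalancer: the cloud provider integration in cloud environments, or a project such as MetalLB on bare metal. Without one, the Service's EXTERNAL-IP stays <pending>. An ingress controller such as ingress-nginx does not provision load balancers itself; it typically sits behind one.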
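Whether a load balancer has been provisioned can be read from the Service status. The function below is a sketch with a hypothetical name (`lb_address`); it relies only on the standard `.status.loadBalancer.ingress` field of a Service.

```shell
# Hypothetical helper: report the external address of a LoadBalancer Service.
# Usage: lb_address <namespace> <service-name>
lb_address() {
  # A provisioned load balancer reports either an IP or a hostname.
  la_addr=$(kubectl get svc "$2" -n "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[*].ip}{.status.loadBalancer.ingress[*].hostname}')
  if [ -n "$la_addr" ]; then
    echo "$la_addr"
  else
    echo "pending: no controller has assigned an address yet"
    return 1
  fi
}
```

If the function keeps reporting "pending", the events shown by kubectl describe svc <svc> are the next place to look.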
6. Job & CronJob Issues
Job stuck : Pod process is hanging or looping. Examine pod logs and exit codes.
CronJob not firing : Check the cron schedule expression and inspect the CronJob controller events.
kubectl get cronjob
kubectl describe cronjob <name>
7. Security & RBAC Issues
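Forbidden : The user or service account lacks a Role or ClusterRole that allows the request, or the RoleBinding or ClusterRoleBinding that grants it is missing. Create the appropriate role and binding for the subject.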
Pod API access failures: Add serviceAccountName to the pod spec and bind the required permissions.
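Permissions can be checked before and after a binding is created with kubectl auth can-i. The helpers below are a sketch: the function names and the "<service-account>-view" binding name are arbitrary, the built-in view ClusterRole is used only as an example, and --as requires that your own user is allowed to impersonate.

```shell
# Hypothetical helper: ask the API server whether a service account may perform an action.
# Usage: sa_can_i <namespace> <service-account> <verb> <resource>
sa_can_i() {
  kubectl auth can-i "$3" "$4" -n "$1" \
    --as="system:serviceaccount:$1:$2"
}

# Hypothetical helper: grant a service account the read-only "view" ClusterRole
# inside one namespace by creating a RoleBinding.
# Usage: grant_view <namespace> <service-account>
grant_view() {
  kubectl create rolebinding "$2-view" -n "$1" \
    --clusterrole=view \
    --serviceaccount="$1:$2"
}
```

For example, `sa_can_i dev app list pods` prints yes or no; after `grant_view dev app` the same check should print yes.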
8. Resource Limits
OOMKilled : Container exceeded its memory limit. Increase resources.limits.memory (raising resources.requests.memory alongside it if needed) or optimise the application.
CPU throttling : CPU limit is too low. Raise resources.limits.cpu.
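To confirm that a restart was caused by the memory limit, read the last termination state of each container. The sketch below uses hypothetical function names; the exit-code meanings follow the usual 128 + signal number convention.

```shell
# Translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (128+9), typical for OOMKilled" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "application-defined error" ;;
  esac
}

# Hypothetical helper: print "<container> <reason> <exit-code>" per container.
# Reason and exit code are empty for containers that have never restarted.
# Usage: last_termination <namespace> <pod-name>
last_termination() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A line such as "app OOMKilled 137" from `last_termination` confirms the memory limit as the cause; other codes can be passed to `explain_exit_code`.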
9. Toolbox
kubectl get <resource> – retrieve current status.
kubectl describe <resource> – detailed view with events.
kubectl logs <pod> – view container logs.
kubectl exec -it <pod> -- sh – open an interactive shell inside the container.
kubectl get events --all-namespaces – list cluster‑wide events.
kubectl top nodes / kubectl top pods – show resource usage (requires metrics‑server).
kubectl debug – attach a temporary debug container to an existing pod.
10. Practical Tips
Use kubectl explain <resource.field> to view API field documentation on the fly.
For aggregated logs, tools such as stern or kubetail can tail multiple pod logs simultaneously.
Leverage kubectl debug to inject a troubleshooting container without modifying the original pod spec.
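The last two tips can be sketched with plain kubectl. The function names are hypothetical, busybox is only an example image, and --target needs a container runtime that supports sharing the process namespace with an ephemeral container.

```shell
# Hypothetical helper: attach an ephemeral busybox container next to an existing container.
# Usage: debug_pod <namespace> <pod-name> <container-name>
debug_pod() {
  kubectl debug -it "$2" -n "$1" --image=busybox --target="$3" -- sh
}

# Hypothetical helper: follow the logs of every pod that matches a label selector.
# --prefix marks each line with the pod and container it came from.
# Usage: tail_by_label <namespace> <label-selector>
tail_by_label() {
  kubectl logs -n "$1" -l "$2" --all-containers --prefix --tail=20 -f
}
```

`tail_by_label dev app=web` is a lightweight stand-in when stern or kubetail is not installed; kubectl limits the number of concurrent log streams, so dedicated tools remain the better choice for large selectors.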
11. Summary & Recommendations
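Most issues can be diagnosed by examining kubectl describe output and container logs.
Understand the Pod lifecycle: the phases are Pending, Running, Succeeded, Failed, and Unknown. CrashLoopBackOff is not a phase but the waiting reason of a container that keeps restarting.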
Validate that the container image runs locally before deploying to the cluster.
Isolate problems by reproducing them with a minimal Deployment and Service.
Pay close attention to resource requests/limits and RBAC permissions – OOMKilled, Forbidden, and empty Endpoints are frequent culprits.
Master the core debugging tools (stern, kubectl debug, tcpdump) to accelerate root‑cause analysis.
Ray's Galactic Tech
