Master Kubernetes Troubleshooting: From Pod Failures to DNS Issues
This guide walks through a systematic approach to diagnosing Kubernetes problems: pod startup failures, cluster health checks, event logs, pod status, network and storage verification, container logs, and DNS service checks. Each step includes practical commands and tips.
1. Pod startup failures
Pods are the smallest scheduling unit in Kubernetes; containers inside a pod share its network, storage, and resources. Common reasons for pod failures include:
Resource overallocation: Too many pods on a node can exhaust its CPU or memory, which leads to pod evictions or an unstable node.
Memory or CPU limits exceeded: A container that exceeds its memory limit (for example, because of a memory leak) is killed with the reason OOMKilled; a container that reaches its CPU limit is throttled rather than killed. Mitigate by load-testing and setting realistic resource requests and limits.
Network problems: A misconfigured CNI plugin (e.g., Calico) prevents pod communication.
Storage issues: Unavailable persistent volumes or incorrectly mounted storage cause start-up errors.
Code errors: The application crashes after the container starts.
Configuration errors: Faulty Deployment or StatefulSet manifests prevent pods from being created or started.
Monitoring systems that track node and pod resource usage help pinpoint which of these causes applies.
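One way to narrow down which of these causes applies is to read the reason Kubernetes records when a container stops or cannot start. The sketch below maps a few such reason strings to the categories above. The function name explain_reason and the hint wording are illustrative assumptions, not part of kubectl, and the mapping is deliberately incomplete.

```shell
# Map a container reason string to a likely cause from the list above.
# The reason strings are reported by Kubernetes; the hints are this sketch's own wording.
explain_reason() {
  case "$1" in
    OOMKilled)                  echo "memory limit exceeded" ;;
    Error)                      echo "application exited with an error" ;;
    CrashLoopBackOff)           echo "container keeps crashing after start" ;;
    CreateContainerConfigError) echo "configuration error" ;;
    *)                          echo "unknown: check events and logs" ;;
  esac
}

# Usage against a live cluster (the pod name is a placeholder):
#   reason=$(kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
#   explain_reason "$reason"
```

The jsonpath expression reads the last termination reason of the first container only; pods with several containers need the index adjusted.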
2. Inspect cluster status
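Check the overall health of the cluster with kubectl get nodes and verify that every node reports Ready. Then confirm that core components such as etcd, kubelet, and kube-proxy are running correctly. On many clusters (for example, those built with kubeadm) etcd and kube-proxy run as pods listed by kubectl get pods -n kube-system, while kubelet runs as a service on each node.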
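On a cluster with many nodes it can help to print only the nodes that need attention. The following is a minimal sketch that filters the default tabular output of kubectl get nodes; the function name not_ready_nodes is an assumption made for illustration.

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the default tabular output of `kubectl get nodes` on stdin.
# Cordoned nodes ("Ready,SchedulingDisabled") are reported too, on purpose.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes | not_ready_nodes
```

An empty result means every node reports Ready.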
3. Trace event logs
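Inspect recent events with kubectl get events, which covers only the current namespace unless you add --all-namespaces. Adding --sort-by=.metadata.creationTimestamp lists them in chronological order. Events carry timestamps and messages about failures in components or applications, which helps you locate the root cause. They are retained only for a limited time (one hour by default), so check them soon after a failure.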
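When the event list is long, grouping events by reason shows which failure dominates. The sketch below counts reasons in the tabular output of kubectl get events --all-namespaces; the function name count_event_reasons is hypothetical, and the column positions assume the default output with the NAMESPACE column present.

```shell
# Count events per REASON, most frequent first.
# Reads `kubectl get events --all-namespaces` tabular output on stdin,
# where the data columns are: namespace, last seen, type, reason, object, message.
count_event_reasons() {
  awk 'NR > 1 { count[$4]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Usage against a live cluster (Warning events only):
#   kubectl get events --all-namespaces --field-selector type=Warning | count_event_reasons
```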
4. Focus on pod status
List all pods across namespaces with kubectl get pods --all-namespaces and identify pods that are not Running (e.g., Pending, CrashLoopBackOff). For a specific pod, run kubectl describe pod <pod-name> -n <namespace> to see its detailed status and events.
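The status check above can be sketched as a small filter that prints only the pods needing attention. The function name unhealthy_pods is an assumption; it relies on the default column order of kubectl get pods --all-namespaces and treats Completed pods as healthy.

```shell
# Print "namespace/name status" for pods that are neither Running nor Completed.
# Reads `kubectl get pods --all-namespaces` tabular output on stdin,
# where the first four columns are: namespace, name, ready, status.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces | unhealthy_pods
```

A pod can show Running while its containers are not ready, so the READY column is still worth a look for pods this filter lets through.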
5. Check network connectivity
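Verify that services, pods, and nodes can reach one another. Use kubectl get services and kubectl describe service <svc-name> to confirm that each Service exists, selects the intended pods, and exposes the expected ports. Ensure network policies and firewall rules are correctly configured.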
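A Service whose selector matches no ready pods has no endpoints and cannot route traffic. As an illustration that goes slightly beyond the commands named above, the sketch below filters the output of kubectl get endpoints --all-namespaces; the function name services_without_endpoints is hypothetical, and it assumes the default tabular output, where an empty endpoint list is shown as <none>.

```shell
# Print "namespace/name" for Services that currently have no endpoints.
# Reads `kubectl get endpoints --all-namespaces` tabular output on stdin,
# where the columns are: namespace, name, endpoints, age.
services_without_endpoints() {
  awk 'NR > 1 && $3 == "<none>" { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   kubectl get endpoints --all-namespaces | services_without_endpoints
```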
6. Review storage configuration
If your application uses Persistent Volumes (PV) or StorageClasses, confirm their status with kubectl get pv, kubectl get pvc, and kubectl get storageclass. Check that claims are bound and volumes are accessible.
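The claim check can be sketched as a filter over kubectl get pvc output. The function name unbound_claims is an assumption; the filter reads only the first three columns, which keep their positions even when a pending claim leaves the volume and capacity columns empty.

```shell
# Print "namespace/name status" for PersistentVolumeClaims that are not Bound.
# Reads `kubectl get pvc --all-namespaces` tabular output on stdin,
# where the first three columns are: namespace, name, status.
unbound_claims() {
  awk 'NR > 1 && $3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage against a live cluster:
#   kubectl get pvc --all-namespaces | unbound_claims
```

For a claim that stays Pending, kubectl describe pvc <pvc-name> -n <namespace> shows the events that explain why.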
7. Examine container logs
Fetch logs from a pod with kubectl logs <pod-name>. For pods with multiple containers, specify the container: kubectl logs <pod-name> -c <container-name>. If the container has restarted, add --previous to read the logs of the prior instance. Logs often reveal application-level errors.
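Long logs are easier to scan when only suspicious lines are shown. The following sketch filters a log stream for common error keywords; the function name error_lines and the keyword list are assumptions and should be adapted to the application's log format.

```shell
# Print log lines that mention common failure keywords, prefixed with line numbers.
# Reads a log stream on stdin; matching is case-insensitive.
error_lines() {
  grep -inE 'error|exception|fatal|panic'
}

# Usage against a live cluster (names are placeholders):
#   kubectl logs <pod-name> --tail=500 | error_lines
#   kubectl logs <pod-name> -c <container-name> --previous | error_lines
```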
8. Cluster network communication
Kubernetes clusters rely on CNI plugins for internal networking. Common plugins include:
Calico – provides IP address allocation and network policy enforcement.
Flannel – a simple overlay network with IP address allocation; it does not enforce network policies.
Canal – combines Flannel for networking with Calico for network policy enforcement.
Network communication patterns:
Between containers in the same pod.
Pod‑to‑Pod communication.
Pod‑to‑Service communication.
Service‑to‑external traffic.
9. Service DNS verification
Test DNS resolution from a pod:
u@pod$ nslookup hostnames
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local
If the lookup fails, the pod and service may be in different namespaces. Try a qualified name:
u@pod$ nslookup hostnames.default
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames.default
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local
Or use the fully-qualified name:
u@pod$ nslookup hostnames.default.svc.cluster.local
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local
Name: hostnames.default.svc.cluster.local
Address 1: 10.0.1.175 hostnames.default.svc.cluster.local
Check /etc/resolv.conf to ensure the nameserver points to the cluster DNS service and that the search line contains the appropriate suffixes:
u@pod$ cat /etc/resolv.conf
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.com
options ndots:5
The DNS service IP and search domains shown here are examples; the values in your cluster may differ.
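The resolv.conf check can also be scripted. The sketch below extracts the nameserver and tests whether a given suffix appears on the search line; the function names resolv_nameserver and resolv_has_search are hypothetical helpers, not standard tools.

```shell
# Print the first nameserver from a resolv.conf-style stream on stdin.
resolv_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# Succeed (exit 0) if the search line on stdin lists the given suffix as a whole entry.
resolv_has_search() {
  awk -v want="$1" '
    $1 == "search" { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
    END { exit found ? 0 : 1 }'
}

# Usage against a live cluster (the pod name is a placeholder):
#   kubectl exec <pod-name> -- cat /etc/resolv.conf | resolv_nameserver
#   kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
#   kubectl exec <pod-name> -- cat /etc/resolv.conf | resolv_has_search svc.cluster.local
```

If the first two commands print different addresses, the pod is not using the cluster DNS service. The Service name kube-dns is taken from the lookup output above and may be different in your cluster.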
10. Summary
The exact troubleshooting steps depend on your cluster configuration, deployment method, and observed symptoms. By following the systematic checks above—examining pod health, cluster status, events, networking, storage, logs, and DNS—you can more confidently identify and resolve Kubernetes issues, keeping your applications stable.
dbaplus Community
