Master the Top 10 Kubernetes Troubleshooting Techniques Every DevOps Engineer Needs
This guide walks DevOps engineers through ten essential Kubernetes troubleshooting techniques—covering CrashLoopBackOff, ImagePullBackOff, NotReady nodes, Pending pods, and OOMKilled errors—with step‑by‑step commands, log analysis, and resource management strategies to quickly diagnose and resolve common cluster issues.
1. Fix CrashLoopBackOff errors
CrashLoopBackOff occurs when a pod repeatedly fails to start. Identify affected pods and investigate the cause.
Step 1: View all pod statuses
kubectl get pods
Look for pods whose STATUS column shows CrashLoopBackOff.
Step 2: Describe the problematic pod
kubectl describe pod <pod-name>
Check the Events section for clues such as image-pull failures, missing configuration, or permission issues.
Step 3: View container logs
kubectl logs <pod-name>
If the container exits before producing logs, add the --previous flag to see logs from the last failed instance:
kubectl logs <pod-name> --previous
Example
kubectl get pods
my-webapp-pod   0/1   CrashLoopBackOff   5 (2m ago)
Describing the pod reveals the restart events; checking the logs shows a missing DATABASE_URL environment variable. Adding the variable resolves the issue.
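The fix can also be made declaratively in the Deployment manifest. This is an illustrative sketch: the deployment name, image, and database URL are assumptions, not values from the example above.

```yaml
# Illustrative sketch: add the missing env var to the container spec.
# Deployment name, image, and URL value are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-webapp
spec:
  template:
    spec:
      containers:
        - name: my-webapp
          image: my-webapp:latest   # assumed image
          env:
            - name: DATABASE_URL
              value: "postgres://db.internal:5432/app"   # assumed value
```

Applying the manifest (kubectl apply -f) triggers a new rollout with the variable set, so the pod no longer crashes on startup.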
2. Diagnose ImagePullBackOff failures
ImagePullBackOff appears when Kubernetes cannot pull a container image, often due to authentication problems or an incorrect image name.
Step 1: Locate the failing deployment
kubectl get deployments
Pay attention to the READY column; a value like 0/3 indicates pod-level problems.
Step 2: Track rollout status
kubectl rollout status deployment <deployment-name>
Step 3: Inspect the failing pod
kubectl get pods
my-app-7d4b8c8f-xyz   0/1   ImagePullBackOff   0   2m
kubectl describe pod my-app-7d4b8c8f-xyz
The description shows an error such as "Failed to pull image "private-registry.com/my-app:v1.2.3": pull access denied".
Step 4: Create a Docker registry secret
kubectl create secret docker-registry my-registry-secret \
--docker-server=private-registry.com \
--docker-username=myuser \
--docker-password=mypassword \
--docker-email=<email>
Step 5: Patch the deployment to use the secret
kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"my-registry-secret"}]}}}}'
After patching, the deployment pulls the image successfully. Verify with:
kubectl rollout status deployment my-app
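Instead of patching, the pull secret can be declared directly in the manifest. A sketch, reusing the names from the example above:

```yaml
# Illustrative alternative to `kubectl patch`: declare the pull secret
# in the Deployment spec so it survives future applies.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      imagePullSecrets:
        - name: my-registry-secret
      containers:
        - name: my-app
          image: private-registry.com/my-app:v1.2.3
```

Keeping the secret reference in version-controlled manifests avoids the patch being lost when the deployment is re-applied from source.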
3. Resolve NotReady node errors
A NotReady node prevents pod scheduling and can cause workload downtime.
Step 1: Check node statuses
kubectl get nodes -o wide
Step 2: Examine node conditions
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"
This shows the node's resource capacity. In the full describe output, also check the Conditions section for flags such as DiskPressure or MemoryPressure, and kubelet messages like KubeletHasDiskPressure.
Step 3: Fix disk pressure
sudo journalctl --vacuum-time=3d
This removes journal logs older than three days. Once enough space is freed, the kubelet clears the DiskPressure condition and the node returns to Ready.
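The disk-usage check itself can be scripted. This is a minimal sketch, assuming an 80% threshold (an arbitrary choice, not a Kubernetes default); it parses df-style output from stdin so the logic is demonstrated without a live node:

```shell
# Sketch: flag mount points above a usage threshold (the 80% is an assumption).
# Reads `df`-style output on stdin so it can be demonstrated without a node.
threshold=80
check_disk() {
  awk -v t="$threshold" 'NR > 1 {
    gsub(/%/, "", $5)   # strip the % sign from the Use% column
    if ($5 + 0 > t) print $6 " is at " $5 "% (disk pressure likely)"
  }'
}

# On a real node you would pipe in live data instead: df | check_disk
printf 'Filesystem Size Used Avail Use%% Mounted\n/dev/sda1 100G 92G 8G 92%% /var\n/dev/sdb1 50G 10G 40G 20%% /data\n' | check_disk
# prints: /var is at 92% (disk pressure likely)
```

Running a check like this from a node-level cron job or DaemonSet gives early warning before the kubelet starts evicting pods.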
4. Diagnose Pending pods and service connectivity
A Pending pod indicates a scheduling problem (for example, insufficient resources or unsatisfiable node selectors), while broken service connectivity is often caused by selector mismatches, network policies, or DNS issues.
Step 1: Verify services
kubectl get services --all-namespaces
Step 2: List endpoints
kubectl get endpoints
If a service's endpoints are empty, no pod matches the service selector.
Step 3: Check DNS resolution inside a pod
Run these from inside a running pod (for example via kubectl exec):
nslookup my-service
nslookup my-service.default
Failure to resolve either name indicates DNS configuration problems.
Step 4: Test HTTP connectivity
wget -qO- my-service:80/health
A failed request suggests a misconfigured service, a blocking network policy, or a selector mismatch.
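A throwaway debug pod is a convenient place to run these checks. A sketch, where the pod name and image tag are assumptions:

```yaml
# Illustrative debug pod for in-cluster DNS and connectivity checks.
# Pod name and image tag are assumptions; delete the pod when finished.
apiVersion: v1
kind: Pod
metadata:
  name: net-debug
spec:
  restartPolicy: Never
  containers:
    - name: net-debug
      image: busybox:1.36
      command: ["sleep", "3600"]   # keep the pod alive for interactive use
```

Once it is running, exec into it, e.g. kubectl exec -it net-debug -- nslookup my-service, then kubectl delete pod net-debug when done.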
5. Address OOMKilled errors
OOMKilled occurs when a container exceeds its memory limit: the kernel terminates the process (exit code 137) and Kubernetes restarts the container, often producing a CrashLoopBackOff pattern.
Step 1: Check cluster‑wide resource usage
kubectl top nodes
Step 2: Identify high-usage pods
kubectl top pods --all-namespaces --sort-by=memory
Step 3: Review pod resource requests and limits
kubectl describe pod <pod-name> | grep -A 10 "Requests\|Limits"
Step 4: Enable Horizontal Pod Autoscaling
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
Verify HPA status with kubectl get hpa and kubectl describe hpa my-app. Note that autoscaling spreads load across replicas; if a single container still exceeds its memory limit, raise the limit in the pod spec as well.
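Where the limit itself is too low, raising the memory request and limit in the container spec is the direct fix. A sketch; the values are illustrative assumptions to tune per workload:

```yaml
# Illustrative resource settings (values are assumptions, tune per workload).
# Goes under spec.template.spec.containers[] in the Deployment.
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the pod
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this triggers OOMKilled
```

Setting the request close to observed usage (from kubectl top pods) and the limit with comfortable headroom avoids both OOM kills and wasteful over-reservation.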
By systematically checking pod status, events, logs, node health, service configuration, DNS, and resource limits, engineers can quickly pinpoint and resolve the most common Kubernetes failures.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
