
Master the Top 10 Kubernetes Troubleshooting Techniques Every DevOps Engineer Needs

This guide walks DevOps engineers through ten essential Kubernetes troubleshooting techniques—covering CrashLoopBackOff, ImagePullBackOff, NotReady nodes, Pending pods, and OOMKilled errors—with step‑by‑step commands, log analysis, and resource management strategies to quickly diagnose and resolve common cluster issues.


1. Fix CrashLoopBackOff errors

CrashLoopBackOff occurs when a pod repeatedly fails to start. Identify affected pods and investigate the cause.

Step 1: View all pod statuses

kubectl get pods

Look for pods whose STATUS column shows CrashLoopBackOff.

Step 2: Describe the problematic pod

kubectl describe pod <pod-name>

Check the events section for clues such as image‑pull failures, missing configuration, or permission issues.

Step 3: View container logs

kubectl logs <pod-name>

If the container exits before producing logs, add the --previous flag to see logs from the last failed instance.

Example

kubectl get pods
NAME            READY   STATUS             RESTARTS
my-webapp-pod   0/1     CrashLoopBackOff   5 (2m ago)

Describing the pod reveals events; checking logs shows a missing DATABASE_URL environment variable. Adding the variable resolves the issue.
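The fix from the example can be sketched as a patch to the deployment's container spec. This is a minimal, illustrative fragment: the deployment/container name `my-webapp` and the connection string are assumptions, not taken from a real manifest.

```yaml
# Illustrative fix: supply the missing DATABASE_URL environment variable
# in the pod template. Container name and URL value are hypothetical.
spec:
  template:
    spec:
      containers:
        - name: my-webapp
          env:
            - name: DATABASE_URL
              value: "postgres://db.example.com:5432/app"
```

Applying this change triggers a new rollout, and the container should start cleanly once the variable is present.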

2. Diagnose ImagePullBackOff failures

ImagePullBackOff appears when Kubernetes cannot pull a container image, often due to authentication problems or an incorrect image name.

Step 1: Locate the failing deployment

kubectl get deployments

Pay attention to the READY column; a value like 0/3 indicates pod‑level problems.

Step 2: Track rollout status

kubectl rollout status deployment <deployment-name>

Step 3: Inspect the failing pod

kubectl get pods
NAME                  READY   STATUS             RESTARTS   AGE
my-app-7d4b8c8f-xyz   0/1     ImagePullBackOff   0          2m
kubectl describe pod my-app-7d4b8c8f-xyz

The description shows an error such as "Failed to pull image \"private-registry.com/my-app:v1.2.3\": pull access denied".

Step 4: Create a Docker registry secret

kubectl create secret docker-registry my-registry-secret \
  --docker-server=private-registry.com \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=<your-email>

Step 5: Patch the deployment to use the secret

kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"my-registry-secret"}]}}}}'

After patching, the deployment pulls the image successfully. Verify with kubectl rollout status deployment my-app.
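The same configuration can be expressed declaratively instead of via `kubectl patch`. A minimal sketch of the relevant part of the deployment manifest, reusing the names from the example above (the container name is an assumption):

```yaml
# Declarative equivalent of the patch: reference the registry secret
# from the pod template so the kubelet can authenticate the pull.
spec:
  template:
    spec:
      imagePullSecrets:
        - name: my-registry-secret
      containers:
        - name: my-app
          image: private-registry.com/my-app:v1.2.3
```

Keeping the secret reference in the manifest (rather than a one-off patch) ensures it survives future `kubectl apply` runs.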

3. Resolve NotReady node errors

A NotReady node prevents pod scheduling and can cause workload downtime.

Step 1: Check node statuses

kubectl get nodes -o wide

Step 2: Examine node conditions

kubectl describe node <node-name> | grep -A 10 "Conditions"

Look for conditions such as DiskPressure or MemoryPressure reporting True; the reason column will show messages like KubeletHasDiskPressure.

Step 3: Fix disk pressure

On the affected node (for example via SSH), free space by pruning old journal logs:

sudo journalctl --vacuum-time=3d

Once disk usage drops back below the kubelet's eviction threshold, the node returns to the Ready state.

4. Diagnose Pending pods and service connectivity

Pending indicates the scheduler cannot place the pod (for example due to insufficient resources or node taints), while connectivity failures against running pods usually trace back to selector mismatches, network policies, or DNS issues. The steps below work through the service-side causes.

Step 1: Verify services

kubectl get services --all-namespaces

Step 2: List endpoints

kubectl get endpoints

If endpoints are empty, no pod matches the service selector.
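Empty endpoints almost always mean the service's selector does not match any pod's labels. A minimal matching pair looks like this; the label `app: my-service` and the ports are illustrative:

```yaml
# The service's spec.selector must exactly match the labels on the
# target pods' metadata.labels, or the endpoints list stays empty.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-service      # must equal the pods' metadata.labels
  ports:
    - port: 80
      targetPort: 8080
```

Compare the selector against `kubectl get pods --show-labels` to spot a mismatch quickly.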

Step 3: Check DNS resolution inside a pod

These commands must run from within the cluster; one way is a temporary busybox shell:

kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- sh
nslookup my-service
nslookup my-service.default

Failure to resolve either the short name or the namespaced name indicates a cluster DNS configuration problem.

Step 4: Test HTTP connectivity

From inside a pod (for example the same busybox debug shell):

wget -qO- my-service:80/health

A non-successful response suggests a misconfigured service port, a blocking network policy, or a selector mismatch.
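If the namespace enforces a default-deny network policy, traffic to the service's backing pods must be explicitly allowed. A minimal allow rule, sketched with the same illustrative label and port as above:

```yaml
# Hypothetical NetworkPolicy permitting in-namespace traffic to the
# pods behind my-service; label and port values are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-to-my-service
spec:
  podSelector:
    matchLabels:
      app: my-service
  ingress:
    - from:
        - podSelector: {}   # any pod in this namespace
      ports:
        - port: 8080
          protocol: TCP
```

List existing policies with `kubectl get networkpolicies` to check whether one is silently dropping the traffic.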

5. Address OOMKilled errors

OOMKilled occurs when a container exceeds its memory limit: the kernel kills the process (exit code 137) and the kubelet restarts the container.

Step 1: Check cluster‑wide resource usage (requires the metrics-server add-on)

kubectl top nodes

Step 2: Identify high‑usage pods

kubectl top pods --all-namespaces --sort-by=memory

Step 3: Review pod resource requests and limits

kubectl describe pod <pod-name> | grep -A 10 "Requests\|Limits"
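The direct fix for OOMKilled is usually to raise the container's memory limit (or reduce the workload's actual usage). A sketch of the requests/limits stanza for the container spec; the values are illustrative and should be tuned to usage observed via `kubectl top pods`:

```yaml
# Illustrative resource settings: requests guide scheduling,
# limits cap usage. Exceeding limits.memory triggers OOMKilled.
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

Setting requests close to typical usage and limits with comfortable headroom avoids both wasteful over-provisioning and OOM kills.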

Step 4: Enable Horizontal Pod Autoscaling

kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10

Verify HPA status with kubectl get hpa and kubectl describe hpa my-app.
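The `kubectl autoscale` command above can also be expressed as a declarative manifest using the autoscaling/v2 API, which is easier to version-control; the field values mirror the command's flags:

```yaml
# Declarative HPA equivalent to:
# kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that horizontal scaling spreads load across replicas; it does not raise the per-container memory ceiling, so combine it with appropriate limits when chasing OOMKilled errors.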

By systematically checking pod status, events, logs, node health, service configuration, DNS, and resource limits, engineers can quickly pinpoint and resolve the most common Kubernetes failures.

Tags: Kubernetes, DevOps, container, kubectl
Written by

Cloud Native Technology Community

The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
