Kubernetes Pod Troubleshooting Guide: Diagnose CrashLoopBackOff, OOMKilled & More
A comprehensive, step‑by‑step guide for SREs and DevOps engineers to diagnose and resolve common Kubernetes pod issues—including CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, Evicted, and Terminating—by leveraging pod lifecycle knowledge, kubectl commands, logs, events, node inspection, scripts, real‑world case studies, and monitoring best practices.
Overview
This guide provides a systematic, production‑grade workflow for troubleshooting Kubernetes Pods (v1.32+). It covers the full pod lifecycle, explains each STATUS and Phase, and presents a layered diagnostic coordinate system that can be applied during on‑call incidents or routine health checks.
Pod Lifecycle & Phase Semantics
A Pod progresses through five primary phases:
Pending – the pod has been accepted by the API server but has not yet been bound to a node, or its container images are still being pulled.
Running – at least one container is running; the pod is considered healthy unless a probe fails.
Succeeded – all containers terminated with exit code 0; the pod will not be restarted.
Failed – all containers have terminated and at least one exited with a non‑zero code or was killed by the system; evicted pods also land in this phase.
Unknown – the kubelet cannot report the pod status (node loss, network partition, etc.).
Within each phase the STATUS column can show many sub‑states (e.g., CrashLoopBackOff, ImagePullBackOff, Terminating, Evicted). Understanding the mapping between Phase and STATUS is the first step to a correct diagnosis.
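Phase is a coarse, five‑valued field in .status, while the STATUS column blends in container‑level waiting reasons. Printing both side by side makes the mapping explicit; a minimal sketch using custom columns:
# Phase vs. the container-level waiting reason that drives the STATUS column
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,WAITING:.status.containerStatuses[*].state.waiting.reason'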
Diagnostic Coordinate System
The troubleshooting path follows a top‑down model, moving from the API view to the node and network layers:
Layer 1 – API (kubectl) : kubectl get pods, kubectl describe pod, kubectl logs, kubectl get events, kubectl top pods.
Layer 2 – Scheduling : kubectl describe node, kubectl get pv/pvc, kubectl get ns.
Layer 3 – Runtime : crictl ps, crictl inspect, crictl logs, crictl images.
Layer 4 – System : journalctl -u kubelet, dmesg, cat /proc/pressure/*, free -h, df -h.
Layer 5 – Network : kubectl exec … curl|wget, ss -tlnp, iptables -t nat -L.
Step‑by‑Step Diagnostic Commands
1. Gather Core Information
# Basic pod overview (all namespaces)
kubectl get pods -A -o wide
# Detailed pod description
kubectl describe pod <pod> -n <ns>
# Current and previous container logs
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
# Recent events (sorted)
kubectl get events -n <ns> --sort-by=.lastTimestamp
# Resource usage (Metrics Server required)
kubectl top pod <pod> -n <ns>
2. Pending / Scheduling Issues
# Inspect scheduling failures
kubectl describe pod <pod> -n <ns> | grep -i "FailedScheduling"
# Show node capacity and allocated resources
kubectl describe node <node> | grep -A5 "Allocated resources"
# Verify resource requests
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources.requests}'
If the event shows Insufficient cpu or Insufficient memory, compare the pod's requests against the node's allocatable values, as in the sketch below. Adjust requests, add nodes, or use a ResourceQuota to prevent a single workload from starving the cluster.
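One way to put the two numbers side by side (node, pod, and namespace are placeholders):
# Node allocatable capacity vs. the pod's per-container requests
kubectl get node <node> -o jsonpath='allocatable: cpu={.status.allocatable.cpu} memory={.status.allocatable.memory}{"\n"}'
kubectl get pod <pod> -n <ns> -o jsonpath='{range .spec.containers[*]}{.name}: requests={.resources.requests}{"\n"}{end}'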
3. Image Pull Errors (ImagePullBackOff / ErrImagePull)
# Check the exact error in the pod description
kubectl describe pod <pod> -n <ns> | grep -i "Pull"
# Verify the image exists (run on any machine with registry access)
docker manifest inspect registry.example.com/app:tag
# Test pulling from a node (if you have SSH access)
crictl pull registry.example.com/app:tag
# Validate imagePullSecret
kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
Typical root causes are a typo in the image name, a missing tag, registry authentication failure, or a network policy blocking outbound traffic.
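To rule out an authentication mismatch, it helps to list the registries the pull secret actually covers and compare them with the image's registry host; a sketch assuming jq is available:
# Which registries does the imagePullSecret authenticate against?
kubectl get secret <secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq -r '.auths | keys[]'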
4. CrashLoopBackOff
# Show the last crash logs
kubectl logs <pod> -n <ns> --previous
# Inspect exit code and restart count
kubectl describe pod <pod> -n <ns> | grep -E "Exit Code|Restart Count"
Common reasons:
Application mis‑configuration (exit code 1).
Missing environment variables or ConfigMaps.
Uncaught exceptions (Java Exception, Python Traceback, etc.).
Resource limits causing OOM (see section 5).
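A compact way to read the exit code and restart count per container, without paging through describe output; a jsonpath sketch:
# name, restart count, last exit code and reason for every container
kubectl get pod <pod> -n <ns> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'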
5. OOMKilled
# Verify OOM reason
kubectl describe pod <pod> -n <ns> | grep -i "OOMKilled"
# Show container limits and requests
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources}'
# Real‑time memory usage
kubectl top pod <pod> -n <ns>
Two sources exist:
cgroup OOM : container exceeds its memory.limit. Increase the limit or fix the leak.
Node‑level OOM : the node runs out of memory; the kernel kills a process. Check node pressure with dmesg | grep -i oom and consider adding nodes or reducing overall memory pressure.
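The kernel log tells the two apart: a cgroup kill logs "Memory cgroup out of memory", while a node‑level kill is a global OOM. A minimal sketch:
# Confirm which container was OOMKilled
kubectl get pod <pod> -n <ns> -o jsonpath='{range .status.containerStatuses[?(@.lastState.terminated.reason=="OOMKilled")]}{.name}{"\n"}{end}'
# On the node: cgroup kills mention "Memory cgroup out of memory"; global kills do not
dmesg -T | grep -i "out of memory" | tail -5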
6. CreateContainerConfigError
# Typical missing ConfigMap or Secret
kubectl describe pod <pod> -n <ns> | grep -i "ConfigMap" -A2
# Verify existence
kubectl get configmap -n <ns>
kubectl get secret -n <ns>
If the pod references a non‑existent ConfigMap or Secret, create it or update the pod spec.
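Finding every ConfigMap a pod references by hand is error‑prone; a jq sketch that walks envFrom, env, and volumes (the same pattern works for Secrets):
# ConfigMaps referenced via envFrom, env valueFrom, and volumes
kubectl get pod <pod> -n <ns> -o json | jq -r '
  [.spec.containers[].envFrom[]?.configMapRef.name,
   .spec.containers[].env[]?.valueFrom.configMapKeyRef.name,
   .spec.volumes[]?.configMap.name]
  | unique[] | select(. != null)'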
7. Evicted Pods
# List evicted pods
kubectl get pods -A --field-selector=status.phase=Failed -o jsonpath='{range .items[?(@.status.reason=="Evicted")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
# Inspect eviction reason
kubectl describe pod <pod> -n <ns> | grep -i "Evicted"Eviction is driven by node resource pressure (disk, memory, inode, PID). Use the node pressure audit (section 9) to pinpoint the offending node.
8. Terminating / Finalizer Stuck
# Show finalizers
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'
# Remove a finalizer (use with caution)
kubectl patch pod <pod> -n <ns> -p '{"metadata":{"finalizers":null}}' --type=merge
# Force delete as a last resort
kubectl delete pod <pod> -n <ns> --force --grace-period=0
Typical causes: a custom controller that crashed and never cleared its finalizer, or a long‑running preStop hook. Fix the controller or shorten the hook before resorting to manual removal.
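To find every pod stuck in deletion cluster‑wide, along with the finalizers holding it, a jq sketch:
# Pods with deletionTimestamp set (stuck Terminating) and their finalizers
kubectl get pods -A -o json | jq -r '.items[]
  | select(.metadata.deletionTimestamp != null)
  | "\(.metadata.namespace)/\(.metadata.name) finalizers=\(.metadata.finalizers)"'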
9. Unknown Pods (Node Lost)
# Identify the node and its condition
kubectl get pod <pod> -n <ns> -o wide
kubectl describe node <node>
# Check node heartbeat
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'
# If the node is unreachable, SSH into it (if possible) and inspect kubelet logs:
ssh <node>
journalctl -u kubelet --since "10m ago"
When the node is truly dead, cordon and drain it, then let the controller manager recreate the pods on healthy nodes.
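A minimal cordon‑and‑drain sequence; --force also evicts bare pods with no controller, so drop it if that is not acceptable:
kubectl cordon <node>          # stop new pods from landing on the node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --force
# after repair or replacement:
kubectl uncordon <node>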
Exit Code Reference (Common Values)
0 – Successful exit. Usually indicates a Job that completed.
1 – Generic application error. Check the container logs.
2 – Shell misuse (e.g., wrong arguments). Verify command and args.
126 – Command not executable (permission issue).
127 – Command not found. Ensure the binary exists in the image and PATH is correct.
128+N – Process killed by signal N. 137 (SIGKILL) often means OOM; 139 (SIGSEGV) indicates a crash; 143 (SIGTERM) is a graceful termination.
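For codes above 128, subtracting 128 yields the signal number, which the shell can map back to a name; a quick sketch:
# Map exit code 137 back to its signal name (prints KILL)
code=137
kill -l $((code - 128))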
Node Inspection Commands
# General node health
kubectl get node <node> -o wide
kubectl describe node <node>
# Resource pressure conditions
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
# Low‑level system info (run on the node via SSH)
free -h
df -h
cat /proc/pressure/memory
journalctl -u kubelet | grep -i error | tail -20
Debugging with kubectl debug
# Add an ephemeral container to a running pod (ephemeral containers are GA since v1.25)
kubectl debug -it <pod> -n <ns> \
--image=nicolaka/netshoot:latest \
--target=<container>
# Create a copy of a failing pod for offline inspection
kubectl debug <pod> -n <ns> \
--copy-to=debug-<pod> \
--container=<container> \
--image=busybox:1.37 -- /bin/sh -c "sleep 3600"
# Debug a node without SSH access
kubectl debug node/<node> -it --image=busybox:1.37
Monitoring & Alerting (Prometheus Rules)
# CrashLoopBackOff – any container restart in the last 15 min
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is repeatedly crashing"
# OOMKilled – container terminated with OOM
- alert: PodOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
for: 0m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} was OOMKilled"
# Pending too long – pod stuck in Pending >15 min
- alert: PodPendingTooLong
expr: kube_pod_status_phase{phase="Pending"} == 1
for: 15m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending for >15 min"
# Node not Ready – node heartbeat missing >5 min
- alert: NodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is NotReady"Best‑Practice Pod Configuration
Always set resources.requests and resources.limits. A common rule of thumb: request ≈ 50–80 % of typical usage; limit ≈ 150–200 % of the request for CPU and 1.5–2× the request for memory (see the sketch after this list).
Prefer Guaranteed QoS for critical services (requests = limits).
Define startupProbe for slow‑starting containers, then livenessProbe and readinessProbe. Example:
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30 # 30 × 10s = 5 min
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 15
failureThreshold: 3
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /readyz
port: 8080
periodSeconds: 10
failureThreshold: 3
timeoutSeconds: 5
Configure terminationGracePeriodSeconds and a short preStop hook to give the application time to shut down gracefully, as sketched below.
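A sketch of graceful‑shutdown settings; the sleep gives load balancers time to drain connections before SIGTERM reaches the process (durations are illustrative):
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]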
Use podAntiAffinity or PodDisruptionBudget to spread critical workloads across nodes and protect against node loss.
Apply ResourceQuota and LimitRange per namespace to prevent a single team from exhausting cluster resources.
Set a PriorityClass for production workloads so they survive node pressure over lower‑priority batch jobs.
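A minimal spec fragment applying the request/limit rule of thumb with Guaranteed QoS for a critical container (all names and values illustrative):
spec:
  priorityClassName: production-critical   # hypothetical PriorityClass
  containers:
  - name: app
    resources:
      requests:              # requests = limits => Guaranteed QoS
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"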
Typical Root‑Cause Matrix
Configuration error – exit code 1, logs show missing env var or bad flag.
Missing dependency – logs contain "connection refused"; verify the Service and its Endpoints (see the check after this list).
Probe mis‑configuration – frequent restarts; increase initialDelaySeconds or add startupProbe.
OOM – exit code 137; raise memory limit or fix memory leak.
Image pull failure – ImagePullBackOff; check image name, registry auth, network.
Scheduling failure – Pending with FailedScheduling; adjust node selector, affinity, or add capacity.
Eviction – node pressure; clean up disk, reduce pod requests, or add nodes.
Finalizer deadlock – Terminating stuck; investigate the owning controller.
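A quick dependency check for the "connection refused" case: confirm the Service has ready endpoints and that its DNS name resolves from inside the cluster (service and namespace are placeholders):
# Does the Service have ready endpoints?
kubectl get endpoints <service> -n <ns>
# Can it be resolved and reached from inside the cluster?
kubectl run nettest --rm -it --image=busybox:1.37 --restart=Never -- nslookup <service>.<ns>.svc.cluster.local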
Conclusion
Effective Kubernetes pod troubleshooting follows a disciplined, evidence‑driven workflow: start with kubectl get/describe to capture the pod phase and events, drill down to container logs, then examine node resources and system logs if the API layer does not reveal the cause. Coupled with solid pod specifications (requests/limits, probes, graceful termination) and proactive monitoring (Prometheus alerts, Grafana dashboards), teams can quickly isolate root causes, minimize downtime, and prevent recurrence.