Why Is My K8s Pod Stuck in CrashLoopBackOff? 5 Proven Troubleshooting Strategies
CrashLoopBackOff is a kubelet back‑off restart policy that can be triggered by application panics, OOM kills, mis‑configured probes, or broken images, and this guide walks you through a systematic debugging workflow, from inspecting pod events and logs to using ephemeral containers and monitoring alerts.
Overview
CrashLoopBackOff is not an error code; it is the kubelet’s exponential back‑off restart policy. When a container exits, kubelet follows the pod’s restartPolicy. If the container repeatedly crashes, kubelet delays the next start (immediate, 10 s, 20 s, 40 s … up to a 300 s cap) and resets the back‑off timer once a container has run cleanly for 10 minutes. Kubernetes 1.32 additionally adds alpha support for tuning this back‑off curve (KEP‑4603), reducing the impact of transient failures.
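You can watch the back‑off in action by listing the pod’s events sorted by time (standard kubectl; substitute your pod name and namespace):

# Back-off delays appear as "Back-off restarting failed container" events
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by=.lastTimestamp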
Key Technical Characteristics
Back‑off timing is visible in pod events.
The container Exit Code is the primary clue for the failure type.
Investigation must span three layers: application code, container runtime, and node environment.
Ephemeral containers (GA since Kubernetes 1.25) provide a non‑intrusive debugging method.
Troubleshooting Workflow
1. Gather Pod State and Logs
Run the basic commands:
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

Important sections:
State / Last State – shows the current reason (e.g., CrashLoopBackOff) and the previous termination reason and exit code.
Events – look for timestamps, repeat counts (e.g., "(x5 over 12m)"), and warnings such as BackOff.
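If you only need the previous exit code, a jsonpath query pulls it directly (the container index 0 assumes a single‑container pod):

# Exit code of the first container's last termination
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'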
2. Verify Container Start Command
In a pod spec, command overrides the image’s ENTRYPOINT and args overrides CMD. A common mistake is to put flags that belong in args into command, so the kubelet tries to execute "--config" as a program and the container exits immediately.
# Incorrect – command written as args format
spec:
  containers:
  - name: myapp
    image: myapp:v2.1
    command: ["--config", "/etc/config/app.yaml"]

# Correct – separate command and args
spec:
  containers:
  - name: myapp
    image: myapp:v2.1
    command: ["/usr/local/bin/myapp"]
    args: ["--config", "/etc/config/app.yaml"]

Validate the image’s default entrypoint with a temporary pod:
# Run the image with a placeholder command so the pod stays up
# (/bin/sh alone would exit immediately without a TTY)
kubectl run test-cmd --image=myapp:v2.1 --restart=Never --command -- sleep 3600
# Exec into the pod and run the entrypoint manually
kubectl exec -it test-cmd -- /usr/local/bin/myapp --config /etc/config/app.yaml

3. Probe Configuration
Kubernetes supports three probes:
startupProbe – determines when the container has finished starting; failures beyond the threshold kill the container.
livenessProbe – checks whether the container is still alive; failure triggers a restart.
readinessProbe – decides whether the container is ready to receive traffic; failure only removes it from Service endpoints.
Typical pitfalls and fixes:
# Pitfall: livenessProbe fires too early for a slow-starting app
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # too short
  periodSeconds: 5
  failureThreshold: 3

# Fix: use a startupProbe for the slow start, then a tolerant livenessProbe
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # wait up to 150 s
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

Debug probes by inspecting pod events (e.g., "Unhealthy" messages) or by injecting an ephemeral container and querying the endpoint:
# Example using an ephemeral container (busybox ships wget, not curl)
kubectl debug <pod-name> -it --image=busybox --target=myapp -- /bin/sh
wget -qO- http://localhost:8080/healthz

4. Detect OOMKilled
If the container exceeds resources.limits.memory, the kernel OOM killer terminates it with exit code 137.
# Check the last termination reason
kubectl describe pod <pod-name> | grep -A 10 "Last State"
# Look for node-level OOM events
kubectl get events -n <namespace> --field-selector reason=OOMKilling
# Kernel log (requires node SSH)
dmesg | grep -i "oom\|killed process" | tail -20

Two common OOM scenarios:
Limits too low – the application’s steady‑state memory exceeds the configured limit.
Memory leak – usage grows over time; monitor trends with Prometheus.
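A PromQL sketch for that trend monitoring (both metrics are standard cAdvisor metrics; the container label value is illustrative, and container_spec_memory_limit_bytes is 0 when no limit is set):

# Memory working set as a fraction of the limit; alert as it approaches 1
container_memory_working_set_bytes{container="myapp"}
  / container_spec_memory_limit_bytes{container="myapp"} > 0.85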
Recommended limits (adjust to actual usage):
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"   # 2× request for burst
    cpu: "500m"

For Java workloads, add 30–50 % overhead on top of -Xmx, because memory outside the heap (metaspace, thread stacks, direct buffers) is not bounded by the heap limit.
5. Init Containers and External Dependencies
Init containers can block the main container until a service becomes reachable. Example:
initContainers:
- name: wait-for-postgres
  image: busybox:1.37
  command: ['sh', '-c', 'until nc -z postgres-svc 5432; do echo "waiting for postgres..."; sleep 2; done']
- name: wait-for-redis
  image: busybox:1.37
  command: ['sh', '-c', 'until nc -z redis-svc 6379; do echo "waiting for redis..."; sleep 2; done']
containers:
- name: myapp
  image: myapp:v2.1

If an Init container hangs, the pod shows Init:0/2. Diagnose with:
# Describe init containers
kubectl describe pod <pod-name> | grep -A 20 "Init Containers"
# View init container logs
kubectl logs <pod-name> -c wait-for-postgres

Prefer application‑level retry logic over long‑running Init containers for services that may restart after the pod is up.
6. Image and Permission Issues
Missing ENTRYPOINT binary – multi‑stage builds must copy the executable.
Architecture mismatch – e.g., ARM image on AMD64 node results in "exec format error" (see the check after this list).
Mutable tag overwritten – use image digests in production to avoid accidental rollbacks.
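To rule out an architecture mismatch, compare the node architecture with what the image publishes (docker manifest inspect requires access to the registry):

# Architecture of each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture
# Architectures available for the image
docker manifest inspect myapp:v2.1 | grep architecture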
Permission problems arise when a pod runs as non‑root but needs to write to a volume. Example security context:
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000      # volume files are group-owned by GID 1000
  volumes:
  - name: data
    emptyDir: {}
  containers:
  - name: myapp
    volumeMounts:
    - name: data
      mountPath: /data

Check logs for "permission denied" and use an ephemeral container to inspect filesystem permissions.
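For a quick permission check, reuse the ephemeral‑container technique described later in this guide:

kubectl debug <pod-name> -it --image=busybox --target=myapp -- /bin/sh
# Inside the shell:
# ps aux                      (which UID the app runs as)
# ls -ld /proc/1/root/data    (ownership and mode of the mounted path)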
7. Exit Code Quick Reference
0 – normal exit; ensure the process stays in the foreground (use exec or tail -f).
1 – application error; inspect previous logs.
2 – shell misuse; verify command syntax.
126 – binary not executable; fix file permissions.
127 – command not found; ensure the binary exists in the image and PATH is correct.
137 – SIGKILL / OOMKilled; verify memory limits.
139 – SIGSEGV; indicates a crash in native code.
143 – SIGTERM; usually caused by a preStop hook timeout – increase terminationGracePeriodSeconds.
Defensive Best Practices
Probe Strategy
Always configure startupProbe for containers that need more than a few seconds to start.
Make livenessProbe conservative: failureThreshold ≥ 3, periodSeconds ≥ 10.
Readiness probes can be aggressive because they only affect traffic routing.
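A readiness probe sketch following these rules (path, port, and thresholds are illustrative):

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2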
Resource Policy
resources:
  requests:
    memory: "<actual steady-state memory>"
    cpu: "<actual steady-state CPU>"
  limits:
    memory: "<1.5-2× requests>"
    cpu: "<optional, per team policy>"

Replace the placeholder values with the measured steady‑state usage of the application.
Application‑Level Defenses
Handle termination signals gracefully to allow clean shutdown:
package main

import (
	"context"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when SIGTERM/SIGINT arrives (pod termination)
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()
	server := &http.Server{Addr: ":8080"}
	go func() {
		<-ctx.Done()
		// Give in-flight requests up to 30 s to finish
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		server.Shutdown(shutdownCtx)
	}()
	if err := server.ListenAndServe(); err != http.ErrServerClosed {
		log.Fatalf("server error: %v", err)
	}
}

Include retry logic for external services (e.g., DB connection retries) instead of relying solely on Init containers, as sketched below.
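A minimal retry sketch under stated assumptions (the connect callback, attempt count, and delay are illustrative; swap in a real DB ping):

package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// retry runs connect up to attempts times with a fixed delay between tries,
// aborting early if ctx is cancelled (e.g., during pod shutdown).
func retry(ctx context.Context, attempts int, delay time.Duration, connect func() error) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = connect(); err == nil {
			return nil
		}
		log.Printf("attempt %d/%d failed: %v", i, attempts, err)
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

func main() {
	// Illustrative stand-in for a real connection check
	err := retry(context.Background(), 10, 2*time.Second, func() error {
		return errors.New("db not ready")
	})
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
}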
Image Security
containers:
- name: myapp
  image: registry.example.com/myapp@sha256:a1b2c3d4...
  imagePullPolicy: IfNotPresent
  securityContext:
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]

Monitoring, Logging, and Alerting
Because kubectl logs may lose data after a crash, deploy a log aggregation DaemonSet (Fluent Bit, Loki, Elasticsearch) that collects container stdout/stderr.
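A minimal Fluent Bit fragment as a sketch (assumes the standard /var/log/containers log path; adjust for your runtime):

# fluent-bit.conf (fragment): tail container logs so they survive crashes
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*

[FILTER]
    Name    kubernetes
    Match   kube.*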
Prometheus Alert Rules (example)
groups:
- name: crashloop.rules
  rules:
  # Pod restarts too often
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
      description: "Container {{ $labels.container }} has restarted {{ $value }} times in the last 15 minutes."
  # OOMKilled detection
  - alert: ContainerOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} was OOMKilled"
      description: "Check memory limits or investigate memory leaks."
  # Persistent CrashLoopBackOff
  - alert: ContainerNotReady
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in CrashLoopBackOff for >10 minutes"
      description: "Immediate investigation required."

Grafana dashboards should display panels for top‑restarting pods, current CrashLoopBackOff pods, OOM event timelines, and containers whose memory usage exceeds 85 % of their limits.
Ephemeral Container Debugging
# Inject a busybox container sharing the target's PID namespace
kubectl debug <pod-name> -it \
  --image=busybox:1.37 \
  --target=myapp \
  -- /bin/sh

# Inside the shell you can:
# - ps aux                  (view target processes)
# - ls /proc/1/root/        (inspect the target's filesystem)
# - nc -z postgres-svc 5432 (test connectivity)

Standard Runbook Checklist
Run kubectl describe pod and note the Exit Code.
Inspect the previous container logs with kubectl logs --previous.
Validate startupProbe and livenessProbe settings.
Check resources.requests and resources.limits; look for OOM events.
Confirm Init containers, ConfigMaps, Secrets, and NetworkPolicies are correct.
Key Takeaways
CrashLoopBackOff is an exponential back‑off restart mechanism; the kubelet resets the counter after 10 minutes of stable running, and Kubernetes 1.32 adds alpha tuning of the back‑off curve (KEP‑4603).
Exit codes map to specific failure domains (e.g., 137 = OOM/SIGKILL, 139 = segmentation fault, 0 = unexpected normal exit).
startupProbe is essential for slow‑starting workloads.
Ephemeral containers provide powerful, non‑intrusive debugging without modifying the pod spec.
Set memory limits at 1.5‑2 × the measured request and monitor with Prometheus.
Centralized log aggregation and alerting are mandatory for production reliability.