
Why Is My K8s Pod Stuck in CrashLoopBackOff? 5 Proven Troubleshooting Strategies

CrashLoopBackOff is the kubelet's exponential back‑off restart behavior, triggered by application panics, OOM kills, mis‑configured probes, or broken start commands and image misconfiguration. This guide walks you through five systematic debugging steps, from inspecting pod events and logs to using ephemeral containers and monitoring alerts.


Overview

CrashLoopBackOff is not an error code; it is the kubelet’s exponential back‑off restart policy. When a container exits, kubelet follows the pod’s restartPolicy. If the container repeatedly crashes, kubelet delays the next start (immediate, 10 s, 20 s, 40 s … up to a 300 s cap). Kubernetes 1.32 adds a pod‑level back‑off reset (KEP‑3329) that clears the counter after a successful run, reducing the impact of transient failures.
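The schedule above can be reproduced with a few lines of shell (assuming the default 10 s base, doubling, and 300 s cap):

```shell
# Sketch of the kubelet back-off schedule: double the delay each crash,
# capped at 300 s (5 minutes).
delay=10
schedule=""
while [ "$delay" -lt 300 ]; do
  schedule="$schedule ${delay}s"
  delay=$((delay * 2))
done
echo "back-off delays:${schedule} 300s (cap)"
# -> back-off delays: 10s 20s 40s 80s 160s 300s (cap)
```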

Key Technical Characteristics

Back‑off timing is visible in pod events.

The container Exit Code is the primary clue for the failure type.

Investigation must span three layers: application code, container runtime, and node environment.

Ephemeral containers (GA since Kubernetes 1.25) provide a non‑intrusive debugging method.

Troubleshooting Workflow

1. Gather Pod State and Logs

Run the basic commands:

kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous

Important sections:

State / Last State – shows the current reason (e.g., CrashLoopBackOff) and the previous termination reason and exit code.

Events – look for timestamps, repeat counts (e.g., "(x5 over 12m)"), and warnings such as BackOff.

2. Verify Container Start Command

In a pod spec, command overrides the image ENTRYPOINT and args overrides CMD. A common mistake is to put the arguments in command, which makes the runtime try to execute a flag (e.g., --config) as the binary, so the container exits immediately.

# Incorrect – command written as args format
spec:
  containers:
  - name: myapp
    image: myapp:v2.1
    command: ["--config", "/etc/config/app.yaml"]

# Correct – separate command and args
spec:
  containers:
  - name: myapp
    image: myapp:v2.1
    command: ["/usr/local/bin/myapp"]
    args: ["--config", "/etc/config/app.yaml"]

Validate the start command with a temporary pod:

# Start a throwaway pod from the image, kept alive with a shell instead of the entrypoint
kubectl run test-cmd --image=myapp:v2.1 --restart=Never --command -- /bin/sh -c "sleep 3600"
# Exec into the pod and run the entrypoint manually
kubectl exec -it test-cmd -- /usr/local/bin/myapp --config /etc/config/app.yaml

3. Probe Configuration

Kubernetes supports three probes:

startupProbe – determines when the container has finished starting; failure kills the container.

livenessProbe – checks whether the container is still alive; failure triggers a restart.

readinessProbe – decides whether the container is ready to receive traffic; failure only removes it from Service endpoints.

Typical pitfalls and fixes:

# Pitfall: livenessProbe fires too early for a slow‑starting app
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # too short
  periodSeconds: 5
  failureThreshold: 3

# Fix: use a startupProbe for the slow start, then a tolerant livenessProbe
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # wait up to 150 s
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
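A quick sanity check of the time budget the startupProbe above grants (periodSeconds × failureThreshold):

```shell
# startupProbe budget = periodSeconds * failureThreshold
PERIOD=5
FAILURE_THRESHOLD=30
echo "startupProbe budget: $((PERIOD * FAILURE_THRESHOLD)) s before kubelet kills the container"
# -> startupProbe budget: 150 s before kubelet kills the container
```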

Debug probes by inspecting pod events (e.g., "Unhealthy" messages) or by injecting an ephemeral container and requesting the endpoint yourself:

# Example using an ephemeral container (busybox ships wget, not curl)
kubectl debug <pod-name> -it --image=busybox --target=myapp -- /bin/sh
wget -qO- http://localhost:8080/healthz

4. Detect OOMKilled

If the container exceeds resources.limits.memory, the kernel OOM killer terminates it with exit code 137.

# Check the last termination reason
kubectl describe pod <pod-name> | grep -A 10 "Last State"
# Look for node‑level OOM events
kubectl get events -n <namespace> --field-selector reason=OOMKilling
# Kernel log (requires node SSH)
dmesg | grep -i "oom\|killed process" | tail -20

Two common OOM scenarios:

Limits too low – the application’s steady‑state memory exceeds the configured limit.

Memory leak – usage grows over time; monitor trends with Prometheus.

Recommended limits (adjust to actual usage):

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"   # 2× request for burst
    cpu: "500m"

For Java workloads, add 30‑50 % overhead on top of -Xmx, because off‑heap memory (metaspace, thread stacks, direct buffers) is not covered by the heap limit.
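As a rough sizing sketch (the 40 % overhead and 512 MiB heap are illustrative, not measured values):

```shell
# JVM container sizing sketch: container limit = -Xmx plus ~40 % overhead
# for metaspace, thread stacks, and direct buffers.
XMX_MIB=512
LIMIT_MIB=$((XMX_MIB * 140 / 100))
echo "-Xmx${XMX_MIB}m -> suggested memory limit: ${LIMIT_MIB}Mi"
# -> -Xmx512m -> suggested memory limit: 716Mi
```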

5. Init Containers and External Dependencies

Init containers can block the main container until a service becomes reachable. Example:

initContainers:
- name: wait-for-postgres
  image: busybox:1.37
  command: ['sh','-c','until nc -z postgres-svc 5432; do echo "waiting for postgres..."; sleep 2; done']
- name: wait-for-redis
  image: busybox:1.37
  command: ['sh','-c','until nc -z redis-svc 6379; do echo "waiting for redis..."; sleep 2; done']
containers:
- name: myapp
  image: myapp:v2.1

If an Init container hangs, the pod shows Init:0/2. Diagnose with:

# Describe init containers
kubectl describe pod <pod-name> | grep -A 20 "Init Containers"
# View init container logs
kubectl logs <pod-name> -c wait-for-postgres

Prefer application‑level retry logic over long‑running Init containers for services that may restart after the pod is up.
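If you do keep Init containers, a bounded retry loop fails fast instead of hanging in Init:0/2 forever. This is a hypothetical helper, not part of the manifest above:

```shell
# Hypothetical bounded-retry helper: unlike an unbounded `until` loop,
# it gives up after a fixed number of attempts so the pod fails visibly.
wait_for() {  # usage: wait_for <max_attempts> <command...>
  max=$1; shift
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then
      echo "gave up after $max attempts" >&2
      return 1
    fi
    sleep 1
  done
}
# e.g. wait_for 30 nc -z postgres-svc 5432
wait_for 3 true && echo "dependency ready"
```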

6. Image and Permission Issues

Missing ENTRYPOINT binary – multi‑stage builds must copy the executable.

Architecture mismatch – e.g., ARM image on AMD64 node results in "exec format error".

Mutable tag overwritten – use image digests in production to avoid accidental rollbacks.

Permission problems arise when a pod runs as non‑root but needs to write to a volume. Example security context:

spec:
  securityContext:        # pod-level: applies to all containers
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000         # volume files are group-owned by GID 1000
  volumes:
  - name: data
    emptyDir: {}
  containers:
  - name: myapp
    image: myapp:v2.1
    volumeMounts:
    - name: data
      mountPath: /data

Check logs for "permission denied" and use an ephemeral container to inspect filesystem permissions.

7. Exit Code Quick Reference

0 – normal exit; ensure the process stays in the foreground (use exec or tail -f).

1 – application error; inspect the previous logs.

2 – shell misuse; verify command syntax.

126 – binary not executable; fix file permissions.

127 – command not found; ensure the binary exists in the image and PATH is correct.

137 – SIGKILL / OOMKilled; verify memory limits.

139 – SIGSEGV; indicates a crash in native code.

143 – SIGTERM; usually caused by a preStop hook timeout – increase terminationGracePeriodSeconds.
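Codes above 128 encode a fatal signal as 128 + signal number, which is easy to verify:

```shell
# Exit code for a signal-terminated process = 128 + signal number
echo "SIGKILL (9)  -> $((128 + 9))"    # 137
echo "SIGSEGV (11) -> $((128 + 11))"   # 139
echo "SIGTERM (15) -> $((128 + 15))"   # 143
```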

Defensive Best Practices

Probe Strategy

Always configure startupProbe for containers that need more than a few seconds to start.

Make livenessProbe conservative: failureThreshold ≥ 3, periodSeconds ≥ 10.

Readiness probes can be aggressive because they only affect traffic routing.

Resource Policy

resources:
  requests:
    memory: "<actual steady-state memory>"
    cpu: "<actual steady-state CPU>"
  limits:
    memory: "<1.5-2x requests>"
    cpu: "<optional, per team policy>"

Replace the placeholder values with the measured steady‑state usage of the application.
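Applying the 1.5‑2× guideline to an illustrative 256 Mi steady state:

```shell
# Derive the memory limit range from a measured request (illustrative numbers)
REQUEST_MIB=256
LOW=$((REQUEST_MIB * 3 / 2))   # 1.5x
HIGH=$((REQUEST_MIB * 2))      # 2x
echo "memory limit range: ${LOW}Mi - ${HIGH}Mi"
# -> memory limit range: 384Mi - 512Mi
```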

Application‑Level Defenses

Handle termination signals gracefully to allow clean shutdown:

package main

import (
    "context"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // Stop on SIGTERM (sent by kubelet on pod deletion) or SIGINT.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()
    server := &http.Server{Addr: ":8080"}
    go func() {
        <-ctx.Done()
        // Give in-flight requests up to 30 s to complete.
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        server.Shutdown(shutdownCtx)
    }()
    if err := server.ListenAndServe(); err != http.ErrServerClosed {
        log.Fatalf("server error: %v", err)
    }
}

Include retry parameters for external services (e.g., DB connection retries) instead of relying solely on Init containers.

Image Security

containers:
- name: myapp
  image: registry.example.com/myapp@sha256:a1b2c3d4...
  imagePullPolicy: IfNotPresent
  securityContext:
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]

Monitoring, Logging, and Alerting

Because kubectl logs may lose data after a crash, deploy a log aggregation agent as a DaemonSet (e.g., Fluent Bit or Promtail shipping to Elasticsearch or Loki) that collects container stdout/stderr.

Prometheus Alert Rules (example)

# Pod restarts too often
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
for: 5m
labels:
  severity: warning
annotations:
  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
  description: "Container {{ $labels.container }} has restarted {{ $value }} times in the last 15 minutes."

# OOMKilled detection
alert: ContainerOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
for: 0m
labels:
  severity: critical
annotations:
  summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} was OOMKilled"
  description: "Check memory limits or investigate memory leaks."

# Persistent CrashLoopBackOff
alert: ContainerNotReady
expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
for: 10m
labels:
  severity: critical
annotations:
  summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in CrashLoopBackOff for >10 minutes"
  description: "Immediate investigation required."
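To see why the PodCrashLooping expression multiplies rate() by 60 * 15: rate() is per‑second, so scaling by the 900 s window recovers the restart count (illustrative numbers):

```shell
# rate() ~= counter increase / window; the expr multiplies back by 60 * 15 = 900 s
INCREASE=5   # restarts observed over the window (illustrative)
WINDOW=900   # 15 minutes in seconds
RESTARTS=$((INCREASE * 60 * 15 / WINDOW))
echo "alert value ~= ${RESTARTS} restarts per 15m"
# -> alert value ~= 5 restarts per 15m
```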

Grafana dashboards should display panels for top‑restarting pods, current CrashLoopBackOff pods, OOM event timelines, and containers whose memory usage exceeds 85 % of their limits.

Ephemeral Container Debugging

# Inject a busybox container sharing the target’s PID namespace
kubectl debug <pod-name> -it \
  --image=busybox:1.37 \
  --target=myapp \
  -- /bin/sh
# Inside the shell you can:
# - ps aux (view target processes)
# - ls /proc/1/root/ (inspect target filesystem)
# - nc -z postgres-svc 5432 (test connectivity)

Standard Runbook Checklist

Run kubectl describe pod and note the Exit Code.

Inspect the previous container logs with kubectl logs --previous.

Validate startupProbe and livenessProbe settings.

Check resources.requests and resources.limits; look for OOM events.

Confirm Init containers, ConfigMaps, Secrets, and NetworkPolicies are correct.

Key Takeaways

CrashLoopBackOff is an exponential back‑off restart mechanism; K8s 1.32 can reset the counter after a successful run.

Exit codes map to specific failure domains (e.g., 137 = OOM/SIGKILL, 139 = segmentation fault, 0 = unexpected normal exit). startupProbe is essential for slow‑starting workloads.

Ephemeral containers provide powerful, non‑intrusive debugging without modifying the pod spec.

Set memory limits at 1.5‑2 × the measured request and monitor with Prometheus.

Centralized log aggregation and alerting are mandatory for production reliability.

Tags: debugging, Kubernetes, probe, CrashLoopBackOff, ephemeral-container
Written by

Ops Community

A leading IT operations community where professionals share and grow together.