How a Misconfigured Liveness Probe Crashed a Service – Lessons & Fixes
An overnight outage at a financial firm, caused by a Kubernetes liveness probe that returned 200 before the application was ready, led to heavy losses. This article explains the difference between liveness and readiness probes, shows correct configuration examples, walks through real‑world scenarios and troubleshooting steps, and closes with best‑practice recommendations for avoiding similar failures.
Incident Overview
At 02:00 the monitoring system reported a payment‑service outage. Multiple microservices entered CrashLoopBackOff, causing a loss of over 1 million CNY in 37 minutes. The root cause was a misconfigured liveness probe that returned 200 before the application was fully ready.
Understanding Liveness and Readiness Probes
Liveness Probe
Question: Is the application dead?
Failure consequence: Kubernetes restarts the container.
Typical use‑cases: Detect deadlocks, thread hangs, unrecoverable exceptions.
Key principle: Must be idempotent and have no side effects.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # > longest startup time
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Readiness Probe
Question: Is the application ready to receive traffic?
Failure consequence: Pod is removed from Service endpoints.
Typical use‑cases: Slow start‑up, initialization of resources, external dependencies not ready.
Key principle: Prevent traffic from reaching a pod that cannot serve requests.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 2
One‑line difference: Liveness checks “alive”; readiness checks “can work”.
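The distinction can be made concrete on the application side. Below is a minimal sketch, assuming a Python service that exposes /healthz (liveness: the process is alive) and /ready (readiness: initialization has finished); the in‑memory flag, port, and `initialize` stand‑in are illustrative, not from the original incident:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative in-memory readiness flag: flipped once startup work completes.
ready = threading.Event()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: if this handler runs at all, the process is alive.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: only 200 after initialization has finished.
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # keep probe traffic out of the logs
        pass

def start_server(port=8080):
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def initialize():
    time.sleep(0.1)  # stand-in for cache warm-up, DB connections, etc.
    ready.set()
```

Until `initialize()` completes, /ready returns 503 and the pod receives no traffic, while /healthz already returns 200, so Kubernetes does not restart it — exactly the split the two probes are meant to express.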
Practical Health‑Check Strategies
Common Probe Types
HTTP GET: Most common for web services; a 200 response indicates success.
Exec: Runs a command inside the container, e.g. cat /tmp/healthy.
TCP Socket: Checks port connectivity, useful for databases, Redis, etc.
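The exec style can be as small as a flag‑file check. Here is a hedged Python equivalent of `cat /tmp/healthy` (the path is the article's example; the script itself is an illustrative sketch):

```python
import os
import sys

def is_healthy(flag_path="/tmp/healthy"):
    # The probe succeeds when the flag file exists; the application
    # creates it once healthy and removes it when it degrades.
    return os.path.exists(flag_path)

if __name__ == "__main__":
    # Exec probes treat exit code 0 as success, any non-zero code as failure.
    sys.exit(0 if is_healthy() else 1)
```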
Real‑World Scenarios and Best Practices
Scenario 1 – Long Startup Dependencies
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 20
  failureThreshold: 10   # give dependent services enough time
When the application needs extra time to start, increase initialDelaySeconds and failureThreshold so the pod is not marked unready prematurely.
Scenario 2 – Over‑eager Liveness Probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 2
Frequent checks with a short timeout can kill a healthy pod; the relaxed settings above give a slow‑starting application room to recover before Kubernetes restarts it.
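The same parameters also bound how long a genuinely dead pod can linger before the kubelet restarts it. A rough sketch of that arithmetic (an approximation assuming the standard behavior — failureThreshold consecutive failed checks, spaced periodSeconds apart, each allowed up to timeoutSeconds — and ignoring check jitter):

```python
def worst_case_detection_seconds(period, failure_threshold, timeout=5):
    # A pod that hangs right after a successful check is restarted only
    # after failure_threshold consecutive failed checks, each spaced
    # `period` seconds apart and each allowed up to `timeout` seconds.
    return period * failure_threshold + timeout

# Scenario 2's values: periodSeconds: 30, failureThreshold: 2
print(worst_case_detection_seconds(30, 2))  # up to roughly 65 seconds
```

Relaxing a probe is therefore a trade‑off: fewer false restarts, but slower detection of real hangs.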
Scenario 3 – Mixed Probe for a Database Service
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 300
readinessProbe:
  exec:
    command:
      - mysql
      - -h127.0.0.1
      - -e
      - "SELECT 1"
First verify the container is alive, then verify the DB can answer queries.
Common Pitfalls & Remedies
Probe too sensitive: Pods restart endlessly – increase failureThreshold and initialDelaySeconds.
Wrong readiness path: Pod never becomes ready – ensure the endpoint returns 200.
Heavy probe logic: Causes CPU spikes – use lightweight checks such as in‑memory flags.
Ignoring startup order: Dependencies not ready – use Init Containers or delayed strategies.
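For the startup‑order pitfall, an Init Container (or the application's own entrypoint) can simply block until a dependency answers. A minimal sketch, assuming a TCP dependency such as a database; the host name, port, and retry timings are illustrative:

```python
import socket
import time

def wait_for_dependency(host, port, retries=30, delay=2.0):
    # Poll until a TCP connection succeeds, the way an Init Container
    # would gate the main container's startup.
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

Run as, e.g., `wait_for_dependency("mysql", 3306)` before starting the main process; exiting non‑zero on failure lets Kubernetes retry the Init Container instead of crash‑looping the application.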
Fast Troubleshooting Checklist
Inspect Probe Events
kubectl describe pod POD_NAME
Look for messages like “Liveness probe failed” or “Readiness probe failed”.
Check Pod Logs
kubectl logs POD_NAME
Verify Pod Ready State
kubectl get pods -o wide
The READY column shows ready containers / total containers.
Conclusion
Liveness and readiness probes are core mechanisms for self‑healing and high availability in Kubernetes. Proper configuration enables automatic recovery, zero‑downtime deployments, traffic protection, and overall system resilience, while a single mis‑configuration can cause catastrophic outages.
Full-Stack DevOps & Kubernetes
Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.