How a Misconfigured Liveness Probe Crashed a Service – Lessons & Fixes
An overnight outage at a financial firm, caused by a Kubernetes liveness probe that returned 200 before the application was ready, led to heavy losses. This article explains the difference between liveness and readiness probes, shows correct configuration examples, walks through real‑world scenarios and troubleshooting steps, and closes with best‑practice recommendations for avoiding similar failures.
Incident Overview
At 02:00 the monitoring system reported a payment‑service outage. Multiple microservices entered CrashLoopBackOff, causing a loss of over 1 million CNY in 37 minutes. The root cause was a misconfigured liveness probe that returned 200 before the application was fully ready.
Understanding Liveness and Readiness Probes
Liveness Probe
Question: Is the application dead?
Failure consequence: Kubernetes restarts the container.
Typical use‑cases: Detect deadlocks, thread hangs, unrecoverable exceptions.
Key principle: Must be idempotent and have no side effects.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # > longest startup time
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Readiness Probe
Question: Is the application ready to receive traffic?
Failure consequence: Pod is removed from Service endpoints.
Typical use‑cases: Slow start‑up, initialization of resources, external dependencies not ready.
Key principle: Prevent traffic from reaching a pod that cannot serve requests.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 2
One‑line difference: Liveness checks “alive”; readiness checks “can work”.
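The distinction can be made concrete on the application side. Below is a minimal sketch, assuming a Python service that exposes /healthz (liveness: the process is alive) and /ready (readiness: initialization has finished); the in‑memory flag, port, and `initialize` stand‑in are illustrative, not from the original incident:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative in-memory readiness flag: flipped once startup work completes.
ready = threading.Event()

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: if this handler runs at all, the process is alive.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: only 200 after initialization has finished.
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # keep probe traffic out of the logs
        pass

def start_server(port=8080):
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def initialize():
    time.sleep(0.1)  # stand-in for cache warm-up, DB connections, etc.
    ready.set()
```

Until `initialize()` completes, /ready returns 503 and the pod receives no traffic, while /healthz already returns 200, so Kubernetes does not restart it — exactly the split the two probes are meant to express.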
Practical Health‑Check Strategies
Common Probe Types
HTTP GET: Most common for web services; a 200 response indicates success.
Exec: Runs a command inside the container, e.g. cat /tmp/healthy.
TCP Socket: Checks port connectivity, useful for databases, Redis, etc.
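The exec style can be as small as a flag‑file check. Here is a hedged Python equivalent of `cat /tmp/healthy` (the path is the article's example; the script itself is an illustrative sketch):

```python
import os
import sys

def is_healthy(flag_path="/tmp/healthy"):
    # The probe succeeds when the flag file exists; the application
    # creates it once healthy and removes it when it degrades.
    return os.path.exists(flag_path)

if __name__ == "__main__":
    # Exec probes treat exit code 0 as success, any non-zero code as failure.
    sys.exit(0 if is_healthy() else 1)
```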
Real‑World Scenarios and Best Practices
Scenario 1 – Long Startup Dependencies
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 20
  failureThreshold: 10   # give dependent services enough time
When the application needs extra time to start, increase initialDelaySeconds and failureThreshold so the pod is not marked unready prematurely.
Scenario 2 – Over‑eager Liveness Probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 2
Frequent checks with a short timeout can kill a healthy pod; the relaxed settings above give a slow‑starting application room to recover before Kubernetes restarts it.
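The same parameters also bound how long a genuinely dead pod can linger before the kubelet restarts it. A rough sketch of that arithmetic (an approximation assuming the standard behavior — failureThreshold consecutive failed checks, spaced periodSeconds apart, each allowed up to timeoutSeconds — and ignoring check jitter):

```python
def worst_case_detection_seconds(period, failure_threshold, timeout=5):
    # A pod that hangs right after a successful check is restarted only
    # after failure_threshold consecutive failed checks, each spaced
    # `period` seconds apart and each allowed up to `timeout` seconds.
    return period * failure_threshold + timeout

# Scenario 2's values: periodSeconds: 30, failureThreshold: 2
print(worst_case_detection_seconds(30, 2))  # up to roughly 65 seconds
```

Relaxing a probe is therefore a trade‑off: fewer false restarts, but slower detection of real hangs.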
Scenario 3 – Mixed Probe for a Database Service
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 300
readinessProbe:
  exec:
    command:
      - mysql
      - -h127.0.0.1
      - -e
      - "SELECT 1"
First verify the container is alive, then verify the DB can answer queries.
Common Pitfalls & Remedies
Probe too sensitive: Pods restart endlessly – increase failureThreshold and initialDelaySeconds.
Wrong readiness path: Pod never becomes ready – ensure the endpoint returns 200.
Heavy probe logic: Causes CPU spikes – use lightweight checks such as in‑memory flags.
Ignoring startup order: Dependencies not ready – use Init Containers or delayed strategies.
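For the startup‑order pitfall, an Init Container (or the application's own entrypoint) can simply block until a dependency answers. A minimal sketch, assuming a TCP dependency such as a database; the host name, port, and retry timings are illustrative:

```python
import socket
import time

def wait_for_dependency(host, port, retries=30, delay=2.0):
    # Poll until a TCP connection succeeds, the way an Init Container
    # would gate the main container's startup.
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

Run as, e.g., `wait_for_dependency("mysql", 3306)` before starting the main process; exiting non‑zero on failure lets Kubernetes retry the Init Container instead of crash‑looping the application.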
Fast Troubleshooting Checklist
Inspect Probe Events
kubectl describe pod POD_NAME
Look for messages like “Liveness probe failed” or “Readiness probe failed”.
Check Pod Logs
kubectl logs POD_NAME
Verify Pod Ready State
kubectl get pods -o wide
The READY column shows ready containers / total containers.
Conclusion
Liveness and readiness probes are core mechanisms for self‑healing and high availability in Kubernetes. Proper configuration enables automatic recovery, zero‑downtime deployments, traffic protection, and overall system resilience, while a single mis‑configuration can cause catastrophic outages.
Full-Stack DevOps & Kubernetes
Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.