Avoid Million‑Dollar Outages: Master Kubernetes Liveness & Readiness Probes
A financial-services outage caused by misconfigured Kubernetes liveness and readiness probes shows how misunderstanding these health checks can trigger costly restart loops. This guide explains their core differences, proper configuration, advanced strategies, common pitfalls, and monitoring techniques for stable, resilient services.
Real‑world incident
A payment service at a financial company entered a "restart‑crash‑restart" loop. The pod kept being killed and recreated for 37 minutes, causing a loss of over one million yuan. The root cause was a liveness probe that started checking before the application had finished its initialization: because the health endpoint was not yet responding, Kubernetes judged the container dead, repeatedly restarted it, and eventually exhausted system resources.
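Failures of exactly this shape are what the startupProbe (stable since Kubernetes 1.20) is meant to prevent: liveness checks are suspended until the startup probe succeeds, so a slow‑initializing application is not killed mid‑boot. A sketch, with illustrative path, port, and thresholds:

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # tolerate up to 30 × 10s = 5 min of startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3    # only enforced after the startup probe succeeds
```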
Understanding Liveness and Readiness Probes
Liveness Probe answers "Is the application running?". If the probe fails, the container is restarted. It is used to detect deadlocks, hangs, or any state that requires a restart to recover. The probe must be idempotent and must not cause data inconsistency when the container restarts.
Readiness Probe answers "Is the application ready to receive traffic?". If the probe fails, the pod is removed from the Service endpoint list, preventing traffic from being sent to an unready instance. It is useful for slow‑starting applications, heavy data loading, or temporary overload.
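On the application side, the two probes are typically backed by separate endpoints: liveness stays cheap and unconditional, while readiness reflects runtime state. A minimal sketch in Python's standard library (the endpoint paths and the `app_ready` flag are illustrative, not a fixed convention):

```python
from http.server import BaseHTTPRequestHandler

app_ready = True  # runtime state, e.g. flipped while reloading a cache

def set_ready(flag):
    """Toggle readiness, e.g. from a config-reload or warm-up hook."""
    global app_ready
    app_ready = flag

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # liveness: answers as long as the process can serve requests at all
            self._reply(200, b"ok")
        elif self.path == "/ready":
            # readiness: a 503 takes the pod out of the Service endpoints
            self._reply(200 if app_ready else 503, b"")
        else:
            self._reply(404, b"")

    def _reply(self, code, body):
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep high-frequency probe traffic out of the logs
```

Failing the readiness endpoint only removes the pod from load balancing; failing the liveness endpoint triggers a restart, so the liveness path should never depend on transient state like a cache reload.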
Basic configuration example
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: web
    image: nginx:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 15  # give the container enough startup time
      periodSeconds: 10
      failureThreshold: 3      # restart after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 1

Practical configuration guide
Golden rules for timing parameters
livenessProbe:
  httpGet:
    path: /liveness
    port: 8080
  initialDelaySeconds: 30  # must be greater than the longest startup time
  periodSeconds: 10        # interval between checks
  timeoutSeconds: 5        # must be less than periodSeconds
  failureThreshold: 3      # avoid restart on transient failures
  successThreshold: 1
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 1      # readiness can be more sensitive
  successThreshold: 1

With the liveness settings above, a hung container is restarted roughly failureThreshold × periodSeconds = 30 seconds after its first failed check, so the total tolerance window is easy to reason about.

Probe types and typical use cases
HTTP GET: most common, suitable for web services.
Exec: runs a command; success is indicated by exit code 0.
TCP Socket: only checks whether the port is open, useful for non‑HTTP services.
# HTTP GET
httpGet:
  path: /health
  port: 8080
  httpHeaders:
  - name: Custom-Header
    value: Awesome

# Exec
exec:
  command:
  - sh
  - -c
  - ps aux | grep myapp | grep -v grep

# TCP Socket
tcpSocket:
  port: 3306

Advanced scenarios
Dependent services startup
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 2
  failureThreshold: 10  # allow longer wait for dependent services

Avoid aggressive liveness
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60  # give ample startup time
  periodSeconds: 30        # lower frequency to reduce pressure
  failureThreshold: 2      # avoid restart on momentary failures

Mixed probes for a database pod
# Database pod example
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 300  # databases need longer startup time
  periodSeconds: 30
readinessProbe:
  exec:
    command:
    - mysql
    - -h127.0.0.1
    - -e
    - "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 10

Common pitfalls and solutions
Pitfall 1: Over‑sensitive liveness – Symptom: pod keeps restarting while logs show the app is fine. Solution: increase failureThreshold and/or increase periodSeconds to avoid restarts on transient glitches.
Pitfall 2: Wrong readiness path – Symptom: pod never becomes ready, service remains inaccessible. Solution: ensure the readiness endpoint returns the correct HTTP status (e.g., 200).
Pitfall 3: Health‑check performance impact – Symptom: the health endpoint degrades overall application performance. Solution: keep the check lightweight; avoid heavy database queries or long computations.
Pitfall 4: Ignoring startup order – Symptom: application fails because dependent services are not ready. Solution: set appropriate initialDelaySeconds or use Init Containers to enforce ordering.
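For Pitfall 4, an Init Container enforces ordering declaratively: the main container does not start until every init container exits successfully. A sketch, assuming the dependency is reachable as a Service named mysql on port 3306 (the names and images are illustrative):

```yaml
spec:
  initContainers:
  - name: wait-for-mysql
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z mysql 3306; do echo waiting; sleep 2; done']
  containers:
  - name: app
    image: my-app:latest  # starts only after wait-for-mysql succeeds
```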
Monitoring and debugging
Inspect pod details and events:
kubectl describe pod <pod-name>
# Pay attention to the Events and Conditions sections
# Common messages:
# - Liveness probe failed: container will be restarted
# - Readiness probe failed: container will be removed from service endpoints

Check pod readiness status:
kubectl get pods -o wide
# READY column shows ready/total containers
# STATUS column shows Running, Waiting, Terminating, etc.

View container logs for deeper insight:
kubectl logs <pod-name> [-c <container-name>]

Conclusion
Liveness and readiness probes are essential mechanisms for reliable Kubernetes deployments. Proper configuration enables automatic recovery and zero‑downtime updates, prevents cascading failures, and supports graceful startup and shutdown, leading to a more stable and resilient service.
Full-Stack DevOps & Kubernetes
Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.