
Avoid Million‑Dollar Outages: Master Kubernetes Liveness & Readiness Probes

A financial-services outage caused by misconfigured Kubernetes liveness and readiness probes shows how misunderstanding these health checks can trigger costly restart loops. This guide explains their core differences, proper configuration, advanced strategies, common pitfalls, and monitoring techniques for stable, resilient services.


Real‑world incident

A payment service at a financial company entered a "restart‑crash‑restart" loop. The pod was killed and recreated continuously for 37 minutes, causing a loss of over one million yuan. The root cause was a liveness probe that began checking before the application had finished initializing: the probe failed, Kubernetes judged the container dead and restarted it, and every restart repeated the cycle until system resources were exhausted.

Understanding Liveness and Readiness Probes

Liveness Probe answers "Is the application still running?". If the probe fails, the container is restarted. It is used to detect deadlocks, hangs, or any state that only a restart can recover from. The check must be safe to run repeatedly, and a restart triggered by it must not cause data inconsistency.

Readiness Probe answers "Is the application ready to receive traffic?". If the probe fails, the pod is removed from the Service endpoint list, preventing traffic from being sent to an unready instance. It is useful for slow‑starting applications, heavy data loading, or temporary overload.

Basic configuration example

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: web
    image: nginx:latest       # example image; /healthz and /ready stand in for endpoints your app exposes
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 15   # give the container enough startup time
      periodSeconds: 10
      failureThreshold: 3       # restart after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      failureThreshold: 1

Practical configuration guide

Golden rules for timing parameters

livenessProbe:
  httpGet:
    path: /liveness
    port: 8080
  initialDelaySeconds: 30   # must be greater than the longest startup time
  periodSeconds: 10          # interval between checks
  timeoutSeconds: 5          # must be less than periodSeconds
  failureThreshold: 3       # avoid restart on transient failures
  successThreshold: 1

readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 1       # readiness can be more sensitive
  successThreshold: 1
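
A useful sanity check on these numbers: with the liveness settings above, a container that hangs after startup is restarted only after roughly failureThreshold × periodSeconds = 3 × 10 s = 30 s of consecutive failures. Size these values so that detection window is acceptable for your service.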

Probe types and typical use cases

HTTP GET (httpGet): the most common type, suitable for web services.

Exec (exec): runs a command inside the container; exit code 0 indicates success.

TCP Socket (tcpSocket): only checks whether the port accepts connections; useful for non‑HTTP services.

# HTTP GET probe with a custom request header
httpGet:
  path: /health
  port: 8080
  httpHeaders:
  - name: Custom-Header
    value: Awesome

# Exec probe: success means the command exited with code 0
exec:
  command:
  - sh
  - -c
  - ps aux | grep myapp | grep -v grep

# TCP socket probe: succeeds if the port accepts a connection
tcpSocket:
  port: 3306

Advanced scenarios

Dependent services startup

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 2
  failureThreshold: 10   # allow longer wait for dependent services
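
With these values a starting pod is re-checked every 2 s, so it joins the Service endpoints quickly once its dependencies come up, while an already-ready pod is removed from the endpoints only after failureThreshold × periodSeconds = 10 × 2 s = 20 s of consecutive failures.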

Avoid aggressive liveness

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60   # give ample startup time
  periodSeconds: 30          # lower frequency to reduce pressure
  failureThreshold: 2        # avoid restart on momentary failures
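
The arithmetic works out as follows: after the 60 s initial delay, a restart is triggered only after failureThreshold × periodSeconds = 2 × 30 s = 60 s of consecutive failures, so a single slow or dropped response never restarts the container.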

Mixed probes for a database pod

# Database pod example
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 300   # databases need longer startup time
  periodSeconds: 30

readinessProbe:
  exec:
    command:
    - mysql
    - -h127.0.0.1
    - -e
    - "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 10

Common pitfalls and solutions

Pitfall 1: Over‑sensitive liveness – Symptom: pod keeps restarting while logs show the app is fine. Solution: increase failureThreshold and/or increase periodSeconds to avoid restarts on transient glitches.

Pitfall 2: Wrong readiness path – Symptom: pod never becomes ready, service remains inaccessible. Solution: ensure the readiness endpoint returns the correct HTTP status (e.g., 200).
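
A quick way to verify the endpoint is to call it from inside the pod. This is a minimal check assuming curl is present in the image and the app listens on 8080; adjust the port and path to your service:

kubectl exec <pod-name> -- curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/ready
# expect 200; any other status keeps the pod out of the Service endpoints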

Pitfall 3: Health‑check performance impact – Symptom: the health endpoint degrades overall application performance. Solution: keep the check lightweight, e.g., return a cached status instead of running heavy database queries or long computations on every probe.

Pitfall 4: Ignoring startup order – Symptom: the application fails because dependent services are not ready yet. Solution: set an appropriate initialDelaySeconds or use Init Containers to enforce ordering, as sketched below.
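
As a minimal sketch of the Init Container approach: the pod below blocks until a dependency accepts TCP connections before the main container starts. The Service name "db" and port 3306 are placeholders for your own dependency.

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    # Loop until the dependency is reachable; the pod's main
    # containers do not start until this command exits 0
    command: ['sh', '-c', 'until nc -z db 3306; do echo waiting for db; sleep 2; done']
  containers:
  - name: web
    image: nginx:latest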

Monitoring and debugging

Inspect pod details and events:

kubectl describe pod <pod-name>
# Pay attention to the Events and Conditions sections
# Common messages:
# - Liveness probe failed: container will be restarted
# - Readiness probe failed: container will be removed from service endpoints

Check pod readiness status:

kubectl get pods -o wide
# READY column shows ready/total containers
# STATUS column shows Running, Pending, CrashLoopBackOff, etc.

View container logs for deeper insight:

kubectl logs <pod-name> [-c <container-name>]
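
Probe failures also surface as events, so filtering them directly can be quicker than scanning the full describe output:

kubectl get events --field-selector involvedObject.name=<pod-name>
# look for "Unhealthy" events emitted when liveness/readiness probes fail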

Conclusion

Liveness and readiness probes are essential for reliable Kubernetes deployments. Configured properly, they provide automatic recovery and zero‑downtime rolling updates, prevent cascading failures, and support graceful startup and shutdown, resulting in a more stable and resilient service.

Kubernetes · Health Check · Readiness Probe · Liveness Probe
Written by

Full-Stack DevOps & Kubernetes

Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
