Achieving Zero‑Downtime Applications with Kubernetes
This article explains why and how to use Kubernetes features such as multiple pod replicas, PodDisruptionBudgets, deployment strategies, health probes, graceful termination, anti‑affinity, resource limits, and autoscaling to build zero‑downtime, highly available applications.
Container Image Location
If you have been using Docker for a while, pulling and using container images seems simple, but in production you often do not want to rely on remote, uncontrolled image registries for reasons such as registry disappearance, deleted tags, mutable images, and security compliance.
One solution is to sync container images from the source registry to your own registry.
Pod Count (Application Instances)
For high availability, your application needs at least two Kubernetes replicas (two Pods). Example deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2 # tells deployment to run 2 pods matching the template
  template:
    ...
A common mistake is to assume that a single instance is enough because Kubernetes performs rolling updates. Rolling updates only protect you during Deployment updates; node loss or resource exhaustion still causes downtime unless multiple instances are running.
Pod Disruption Budget
A PodDisruptionBudget (PDB) limits how many Pods of an application may be unavailable at once during voluntary disruptions such as node drains for maintenance, so the application stays available while some Pods are evicted.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
Deployment Strategies
Kubernetes supports two deployment strategies: RollingUpdate (default) and Recreate. RollingUpdate can be tuned with maxUnavailable and maxSurge to control rollout speed under heavy traffic.
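As a minimal sketch of that tuning, the manifest below sets both fields on a Deployment. The name my-app, the image reference, and the numeric values are illustrative assumptions, not recommendations from the original text.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 4                  # illustrative value
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # never drop below the desired replica count during a rollout
      maxSurge: 1              # create at most one Pod above the desired count at a time
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0   # hypothetical image
```

With maxUnavailable set to 0, the rollout trades speed for capacity: each old Pod is removed only after a new one becomes ready. Raising maxSurge speeds the rollout up at the cost of temporarily using more cluster resources.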
Automatic Rollback
Automatic rollback is not built‑in; it requires third‑party tools like Helm, ArgoCD, or Spinnaker. Properly configured probes keep a failing Pod from being exposed to traffic and give those tools the failure signal they need to trigger a rollback.
Probes
Liveness probes check that a container is still healthy; when one fails, the kubelet restarts the container. Readiness probes determine whether a Pod should receive traffic. Probes that exercise the application itself (e.g., an HTTP health endpoint) are often more reliable than simple TCP checks.
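A minimal sketch of both probes on one container might look like the fragment below. The /healthz and /ready paths, port 8080, and the timing values are assumptions chosen for illustration; use whatever endpoints your application actually exposes.

```yaml
# Container-level fragment; it belongs under spec.template.spec.containers[] in a Deployment.
livenessProbe:
  httpGet:
    path: /healthz       # hypothetical endpoint
    port: 8080           # hypothetical port
  periodSeconds: 10
  failureThreshold: 3    # restart the container after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready         # hypothetical endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 2    # stop routing Service traffic to the Pod after 2 consecutive failures
```

Keeping the two endpoints separate lets the readiness check include dependencies (a database connection, a warm cache) without causing container restarts when those dependencies are briefly unavailable.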
Initial Startup Delay
Applications with heavy startup cost may need an increased initialDelaySeconds for liveness probes:
livenessProbe:
  initialDelaySeconds: 60
  httpGet:
    ...
Graceful Termination (terminationGracePeriodSeconds)
Graceful termination only works if the application handles SIGTERM; otherwise the process is killed abruptly, potentially causing data loss or poor user experience.
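On the Kubernetes side, the grace period is a Pod-level setting. The fragment below is a sketch with an assumed value of 60 seconds and a hypothetical container; the right value depends on how long your application needs to finish in-flight work.

```yaml
# Pod-level fragment; it belongs under spec.template.spec in a Deployment.
# On termination the container receives SIGTERM, then SIGKILL if it is still
# running when the grace period expires.
terminationGracePeriodSeconds: 60   # the default is 30; 60 is an illustrative value
containers:
  - name: my-app                    # hypothetical container
    image: registry.example.com/my-app:1.0.0   # hypothetical image
```

Raising the grace period only helps if the application uses the time: on SIGTERM it should stop accepting new work, finish in-flight requests, and then exit.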
Pod Anti‑Affinity
Pod anti‑affinity prevents multiple instances of the same application from being scheduled on the same node, reducing the risk of simultaneous failure.
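For the node-level case described here, a minimal sketch could use the kubernetes.io/hostname topology key. The app: my-app label is an assumption; it must match the labels your Pods actually carry.

```yaml
# Pod-level fragment; it belongs under spec.template.spec in a Deployment.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app                       # hypothetical label on the application's Pods
        topologyKey: kubernetes.io/hostname   # at most one matching Pod per node
```

Because this rule is required, Pods stay Pending when there are more replicas than eligible nodes; preferredDuringSchedulingIgnoredDuringExecution is the softer alternative. The fuller example that follows combines a required affinity rule and a preferred anti-affinity rule at the zone level.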
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: security
          operator: In
          values:
          - S1
      topologyKey: topology.kubernetes.io/zone
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S2
        topologyKey: topology.kubernetes.io/zone
Resources
Insufficient memory leads to OOM kills; insufficient CPU can cause slow responses or prevent readiness probes from succeeding.
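A sketch of explicit requests and limits on a container is shown below. All values are placeholders, not sizing advice; derive real numbers from observed usage of your application.

```yaml
# Container-level fragment; it belongs under spec.template.spec.containers[] in a Deployment.
resources:
  requests:
    cpu: 250m        # used by the scheduler to place the Pod; illustrative value
    memory: 256Mi    # illustrative value
  limits:
    cpu: "1"         # the container is throttled above this; illustrative value
    memory: 512Mi    # the container is OOM-killed above this; illustrative value
```

Requests determine where the Pod can be scheduled, while limits cap what the container may consume at runtime, so a memory limit set too close to normal usage leads directly to the OOM kills described above.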
Autoscaling
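Horizontal Pod Autoscaling (HPA) adds Pods based on CPU utilization (or custom metrics) to handle traffic spikes. It only works as intended when it is correctly configured: CPU utilization is computed against the Pods' CPU requests, so those requests must be set, and minReplicas and maxReplicas must bound the scaling range sensibly.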
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  ...
  minReplicas: 1 # lets the HPA scale down to a single Pod; use 2 or more for high availability
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
Conclusion
Kubernetes can provide zero‑downtime deployments when applications are cloud‑native and properly configured. Key practices include running at least two instances, adding health probes, handling SIGTERM, configuring autoscaling, allocating sufficient resources, using pod anti‑affinity, and adding a PodDisruptionBudget.
DevOps Cloud Academy
Exploring industry DevOps practices and technical expertise.