Why Kubernetes Deployments Cause Service Outages and How to Prevent Them
This article explains why a typical Deployment + LoadBalancer setup can experience downtime during updates, analyzes the pod lifecycle, endpoint synchronization, iptables/ipvs and SLB interactions, and provides concrete configuration steps—including readiness probes, preStop hooks, and Service traffic policies—to achieve zero‑downtime deployments.
Why service interruptions occur
Pod creation phase: During a rolling update a new pod is created, reaches the Running state, and is added to the Service's Endpoints. The cloud load balancer (SLB) then routes traffic to the node hosting the pod. If the application inside the pod has not finished initializing, the pod cannot handle requests, causing a brief outage.
Pod deletion phase: When an old pod is terminated, several steps happen asynchronously: the pod is marked Terminating and removed from Endpoints; the preStop hook, if defined, runs; SIGTERM is sent to the container; the kubelet waits up to terminationGracePeriodSeconds; and if the container is still running after that, SIGKILL forcibly stops it. Because these steps overlap, the SLB can still reach the pod after it has stopped processing requests, producing another outage.
iptables/ipvs cleanup: When a pod becomes Terminating, kube-proxy removes its iptables/ipvs entries. The SLB may still consider the node healthy for a few seconds, so traffic can be forwarded to a node that no longer has matching iptables/ipvs rules, resulting in dropped packets.
SLB endpoint updates: The cloud controller watches Endpoints changes and removes the node from the SLB backend. Ongoing long-lived connections are then abruptly closed, causing another interruption.
Mitigation strategies
Pod‑level configuration
Define a readinessProbe so the pod is added to Endpoints only after the application is ready.
Define a livenessProbe to restart unhealthy pods.
Use a preStop hook (e.g., sleep 30) to delay termination and allow in‑flight requests to finish.
Set terminationGracePeriodSeconds longer than the preStop duration (e.g., 60 s).
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 30
    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "30"
  terminationGracePeriodSeconds: 60
Service‑level configuration (externalTrafficPolicy)
Cluster mode (default): Every node is added to the SLB backend. This avoids downtime but consumes SLB backend quota, and the client source IP is hidden by SNAT.
Local mode: Only nodes that have ready pods are added to the SLB backend, which preserves the client source IP. To avoid interruption, ensure each node retains at least one running pod during updates (e.g., set maxUnavailable: 0 in the Deployment strategy and use node affinity for in‑place rolling updates).
ENI mode (Alibaba Cloud specific): Pods are attached directly to the SLB backend, bypassing kube‑proxy and eliminating iptables/ipvs‑related outages.
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  externalTrafficPolicy: Cluster
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/backend-type: "eni"
  name: nginx
spec:
  ports:
  - name: http
    port: 30080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
For Local mode, use a rolling update strategy with maxUnavailable: 0 and node affinity to guarantee at least one running pod per node throughout the upgrade.
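The in‑place rolling update described above can be sketched as a Deployment; this is an illustrative fragment, with the nginx image, run: nginx labels, and probe timings carried over from the earlier examples, and the replica count assumed:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      run: nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a ready pod before its replacement is Ready
      maxSurge: 1         # create the replacement pod first
  template:
    metadata:
      labels:
        run: nginx
    spec:
      terminationGracePeriodSeconds: 60   # longer than the preStop sleep below
      containers:
      - name: nginx
        image: nginx
        readinessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 30
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "30"]   # let in-flight requests drain before SIGTERM
```

With maxUnavailable: 0, an old pod is only removed after its replacement passes the readinessProbe, so under Local mode each node can keep at least one ready pod throughout the upgrade; pinning replacements to the same node (node affinity) is cluster‑specific and omitted here.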
Recommendations
Enable readinessProbe with appropriate initial delay and period for applications with long start‑up times.
Implement a preStop hook and set terminationGracePeriodSeconds to be longer than the hook duration.
Choose the Service mode that matches operational requirements:
Cluster mode for simplicity when SLB quota is not a concern.
Local mode with in‑place rolling updates to preserve source IP and avoid downtime.
ENI mode (Alibaba Cloud) for the most reliable zero‑downtime path.
References
Container lifecycle hooks
Configure Liveness, Readiness and Startup Probes
Access services via load balancer
Kubernetes best practice: graceful termination
Kubernetes community discussion: zero‑downtime deployments with externalTrafficPolicy: Local
Graceful Termination for External Traffic Policy Local
Applying graceful rollout in ACK
Alibaba Cloud Native