Why Kubernetes Deployments Cause Service Outages and How to Prevent Them
This article explains why a typical Deployment + LoadBalancer setup can experience downtime during updates, analyzes the pod lifecycle, endpoint synchronization, iptables/ipvs and SLB interactions, and provides concrete configuration steps—including readiness probes, preStop hooks, and Service traffic policies—to achieve zero‑downtime deployments.
Why service interruptions occur
Pod creation phase: During a rolling update a new pod is created, reaches the Running state, and is added to the Service's Endpoints. The cloud load balancer (SLB) then routes traffic to the node hosting the pod. If the application inside the pod has not finished initializing, the pod cannot handle requests, causing a brief outage.
Pod deletion phase: When an old pod is terminated, several steps happen asynchronously: the pod is marked Terminating and removed from Endpoints; the preStop hook, if defined, runs; SIGTERM is sent to the container; the kubelet waits up to terminationGracePeriodSeconds; and if the container is still running after that, SIGKILL forcibly stops it. Because these steps overlap, the SLB can still reach the pod after it has stopped processing requests, producing another outage.
iptables/ipvs cleanup: When a pod becomes Terminating, kube-proxy removes its iptables/ipvs entries. The SLB may still consider the node healthy for a few seconds, so traffic can be forwarded to a node that no longer has matching iptables/ipvs rules, resulting in dropped packets.
SLB endpoint updates: The cloud controller watches Endpoints changes and removes the node from the SLB backend. Ongoing long-lived connections are then abruptly closed, causing another interruption.
Mitigation strategies
Pod‑level configuration
Define a readinessProbe so the pod is added to Endpoints only after the application is ready.
Define a livenessProbe to restart unhealthy pods.
Use a preStop hook (e.g., sleep 30) to delay termination and allow in‑flight requests to finish.
Set terminationGracePeriodSeconds longer than the preStop duration (e.g., 60 s).
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 30
    lifecycle:
      preStop:
        exec:
          command:
          - sleep
          - "30"
  terminationGracePeriodSeconds: 60
Service‑level configuration (externalTrafficPolicy)
Cluster mode (default): Every node is added to the SLB backend. This avoids downtime but consumes SLB backend quota, and the client source IP is hidden by SNAT.
Local mode: Only nodes that have ready pods are added to the SLB backend, which preserves the client source IP. To avoid interruption, ensure each node retains at least one running pod during updates (e.g., set maxUnavailable: 0 in the Deployment strategy and use node affinity for in‑place rolling updates).
ENI mode (Alibaba Cloud specific): Pods are attached directly to the SLB backend, bypassing kube‑proxy and eliminating iptables/ipvs‑related outages.
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  externalTrafficPolicy: Cluster
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  externalTrafficPolicy: Local
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/backend-type: "eni"
  name: nginx
spec:
  ports:
  - name: http
    port: 30080
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: LoadBalancer
For Local mode, use a rolling update strategy with maxUnavailable: 0 and node affinity to guarantee at least one running pod per node throughout the upgrade.
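The in‑place rolling update described above can be sketched as a Deployment; this is an illustrative fragment, with the nginx image, run: nginx labels, and probe timings carried over from the earlier examples, and the replica count assumed:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      run: nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a ready pod before its replacement is Ready
      maxSurge: 1         # create the replacement pod first
  template:
    metadata:
      labels:
        run: nginx
    spec:
      terminationGracePeriodSeconds: 60   # longer than the preStop sleep below
      containers:
      - name: nginx
        image: nginx
        readinessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 30
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "30"]   # let in-flight requests drain before SIGTERM
```

With maxUnavailable: 0, an old pod is only removed after its replacement passes the readinessProbe, so under Local mode each node can keep at least one ready pod throughout the upgrade; pinning replacements to the same node (node affinity) is cluster‑specific and omitted here.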
Recommendations
Enable readinessProbe with appropriate initial delay and period for applications with long start‑up times.
Implement a preStop hook and set terminationGracePeriodSeconds to be longer than the hook duration.
Choose the Service mode that matches operational requirements:
Cluster mode for simplicity when SLB quota is not a concern.
Local mode with in‑place rolling updates to preserve source IP and avoid downtime.
ENI mode (Alibaba Cloud) for the most reliable zero‑downtime path.
References
Container lifecycle hooks
Configure Liveness, Readiness and Startup Probes
Access services via load balancer
Kubernetes best practice: graceful termination
Kubernetes community discussion: zero‑downtime deployments with externalTrafficPolicy: Local
Graceful Termination for External Traffic Policy Local
Applying graceful rollout in ACK
Alibaba Cloud Native