Prevent Massive K8s Outages: Scale, Redundancy, and Embrace Restarts

The article analyzes the November 27 Didi outage caused by an aggressive Kubernetes upgrade, then presents four engineering principles—controlling cluster size, eliminating single points of failure, treating restarts as normal, and decoupling data and control planes—to build more resilient cloud‑native systems.

ITPUB

Background: Didi Kubernetes upgrade failure

On 27 November 2023, a large-scale outage at Didi was traced to an in-place hot upgrade of a very large production Kubernetes cluster. The upgrade jumped eight minor versions in a single step (from 1.12, released in September 2018, to 1.20, released in December 2020) and relied on modified kubelet code intended to avoid pod restarts. Instead, all pods were killed during the process, the new control-plane metadata could not be rolled back, and the cluster remained unavailable for an extended period.
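
For contrast, upstream Kubernetes tooling such as kubeadm does not support skipping minor versions, so a conventional plan steps through each minor release in turn. The sketch below only enumerates those steps; the function names are illustrative and not part of any Kubernetes tool.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Split 'MAJOR.MINOR' (an optional patch part is ignored) into integers."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def upgrade_path(current: str, target: str) -> list[str]:
    """Return the minor versions to step through, one at a time."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current; downgrades are not planned here")
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


path = upgrade_path("1.12", "1.20")
print(len(path), path)  # 8 separate steps, from 1.13 through 1.20
```

Each step in the returned list is a point where the cluster can be validated, and where a rollback only has to undo one minor version.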

Principle 1 – Control scale by using multiple small clusters

When a cluster grows to thousands of nodes, its blast radius grows with it: a single unexpected event can cascade into a cluster-wide failure. Experience from Alibaba's 5,000-node ODPS deployment and from PolarDB's PolarStore shows that keeping each logical resource pool to a few hundred nodes, and adding new pools as demand grows, limits the blast radius and simplifies operations.

Recommended practice:

Define a target size for a Kubernetes cluster (e.g., 200–500 nodes).

When capacity is needed, provision a new cluster rather than expanding the existing one.

Treat each cluster as an independent deployment unit and replicate workloads across clusters (active‑active).
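
The sizing rule above reduces to simple arithmetic. The following is a minimal sketch, assuming a cap of 500 nodes per cluster (the upper end of the example range) and an even spread of nodes; `plan_clusters` is a hypothetical helper, not an existing tool.

```python
import math


def plan_clusters(total_nodes: int, max_nodes_per_cluster: int = 500) -> list[int]:
    """Split total_nodes into the fewest clusters that each stay within the cap,
    spreading nodes as evenly as possible."""
    if total_nodes < 0 or max_nodes_per_cluster <= 0:
        raise ValueError("total_nodes must be >= 0 and the cap must be > 0")
    if total_nodes == 0:
        return []
    count = math.ceil(total_nodes / max_nodes_per_cluster)
    base, extra = divmod(total_nodes, count)
    # The first `extra` clusters take one additional node each.
    return [base + 1 if i < extra else base for i in range(count)]


sizes = plan_clusters(5000)
print(len(sizes), max(sizes))  # 10 clusters, none above 500 nodes
```

With ten equally sized clusters, losing one cluster affects roughly a tenth of the fleet instead of all of it.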

Principle 2 – Treat every cluster as a single point of failure

Even though a Kubernetes cluster provides intra-cluster high availability, its control plane is a single logical failure domain. If the control plane fails or a cluster-wide change goes wrong, as in the Didi incident, every workload in the cluster can become unavailable at once, regardless of replica count. Therefore:

Deploy stateless services and their backing databases in at least two clusters.

Use a multi‑cluster operator (e.g., a custom controller that watches resources in several clusters) to manage database provisioning, scaling, and failover.

Design network connectivity (load balancers, underlay IPs, VPC routing) to route traffic between clusters without coupling the application logic to a specific cluster.
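
One way to picture cluster-agnostic routing is a weight table that shifts traffic to the surviving clusters when one is marked unhealthy. This is a rough sketch under stated assumptions: `ClusterEndpoint`, the cluster names, and the addresses (taken from documentation IP ranges) are hypothetical, and the health flag would in practice come from an external health checker.

```python
from dataclasses import dataclass


@dataclass
class ClusterEndpoint:
    name: str
    address: str        # hypothetical load-balancer address for the cluster
    weight: int         # relative share of traffic when healthy
    healthy: bool = True


def traffic_shares(endpoints: list[ClusterEndpoint]) -> dict[str, float]:
    """Distribute traffic across healthy clusters in proportion to weight."""
    healthy = [e for e in endpoints if e.healthy and e.weight > 0]
    total = sum(e.weight for e in healthy)
    if total == 0:
        raise RuntimeError("no healthy cluster available")
    return {e.name: e.weight / total for e in healthy}


clusters = [
    ClusterEndpoint("cluster-a", "192.0.2.10", weight=1),
    ClusterEndpoint("cluster-b", "198.51.100.10", weight=1),
]
print(traffic_shares(clusters))   # both clusters share traffic evenly

clusters[0].healthy = False       # simulate losing cluster-a entirely
print(traffic_shares(clusters))   # cluster-b now takes all traffic
```

The application never names a cluster; only the routing layer knows which clusters exist and which are currently serving.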

Principle 3 – Embrace pod restarts and migrations as normal

Kubernetes upgrades inevitably evict pods, restart them on newer nodes, and may move them across node groups. The “cattle” mindset treats these events as expected and automates handling.

Typical upgrade flow (rolling upgrade):

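# Drain a node safely: evict its pods, skip DaemonSet-managed pods,
# and allow eviction of pods that use emptyDir volumes (their local data is lost)
kubectl drain node1 \
  --ignore-daemonsets \
  --delete-emptydir-data
# Upgrade the node OS and kubelet; the control plane (kube-apiserver) must
# already be at the target version before worker nodes are upgraded
# After upgrade, bring the node back
kubectl uncordon node1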

Blue‑green upgrade (e.g., on AWS EKS): create a new node group with the target version, shift pods to the new group, then delete the old group. Both methods require:

Automation that detects terminated pods and re-creates them (for example, workload controllers such as Deployments and StatefulSets).

Rollback scripts that can revert to the previous version if health checks fail.

Regular chaos or upgrade rehearsals to verify that the system remains functional during restarts.
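
The requirements above can be combined into a small orchestration loop: upgrade one node, check health, and revert everything done so far on the first failure. This is a sketch only; the `upgrade`, `is_healthy`, and `rollback` callables are placeholders for whatever tooling actually drains, upgrades, and verifies a node.

```python
from typing import Callable, Iterable


def rolling_upgrade(
    nodes: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> tuple[list[str], list[str]]:
    """Upgrade nodes one at a time. On the first failed health check, roll back
    every node upgraded so far (most recent first) and stop.

    Returns (upgraded_nodes, rolled_back_nodes)."""
    upgraded: list[str] = []
    for node in nodes:
        upgrade(node)
        upgraded.append(node)
        if not is_healthy(node):
            rolled_back = list(reversed(upgraded))
            for done in rolled_back:
                rollback(done)
            return [], rolled_back
    return upgraded, []


# Rehearsal with an in-memory stand-in for real nodes: node3 fails its check.
versions = {"node1": "old", "node2": "old", "node3": "old", "node4": "old"}
result = rolling_upgrade(
    list(versions),
    upgrade=lambda n: versions.__setitem__(n, "new"),
    is_healthy=lambda n: n != "node3",
    rollback=lambda n: versions.__setitem__(n, "old"),
)
print(result)    # nothing stays upgraded; node3, node2, node1 are rolled back
print(versions)  # every node is back on "old"; node4 was never touched
```

Running the same loop against fakes like this is a cheap form of upgrade rehearsal: the rollback path is exercised before it is ever needed in production.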

Principle 4 – Decouple data‑plane availability from control‑plane availability

PolarStore (the storage layer of PolarDB) separates metadata management (control plane) from actual read/write operations (data plane). The data plane caches full metadata locally; control‑plane actions such as volume creation, resizing, or node‑failure‑driven data migration update the cache asynchronously. Consequently, even if the control plane is down, reads and writes continue.
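
The idea can be illustrated with a toy model. It is not PolarStore's actual design: the class names are hypothetical, and the cache is refreshed by polling here, whereas the text describes asynchronous updates driven by the control plane. The point it shows is that lookups are served from the local cache and never call the control plane.

```python
class ControlPlaneUnavailable(Exception):
    pass


class ControlPlane:
    """Hypothetical metadata service mapping volume name -> storage node."""

    def __init__(self):
        self.volumes = {}
        self.available = True

    def snapshot(self):
        if not self.available:
            raise ControlPlaneUnavailable("control plane is down")
        return dict(self.volumes)


class DataPlaneNode:
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.cache = {}  # full local copy of the metadata

    def refresh(self) -> bool:
        """Pull a fresh metadata copy; keep the old cache if the pull fails."""
        try:
            self.cache = self.control_plane.snapshot()
            return True
        except ControlPlaneUnavailable:
            return False

    def locate(self, volume):
        """Resolve a volume from the local cache only."""
        return self.cache[volume]


cp = ControlPlane()
cp.volumes["vol-a"] = "store-1"
node = DataPlaneNode(cp)
node.refresh()

cp.available = False          # simulate a control-plane outage
print(node.refresh())         # False: the refresh fails, the cache is kept
print(node.locate("vol-a"))   # store-1: the read path still works
```

The trade-off is staleness: while the control plane is down, new volumes cannot be created or resized, but existing volumes remain readable and writable.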

In contrast, a tightly coupled design where master nodes handle both control tasks and data allocation creates a single point of failure: if a master crashes, the cluster cannot allocate new storage chunks, leading to a write outage. Mitigations for such designs include increasing master replica count (e.g., from three to five) and sharding master responsibilities, but the fundamental risk remains.
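
To see why adding master replicas helps only partially, assume the masters use majority-quorum consensus (an assumption; the text does not say which protocol is used). The arithmetic is then:

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica group."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority still remains."""
    return replicas - quorum_size(replicas)


for n in (3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 replicas: quorum of 2, tolerates 1 failure
# 5 replicas: quorum of 3, tolerates 2 failures
```

Going from three to five masters raises the tolerated failures from one to two, but a bug or bad change that hits all masters at once still stops chunk allocation.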

Key takeaway: design storage and other stateful services so that the data‑plane can operate independently of the control‑plane, or otherwise accept the cascading impact of control‑plane outages.

Conclusion

Applying these four principles—controlling cluster size, treating each cluster as a failure domain, designing for pod restarts, and decoupling data‑plane from control‑plane—reduces the probability and impact of large‑scale incidents. They are derived from a decade of experience building cloud services such as RDS and PolarDB and are directly applicable to modern Kubernetes deployments.
