
What Went Wrong in Didi’s 12‑Hour Outage? Lessons on Kubernetes Upgrades and Cost‑Cutting

An in‑depth review of Didi’s 12‑hour P0 outage reveals how a mistaken Kubernetes version downgrade during an in‑place upgrade caused master node failure, discusses cluster isolation, upgrade strategies, and the role of cost‑cutting pressures, offering practical lessons for large‑scale operations.

Su San Talks Tech

Dec 6, 2023


Hello everyone, I am Su San.

A few weeks ago Didi experienced a P0 incident that disrupted services for 12 hours.

Fault Review

Online rumors claim that during a Kubernetes upgrade the operations team planned to upgrade from version 1.12 to 1.20 but mistakenly selected the wrong version, resulting in a downgrade of the cluster.

The Didi technical blog outlines their upgrade plan:

Upgrade Method

To reduce upgrade cost, Didi chose an in‑place upgrade: first the master, then the worker nodes. The official Kubernetes architecture diagram (not reproduced here) shows how the two relate.

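The master (the control plane, in current Kubernetes terminology) includes three key components, alongside the etcd datastore:

kube‑controller‑manager: runs the controllers that reconcile the cluster’s actual state with its desired state;

kube‑apiserver: exposes the Kubernetes API, through which worker nodes register and report their status;

kube‑scheduler: assigns Pods to nodes.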

Only after a node successfully registers with the kube‑apiserver can it run Pods. In Didi’s case, the nodes were to re‑register gradually once the master had been upgraded, a process that should be quick and invisible to users if rehearsed properly.
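As a rough illustration of checking that nodes came back after a master upgrade, here is a minimal sketch. It assumes the input is the parsed JSON produced by `kubectl get nodes -o json`; the function name `not_ready_nodes` is hypothetical and is not taken from Didi's tooling.

```python
from typing import Any, Dict, List


def not_ready_nodes(node_list: Dict[str, Any]) -> List[str]:
    """Return the names of nodes whose Ready condition is not "True".

    Assumption: node_list is the parsed output of
    `kubectl get nodes -o json` (a NodeList object).
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        # A node counts as healthy only if it reports Ready == "True".
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names
```

An upgrade script could call this after each batch of nodes and pause the rollout as soon as the returned list is non-empty, instead of continuing while nodes are failing to register.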

However, when the master was mistakenly downgraded instead, the kube‑apiserver reportedly ended up in a corrupted ("polluted") state and node registration failed. The nodes were marked unhealthy, their Pods were killed, and the services went down.
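If the rumor is accurate, a simple pre‑flight check could have rejected the wrong version before anything was touched. The sketch below is only an illustration under that assumption: `parse_version` and `check_upgrade` are hypothetical helper names, and the version strings are plain text supplied by the operator, not values read from a real cluster.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'v1.20.4', '1.20.4' or '1.20' into (major, minor, patch)."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)(?:\.(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def check_upgrade(current: str, target: str) -> str:
    """Return 'upgrade' or 'noop'; raise ValueError on a downgrade."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt < cur:
        # Tuples compare element by element, so (1, 12, 0) < (1, 20, 0).
        raise ValueError(f"refusing downgrade: {current} -> {target}")
    return "noop" if tgt == cur else "upgrade"
```

With this guard, `check_upgrade("1.12.0", "1.20.0")` returns `"upgrade"`, while swapping the two arguments raises an error rather than proceeding silently.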

Cluster Isolation

This incident sparked discussion about Kubernetes cluster isolation. Multiple business lines (e.g., ride‑hailing, bike‑sharing) were running on the same cluster, which suggests the cluster had grown well past the community‑recommended limit of 5,000 nodes per cluster.

In the early growth stage, teams often place many services in a single cluster for rapid rollout. They may later consider splitting it, then abandon the plan because the existing cluster still supports the load.

Splitting into multiple clusters offers clear benefits: business isolation, fault isolation, and increased reliability. For example, one could pilot an upgrade on a less critical, low‑traffic cluster before rolling it out to others.
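The pilot‑first idea can be sketched as a loop that upgrades the least critical cluster first and stops at the first failure. Everything here is illustrative: the `Cluster` type, its `criticality` score, and the `upgrade` callback are assumptions made for the example, not part of any real tool.

```python
from typing import Callable, Iterable, List, NamedTuple


class Cluster(NamedTuple):
    name: str
    criticality: int  # hypothetical score: lower means safer to upgrade first


def staged_rollout(
    clusters: Iterable[Cluster],
    upgrade: Callable[[Cluster], bool],
) -> List[str]:
    """Upgrade clusters from least to most critical; halt on first failure.

    Returns the names of the clusters that were upgraded successfully.
    """
    done = []
    for cluster in sorted(clusters, key=lambda c: c.criticality):
        if not upgrade(cluster):
            break  # leave the more critical clusters untouched
        done.append(cluster.name)
    return done
```

If the upgrade fails on a low‑traffic pilot cluster, the loop exits before the critical clusters are ever touched, which is exactly the fault isolation that a single shared cluster cannot provide.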

The downside is increased operational complexity and cost.

Upgrade Plan

Having participated in large‑scale platform refactoring, I have rarely seen teams choose an in‑place upgrade; architects tend to prefer a more thorough rewrite. Their main considerations are:

In‑place upgrades are less disruptive than complete rewrites, but they may not achieve the same depth of improvement;

Minimizing business impact usually involves gradual, gray‑release traffic shifting;

Demonstrating a successful upgrade can showcase the team’s productivity.

For a company the size of Didi, the operations team likely validates any upgrade strategy repeatedly; if the correct version is chosen, in‑place upgrades can be safe.

Cost Reduction and Efficiency

Many speculated that the outage was caused by cost‑cutting measures that replaced senior operations staff with less experienced personnel.

Didi has indeed reduced staff in recent years, but nothing indicates that this was the direct cause of the incident.

During rapid growth, heavy investment in technical staff is necessary to build systems. As the market matures, the need for a large engineering workforce diminishes, and a leaner team can maintain the stable system.

Thus, cost reduction is inevitable once a business stabilizes. Even so, the loss from a 12‑hour outage can far exceed the salaries of a thousand engineers.

For developers, joining a fast‑growing company can be lucrative, but it’s wise to focus on the value you bring rather than assuming technical prowess alone guarantees long‑term security.

Conclusion

This article analyzed the rumored cause of Didi’s outage and examined upgrade strategies and cost‑reduction pressures. The takeaway is to keep systems robust: a severe incident hurts users first, and it can also affect your performance evaluation.


Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
