
Why 2023 Saw a Spike in Cloud Outages: Key Lessons for High‑Availability

2023 witnessed numerous high‑profile cloud service failures—from Alibaba’s Hong Kong data‑center cooling issue to Tencent’s storage outage—highlighting how cost‑cutting, reduced staffing, and insufficient disaster‑recovery planning amplify risk, and outlining essential high‑availability, failover, and multi‑region strategies for resilient operations.

Since cost‑cutting and efficiency drives have swept the internet industry, online incidents have multiplied, especially in 2023. Notable failures include:

December 18, 2022 – Alibaba Cloud Hong Kong PCCW data‑center cooling equipment failure: Over 12 hours of outage, affecting many public services in Macau; the incident is regarded as Alibaba Cloud’s biggest embarrassment.

March 29, 2023 – Tencent Cloud Guangzhou "Yun Gu" data‑center cooling equipment failure: Caused widespread impact on WeChat and QQ, classified as a top‑level incident with accountability reaching senior management.

October 25, 2023 – Yuque outage lasting more than 8 hours: A bug in an operations tool took storage nodes offline; recovery required restoring from data backups and took over two hours to verify, exposing insufficient drills and weak RPO/RTO guarantees.

November 12, 2023 – Alibaba Cloud console outage: Over 5 hours of downtime.

November 27, 2023 – Didi outage lasting into the next day: Attributed to a core Kubernetes failure; multiple data‑centers could not sustain load.

December 3, 2023 – Tencent Video outage: Paid‑member video playback failed due to a storage failure that triggered a cascade.

Why 2023 Had So Many Failures

Personnel Changes and Resource Compression from Cost‑Cutting

Rumors suggest that internet companies laid off many developers in 2022‑2023, shrinking teams from five or six engineers per system to just one. The remaining engineer must handle daily development, system stability, and cost reduction, often deploying micro‑services with a single CPU core and one replica, leaving no headroom for traffic spikes.

Reduced staffing and compressed machine resources inevitably lead to stability problems; spending money to buy back stability after the fact becomes the conventional remedy.

Increasing Business Complexity

Most incidents involve long‑running legacy services. With fewer developers, each person must maintain more micro‑services, raising complexity. Accumulated technical debt finally surfaces.

High‑Availability and Disaster Recovery Considerations

Programmers must design for failure and govern existing architectures. Two key terms are Failover and Failback.

Failover is the capability to switch seamlessly to a reliable backup system when a component or the primary system fails, reducing or eliminating negative impact on users; Failback is the subsequent process of shifting traffic back to the primary system once it has recovered.

Disaster Recovery (DR) is a formal discipline; its requirements, such as recovery point and recovery time objectives (RPO/RTO), are rigorously defined in national standards.

Common root causes of major incidents include:

Changes (code, configuration, releases)

Program bugs

Capacity shortages due to insufficient provisioning, leading to overload in services or upstream/downstream dependencies.

Most severe outages stem from a bug or change that triggers a traffic surge (e.g., retries), causing a cascade, or from sudden external traffic spikes.

Layered Disaster‑Recovery Capabilities

Endpoint App Layer

Prevent uncontrolled retries (use exponential backoff, limit attempts, retry only on specific error codes); a sketch follows this list.

Disable pre‑fetching during backend failures.

Drop non‑critical requests (e.g., telemetry, bulk operations).

Cache page‑level data to avoid repeated backend calls.

Coordinate cross‑region or cross‑operator traffic steering with the access layer.
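
To make the retry guidance concrete, here is a minimal Go sketch of a bounded retry helper: exponential backoff with a cap, random jitter, and retries only for errors explicitly marked retryable. The names (doWithRetry, isRetryable) and the delay values are illustrative assumptions, not part of the original article.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRetryable marks errors (e.g. timeouts, 503s) that are safe to retry;
// permanent errors (4xx, validation failures) must not be retried.
var errRetryable = errors.New("retryable")

func isRetryable(err error) bool {
	return errors.Is(err, errRetryable)
}

// doWithRetry runs op at most maxAttempts times, sleeping with
// exponential backoff plus jitter between attempts so that a failing
// backend is not hammered by synchronized retries.
func doWithRetry(op func() error, maxAttempts int, baseDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return err // fail fast on permanent errors
		}
		// 100ms, 200ms, 400ms, ... capped at 5s, with random jitter.
		delay := baseDelay << attempt
		if delay > 5*time.Second {
			delay = 5 * time.Second
		}
		time.Sleep(delay + time.Duration(rand.Int63n(int64(baseDelay))))
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := doWithRetry(func() error {
		calls++
		if calls < 3 {
			return errRetryable // simulate two transient failures
		}
		return nil
	}, 4, 100*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

Bounded attempts, a capped backoff, and jitter are the essential properties; without them, synchronized retries from many clients produce exactly the traffic amplification and cascades described earlier.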

Access Layer

The access layer sits between user networks and business logic, handling authentication, protocol conversion, and routing. Its reliability is critical because a failure impacts all users.

Collaborate with global gateways and load balancers for cross‑region and availability‑zone traffic steering.

Implement overload protection and rapid auto‑scaling, with throttling when capacity is exceeded (a throttling sketch follows this list).

Graceful degradation: bypass non‑essential downstream services when they fail.

Cache downstream responses to reduce traffic penetration.
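
A minimal sketch of the overload‑protection point, assuming Go's golang.org/x/time/rate token bucket in front of an HTTP handler; the rate, burst, and route are made‑up values for illustration. Excess requests are rejected with HTTP 429 at the access layer so overload is shed before it reaches the business logic.

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// limiter allows a sustained 1000 requests/second with a burst of 200;
// the numbers would come from the measured capacity of the backend.
var limiter = rate.NewLimiter(rate.Limit(1000), 200)

// throttle rejects excess traffic with 429 so that overload is shed at
// the access layer instead of cascading into downstream services.
func throttle(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "too many requests, please retry later", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/orders", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", throttle(mux))
}
```

In practice the limit would be derived from measured backend capacity and combined with auto‑scaling, so throttling only engages while new capacity is coming online.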

Business Logic Layer

Micro‑services handling core logic should support:

Rapid scaling (leveraging cloud‑native HPA/VPA, capacity pools, and global resource scheduling).

Graceful degradation that remains invisible to users or offers reduced quality (e.g., lower image resolution).

Lossy overload protection, including adaptive rate limiting based on CPU, latency, or custom policies (e.g., Netflix’s concurrency‑limits, Alibaba’s Sentinel, TCP congestion algorithms like BBR).
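
The adaptive‑limiting idea can be sketched as an AIMD (additive‑increase, multiplicative‑decrease) concurrency limit driven by observed latency, loosely in the spirit of concurrency‑limits and BBR. This is a simplified illustration of the principle with assumed names and thresholds, not the algorithm those projects actually ship.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// adaptiveLimit tracks a concurrency limit that grows slowly while
// latency stays under a target and shrinks sharply when it does not
// (additive increase, multiplicative decrease).
type adaptiveLimit struct {
	mu        sync.Mutex
	limit     float64
	inflight  int
	targetRTT time.Duration
}

// acquire admits a request only if in-flight work is below the limit.
func (a *adaptiveLimit) acquire() bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	if float64(a.inflight) >= a.limit {
		return false // shed load instead of queueing
	}
	a.inflight++
	return true
}

// release records the observed latency and adjusts the limit.
func (a *adaptiveLimit) release(observed time.Duration) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.inflight--
	if observed <= a.targetRTT {
		a.limit += 1 // additive increase while the service is healthy
	} else {
		a.limit *= 0.8 // multiplicative decrease under pressure
		if a.limit < 1 {
			a.limit = 1
		}
	}
}

func main() {
	lim := &adaptiveLimit{limit: 10, targetRTT: 50 * time.Millisecond}
	if lim.acquire() {
		start := time.Now()
		time.Sleep(20 * time.Millisecond) // simulated request handling
		lim.release(time.Since(start))
	}
	fmt.Printf("current limit: %.1f\n", lim.limit)
}
```

The same structure works with CPU utilization or queue length as the feedback signal instead of latency.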

Storage Layer

Using Redis cache and MySQL as examples:

Deploy multi‑AZ configurations (e.g., one primary, two replicas across zones) to ensure availability after a zone failure; a read‑routing sketch follows this list.

Enable global replication for cross‑region disaster recovery.

For Tencent Cloud TDSQL, employ cross‑region disaster‑backup instances, read‑only replicas for proximity, and multi‑region deployment.
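
To illustrate the replica points above, a minimal sketch of application‑side read/write routing: writes always target the primary, while reads prefer a healthy replica in the caller's zone and fall back to other replicas or the primary. The endpoints and zone names are hypothetical, and real replica promotion after a primary failure is left to the database or cloud service itself.

```go
package main

import "fmt"

// replicaSet models one primary plus per-AZ read replicas; the endpoint
// names are hypothetical placeholders, not real product configuration.
type replicaSet struct {
	primary  string
	replicas map[string]string // availability zone -> replica endpoint
	healthy  map[string]bool   // endpoint -> last health-check result
}

// readEndpoint prefers the replica in the caller's zone, then any other
// healthy replica, and only then the primary, so a single-AZ failure
// degrades read locality rather than availability.
func (r *replicaSet) readEndpoint(zone string) string {
	if ep, ok := r.replicas[zone]; ok && r.healthy[ep] {
		return ep
	}
	for _, ep := range r.replicas {
		if r.healthy[ep] {
			return ep
		}
	}
	return r.primary
}

// writeEndpoint always returns the primary; promoting a replica after a
// primary failure is handled by the database's own failover mechanism.
func (r *replicaSet) writeEndpoint() string { return r.primary }

func main() {
	rs := &replicaSet{
		primary: "mysql-primary.az1.example.internal:3306",
		replicas: map[string]string{
			"az2": "mysql-replica.az2.example.internal:3306",
			"az3": "mysql-replica.az3.example.internal:3306",
		},
		healthy: map[string]bool{
			"mysql-replica.az2.example.internal:3306": false, // az2 replica down
			"mysql-replica.az3.example.internal:3306": true,
		},
	}
	fmt.Println("reads from az2 go to:", rs.readEndpoint("az2"))
	fmt.Println("writes go to:", rs.writeEndpoint())
}
```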

Cross‑City / Cross‑AZ Disaster Recovery

Traditional telecom and finance used a "two‑site three‑center" model with cold standby. Modern internet services adopt active/active or active/passive multi‑site architectures, leveraging cloud concepts of regions and availability zones.

Soft Requirements During an Incident

Strong psychological resilience for operators executing overload or degradation strategies.

Simplified runbooks to avoid mistakes under pressure.

Regular drills and disaster‑recovery rehearsals.

Clear decision‑making authority for rapid, accountable actions.

Awareness that the incident itself is not frightening; the unknown is.

Promote a blameless culture to encourage transparent reporting.

Apply the "accident triangle" principle: reducing minor incidents lowers the chance of major ones.

scalability · high-availability · cloud outage · disaster-recovery
Written by Tech Architecture Stories

Internet tech practitioner sharing insights on business architecture, technology, and a lifelong love of tech.
