
Inside Alibaba Cloud Hong Kong Region C Outage: Timeline, Impact, and Lessons Learned

On December 18, 2022, Alibaba Cloud's Hong Kong Region Zone C suffered a large‑scale service interruption, the longest in the company's operational history. This summary covers the incident timeline, the impact on compute, storage, database, and networking services, and the root‑cause analysis that led to concrete infrastructure and communication improvements.

Programmer DD

Dec 26, 2022


Processing Timeline

08:56 – Monitoring raised a temperature alarm in Zone C; engineers began emergency handling and notified the data‑center service provider.

09:01 – Multiple temperature alarms triggered; a cooling unit malfunction was identified.

09:09 – A 4+4 primary‑backup switchover was attempted and the faulty cooling unit was rebooted; both attempts failed.

09:17 – Emergency cooling plan activated; auxiliary ventilation started, but some servers began overheating.

10:30 – Load‑shedding of compute, storage, network, and database clusters to prevent fire hazards.

12:30 – Supplier performed manual water‑fill and venting on cooling towers; some servers were shut down.

14:47 – Fire‑sprinkler triggered in one compartment due to high temperature.

15:20 – Manual configuration adjustments restored one cooling unit; subsequent units recovered sequentially.

18:55 – Four cooling units returned to normal capacity.

19:02 – Servers restarted in batches while monitoring temperature.

19:47 – Temperature stabilized; service restoration and data integrity checks began.

21:36 – Most servers powered on; one compartment remained offline for data safety verification.

22:50 – Final risk assessment completed; remaining servers restored.
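The timestamps above allow a quick calculation of how long each recovery phase took. The following is a minimal Python sketch, assuming all times fall on December 18, 2022 in the same local time zone; the dictionary keys are labels chosen here for illustration and do not come from the incident report.

```python
from datetime import datetime

# Timestamps copied from the timeline above; the key names are illustrative labels.
TIMELINE = {
    "first_alarm": "08:56",
    "first_chiller_restored": "15:20",
    "all_chillers_normal": "18:55",
    "remaining_servers_restored": "22:50",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Render a minute count as 'Xh YYm'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    start = TIMELINE["first_alarm"]
    for label, stamp in TIMELINE.items():
        if label != "first_alarm":
            print(label, format_duration(minutes_between(start, stamp)))
```

By this arithmetic, the first cooling unit came back about 6h 24m after the first alarm, and the last servers were restored about 13h 54m after it.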

Service Impact

Compute: Starting at 09:23, ECS instances in Zone C began shutting down, which triggered intra‑zone migration and affected dependent services such as EBS, OSS, and RDS. The ECS control plane was throttled, and its availability dropped to 20 % after 14:49. Launching instances from custom images failed because that path depended on a single‑AZ OSS deployment.

Storage: OSS services using local redundancy (LRS) were offline until 00:30 on December 19, while zone‑redundant (ZRS) storage remained largely unaffected.

Network: VPN, PrivateLink, and some GA instances saw limited impact; NAT experienced outages lasting minutes.

Database: RDS, MySQL, Redis, MongoDB, and DTS instances failed over across zones where possible; single‑AZ instances required manual migration or cloning.
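The difference between single‑zone and multi‑zone deployments described above can be illustrated with a small placement check. This is a deliberately simplified sketch: the resource names, zone names, and placements are hypothetical, and real redundancy schemes involve more than counting zones.

```python
from typing import Dict, List, Set

# Hypothetical placements for illustration only; not taken from the incident report.
PLACEMENTS: Dict[str, Set[str]] = {
    "oss-lrs-bucket": {"zone-c"},                      # local redundancy: one zone
    "oss-zrs-bucket": {"zone-b", "zone-c", "zone-d"},  # zone redundancy: several zones
    "rds-single-az": {"zone-c"},
    "rds-multi-az": {"zone-b", "zone-c"},
}


def survives_zone_failure(replica_zones: Set[str], failed_zone: str) -> bool:
    """True if at least one replica lives outside the failed zone."""
    return bool(replica_zones - {failed_zone})


def impacted(placements: Dict[str, Set[str]], failed_zone: str) -> List[str]:
    """List resources left with no replica once the given zone is lost."""
    return sorted(
        name
        for name, zones in placements.items()
        if not survives_zone_failure(zones, failed_zone)
    )


if __name__ == "__main__":
    print(impacted(PLACEMENTS, "zone-c"))
```

In this toy model, only the resources placed entirely in the failed zone are reported, which mirrors why the LRS and single‑AZ instances were hit hardest.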

Problem Analysis and Improvement Measures

Reason Analysis: A blockage in the cooling system's water circuit caused the four primary chillers to fail, and the backup units could not be started independently. The manual water‑fill and the adjustments to the group‑control logic each took roughly three hours, which extended the recovery time.

Improvement Measures: Expand monitoring coverage, refine data collection granularity, and ensure automatic switch‑over logic works correctly while guaranteeing accurate manual overrides to avoid deadlocks.
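The interaction between automatic group control and a manual override can be sketched as a small state model. This is an illustrative assumption about how such logic might be structured, not a description of the facility's actual control system; `GroupController`, `interlock_clear`, and `manual_override` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    GROUP = "group"            # unit starts only when the group controller allows it
    STANDALONE = "standalone"  # unit starts on its own, bypassing group control


@dataclass
class Chiller:
    name: str
    mode: Mode = Mode.GROUP
    running: bool = False


class GroupController:
    """Hypothetical controller that vetoes starts while its interlock is set."""

    def __init__(self, interlock_clear: bool) -> None:
        self.interlock_clear = interlock_clear

    def permits_start(self) -> bool:
        return self.interlock_clear


def try_start(chiller: Chiller, controller: GroupController) -> bool:
    """Start a chiller; in GROUP mode the controller can veto the start."""
    if chiller.mode is Mode.GROUP and not controller.permits_start():
        return False
    chiller.running = True
    return True


def manual_override(chiller: Chiller) -> None:
    """Operator action: detach one unit from group control so it can start alone."""
    chiller.mode = Mode.STANDALONE


if __name__ == "__main__":
    controller = GroupController(interlock_clear=False)
    unit = Chiller("backup-1")
    print(try_start(unit, controller))  # vetoed while under group control
    manual_override(unit)
    print(try_start(unit, controller))  # starts once detached from the group
```

The point of the sketch is that the override path is an explicit, testable operation, so a drill can verify in advance that a stuck group controller cannot block every unit.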

Reason Analysis: Overheating triggered fire‑sprinkler activation, flooding power cabinets and damaging hardware, which further delayed recovery.

Improvement Measures: Strengthen management of the data‑center service provider, standardize the emergency procedure for temperature rises in the data center, define clear shutdown actions for both customers and the facility, and conduct regular drills.
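One way to make a temperature‑rise procedure unambiguous is to write it down as an ordered escalation ladder. The sketch below assumes this approach; every threshold and action in it is a placeholder invented for illustration, not a value from the incident report or any vendor guidance.

```python
from typing import List, Tuple

# Hypothetical escalation ladder: (room temperature threshold in °C, action).
# Thresholds and actions are placeholders, not values from the incident report.
ESCALATION: List[Tuple[float, str]] = [
    (30.0, "notify on-call engineers and the data-center service provider"),
    (35.0, "start auxiliary ventilation and shed non-critical load"),
    (40.0, "begin orderly shutdown of servers in the affected room"),
]


def actions_for(temperature_c: float) -> List[str]:
    """Return every action whose threshold has been reached, in escalation order."""
    return [action for threshold, action in ESCALATION if temperature_c >= threshold]


if __name__ == "__main__":
    for reading in (28.0, 36.5, 41.0):
        print(reading, actions_for(reading))
```

Encoding the ladder this way gives drills something concrete to rehearse, and it makes the shutdown step a predefined action rather than a judgment call made under pressure.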

Reason Analysis: ECS control plane relied on resources in the failed zone, causing capacity shortages and custom‑image data service failures during new instance creation.

Improvement Measures: Perform a full‑network inspection, optimize multi‑AZ high‑availability designs to avoid single‑AZ dependencies, and enhance control‑plane disaster‑recovery drills.
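An inspection for single‑AZ dependencies can be framed as a walk over a dependency graph. The following is a minimal sketch under stated assumptions: the component names, zone assignments, and dependency edges are hypothetical and only loosely modeled on the custom‑image failure described above.

```python
from typing import Dict, List, Set

# Hypothetical inventory: the zones each component runs in.
ZONES: Dict[str, Set[str]] = {
    "ecs-control-plane": {"zone-b", "zone-c"},
    "custom-image-service": {"zone-b", "zone-c"},
    "image-store-oss": {"zone-c"},  # single-AZ dependency
}

# Hypothetical dependency edges between those components.
DEPENDS_ON: Dict[str, List[str]] = {
    "ecs-control-plane": ["custom-image-service"],
    "custom-image-service": ["image-store-oss"],
    "image-store-oss": [],
}


def single_az_risks(
    component: str,
    zones: Dict[str, Set[str]],
    depends_on: Dict[str, List[str]],
) -> Set[str]:
    """Return the component and transitive dependencies that run in only one zone."""
    risks: Set[str] = set()
    seen: Set[str] = set()
    stack = [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles and repeated visits
        seen.add(current)
        if len(zones[current]) < 2:
            risks.add(current)
        stack.extend(depends_on.get(current, []))
    return risks


if __name__ == "__main__":
    print(single_az_risks("ecs-control-plane", ZONES, DEPENDS_ON))
```

Here the control plane itself spans two zones, yet the walk still flags the single‑zone store two hops away, which is the kind of hidden dependency a full inspection is meant to surface.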

Reason Analysis: Incident updates were delayed on the status page and internal channels, leading to customer confusion.

Improvement Measures: Accelerate impact assessment, launch an upgraded status‑page system, and provide timely, transparent communications about service effects.

Summary

The Hong Kong Region Zone C outage on December 18, 2022 was the longest large‑scale failure in Alibaba Cloud’s operational history, affecting compute, storage, database, and network services. Detailed incident handling, root‑cause analysis of cooling‑system failures, and a set of concrete improvement actions were documented to enhance infrastructure reliability, emergency response, and customer communication.

