
Alibaba Cloud Hong Kong Region Outage Postmortem – December 18, 2022

On December 18, 2022, Alibaba Cloud's Hong Kong Region suffered a severe cooling-system failure that caused a 14-hour outage of ECS, OSS, EBS, RDS and other services. This article walks through the emergency response, the service impact, and the improvement actions laid out in the detailed postmortem.


Incident Timeline

At 08:56 on December 18, Alibaba Cloud monitoring detected a temperature alarm in the C‑zone of the Hong Kong Region data center; engineers intervened and notified the on‑site service provider for inspection.

At 09:01 another temperature rise alarm was triggered, revealing an abnormal chiller.

At 09:09 the service provider attempted a 4+4 primary‑backup switch and reboot of the faulty chiller, but the operation failed and the cooling unit could not recover.

At 09:17 the emergency cooling plan was activated, auxiliary ventilation was applied, and engineers tried to isolate and manually restore each chiller, but stability could not be achieved, prompting further on‑site investigation.

From 10:30, to avoid potential fire hazards due to high temperature, engineers gradually throttled compute, storage, network and database workloads across the data center while continuing attempts to restore the chillers.

At 12:30 the chiller vendor arrived; after manual water‑top‑up and air‑bleeding the system still could not stay stable, leading engineers to shut down high‑temperature compartments.

At 14:47, the fire-sprinkler system in one compartment was triggered by the excessive heat.

At 15:20 the vendor manually adjusted configurations, unlocking the chiller group control; the first chiller returned to normal and temperature began to drop, followed by the others.

By 18:55, all four chillers had recovered normal cooling capacity.

At 19:02 servers were restarted in batches with temperature monitoring.

At 19:47 the data center temperature stabilized; engineers resumed service restoration and performed necessary data integrity checks.

By 21:36 most servers were back online after thorough data safety verification.

At 22:50 final data checks and risk assessments were completed, and the last compartment was powered up safely.

Service Impact

From 09:23, ECS instances in the C‑zone began failing, triggering zone‑wide migration; as temperature rose, more servers shut down, affecting EBS, OSS, RDS and other services.

The fault did not affect workloads in other zones, but the control plane for Hong Kong Region ECS was impaired.

Starting at 14:49, ECS control services were throttled, dropping availability to as low as 20%; custom‑image based instance creation sometimes failed because the required OSS service was unavailable.
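When a regional control plane is this heavily throttled, the usual client-side mitigation is to retry idempotent calls with exponential backoff and jitter rather than hammering the degraded API. The sketch below is a generic Python pattern, not something from Alibaba Cloud's postmortem; the `run_instances` call in the usage comment is a hypothetical wrapper around whatever SDK you use.

```python
import random
import time

def call_with_backoff(operation, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Retry a throttled control-plane call with exponential backoff and full jitter.

    `operation` is any zero-argument callable that raises on failure, e.g. a
    lambda wrapping an ECS RunInstances request.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # in real code, catch only the SDK's throttling/5xx errors
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter spreads retries out

# Hypothetical usage:
# call_with_backoff(lambda: ecs_client.run_instances(zone_id="cn-hongkong-b", image_id="m-example"))
```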

DataWorks and Kubernetes console operations were also impacted; full API recovery occurred at 23:11.

At 10:37, OSS in zone C began to experience faults; because prolonged high temperature risked disk failures, service was interrupted from 11:07 to 18:26.

Alibaba Cloud offers two OSS redundancy types in the region: locally redundant storage (LRS, single-AZ), deployed only in zone C, and zone-redundant storage (ZRS, multi-AZ), deployed across zones B, C and D. The ZRS service remained largely unaffected, while the LRS service suffered an extended outage because it had no cross-zone failover.
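For buckets that need to survive a single-zone failure, the redundancy type has to be chosen at creation time. Below is a minimal sketch using the oss2 Python SDK; the endpoint, bucket name and credentials are placeholders, and the constant names assume a recent oss2 release, so verify them against the SDK version you actually use.

```python
import oss2

# Placeholder credentials and endpoint; replace with your own.
auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hongkong.aliyuncs.com", "example-zrs-bucket")

# Request zone-redundant storage (ZRS) so objects are replicated across
# availability zones instead of living in a single AZ (LRS).
config = oss2.models.BucketCreateConfig(
    oss2.BUCKET_STORAGE_CLASS_STANDARD,
    oss2.BUCKET_DATA_REDUNDANCY_TYPE_ZRS,
)
bucket.create_bucket(oss2.BUCKET_ACL_PRIVATE, config)
```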

From 18:26 storage servers were restarted in batches; some LRS servers required isolation because of fire‑sprinkler activation, and extensive data integrity verification was performed before service could be restored.

The LRS service did not fully resume external access until 00:30 on December 19.

Some single-zone network products (VPN Gateway, PrivateLink, and a small number of Global Accelerator instances) were also affected.

At 11:21, engineers initiated zone-level disaster escape for network products; by 12:45 most network services (SLB, etc.) had completed the escape, and NAT finished at 13:47, so the impact was limited to minutes.

From 10:17, RDS instances in zone C began reporting unavailability; cross‑zone migration was performed for most MySQL, Redis, MongoDB and DTS instances, while some single‑zone HA instances required manual address switching.

By around 21:30, the majority of database instances had recovered; for the remaining single-node or HA instances, clone and migration solutions were offered, though some required extended handling.

Customers with multi‑zone deployments maintained service continuity; high‑availability customers are advised to adopt full‑link multi‑zone architectures.
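A "full-link" multi-zone architecture means every hop, not just the database, has a counterpart in another zone and the caller knows how to reach it. The sketch below shows the idea at its crudest, client-side fallback between two zone endpoints; the endpoint names are hypothetical, and in practice this logic usually lives in DNS, SLB, or a service mesh rather than in application code.

```python
import requests

# Hypothetical per-zone endpoints for the same service tier.
ZONE_ENDPOINTS = [
    "https://api.zone-b.example.internal",  # primary zone
    "https://api.zone-d.example.internal",  # standby zone
]

def get_with_zone_failover(path, timeout=2.0):
    """Return the first healthy response, trying each zone in order."""
    last_error = None
    for endpoint in ZONE_ENDPOINTS:
        try:
            resp = requests.get(endpoint + path, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            last_error = exc  # zone unreachable or unhealthy; try the next one
    raise RuntimeError("all zones failed") from last_error

# Example: get_with_zone_failover("/healthz")
```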

Problem Analysis and Improvement Measures

1. Prolonged chiller recovery time

Cause: A water shortage created an air blockage in the cooling circuit, preventing the four primary chillers from operating and also bringing down the backup chillers. Manual water top-up and air bleeding were required, and the group-control logic prevented any single chiller from starting independently, which extended recovery time (3 h 34 min for diagnosis, 2 h 57 min for water top-up, 3 h 32 min to unlock group control).

Improvement: Conduct comprehensive checks of data‑center infrastructure control systems, expand monitoring coverage and granularity, and ensure automatic switch‑over logic works correctly while maintaining accurate manual override procedures to avoid state deadlocks.

2. Delayed on-site handling leading to fire-sprinkler activation

Cause: The cooling failure raised compartment temperatures to the sprinkler trigger threshold, and the resulting discharge caused water damage to power cabinets and hardware.

Improvement: Strengthen data‑center service‑provider management, refine temperature‑rise response plans, define clear shutdown actions for both business side and data‑center, and reinforce execution through regular drills.

3. Failure of new ECS instance creation in the Hong Kong Region

Cause: The ECS control plane relies on dual-zone (B + C) disaster recovery; after the zone C fault, new-instance-creation traffic overloaded the zone B control service, and custom-image data depended on the single-AZ OSS service in zone C.

Improvement: Perform full‑network inspection, optimize multi‑AZ high‑availability design to avoid single‑AZ OSS and middleware dependencies, and enhance control‑plane disaster‑recovery drills.
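One way to act on this is to make hidden single-AZ dependencies visible before an incident does. The toy audit below walks a hypothetical dependency manifest and flags anything deployed in only one zone; the manifest format and entries are illustrative, not Alibaba Cloud's.

```python
# Hypothetical dependency manifest for a control-plane service.
DEPENDENCIES = [
    {"name": "custom-image-bucket", "kind": "oss", "zones": ["cn-hongkong-c"]},
    {"name": "config-middleware",   "kind": "mq",  "zones": ["cn-hongkong-b", "cn-hongkong-c"]},
    {"name": "metadata-db",         "kind": "rds", "zones": ["cn-hongkong-b", "cn-hongkong-c"]},
]

def single_az_dependencies(deps):
    """Return every dependency that disappears when one zone goes down."""
    return [d for d in deps if len(d["zones"]) < 2]

for dep in single_az_dependencies(DEPENDENCIES):
    print(f"single-AZ risk: {dep['kind']}/{dep['name']} only in {dep['zones'][0]}")
```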

4. Insufficiently timely and transparent fault communication

Cause: Although Alibaba Cloud used DingTalk groups and announcements, the slow progress of the cooling repairs led to delayed updates, and the status page was not refreshed promptly, leaving customers confused.

Improvement: Accelerate rapid assessment and identification of fault impact, launch an upgraded service health status page, and provide faster, clearer communication channels for customers to track incident effects.

Tags: Operations, incident management, databases, Alibaba Cloud, outage
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independent developer of a PMP practice-quiz mini-program.
