Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures and disrupted several API-dependent services for about 87 minutes, with nearly 2,000 customers reporting incidents. This post-mortem presents the root-cause analysis and the follow-up actions to improve system resilience and change management.

Wukong Talks Architecture

At 15:23 on April 8, Tencent Cloud's monitoring system detected an anomaly in the Cloud API service, and customers began reporting widespread console login failures.

Investigation showed that the API failure made the console unusable and also affected public cloud services that depend on the API, such as Cloud Functions, OCR, microservice platforms, audio content safety, and verification code services.

The outage lasted about 87 minutes, during which 1,957 customers reported incidents. From a customer perspective, cloud services are divided into data plane (carrying business workloads) and control plane (managing resources). The console and Cloud API belong to the control plane.

Existing IaaS resources (e.g., servers) were unaffected, and PaaS/SaaS services that do not rely on the API continued to operate normally. Aggregate traffic across all products showed no significant change, but API-dependent services such as Cloud Storage showed noticeable fluctuations.

Figure 1 shows the overall product traffic trend, and Figure 2 shows the storage service call trend during the API failure.

Problem Review

The handling process was as follows:

1. 15:23 – Fault detected; service recovery and root-cause investigation began immediately.

2. 15:47 – Rolling back the version did not fully restore service; deeper diagnosis started.

3. 15:57 – Root cause identified as erroneous configuration data; an emergency data‑repair plan was designed.

4. 16:02 – Data repair executed across all regions; API services began recovering region by region.

5. 16:05 – All regions except Shanghai recovered; further investigation on Shanghai continued.

6. 16:25 – A circular dependency involving the API was found in Shanghai; traffic was shifted to other regions to restore service.

7. 16:45 – Shanghai recovered; the API and dependent PaaS services were fully restored, and console capacity was scaled up ninefold.

8. 16:50 – Request volume returned to normal; services stabilized and the console was fully restored.

9. 17:45 – After one hour of observation with no issues, the incident was closed.
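
The reported duration can be cross-checked against the timeline. The short sketch below assumes that the outage is measured from fault detection (15:23) to full console restoration (16:50); the report does not state which two timestamps it used, so that pairing is an assumption.

```python
from datetime import datetime

# Timestamps copied from the timeline above. Which pair defines "the outage"
# is an assumption; the report only gives the total of about 87 minutes.
FMT = "%H:%M"
detected = datetime.strptime("15:23", FMT)
root_cause_found = datetime.strptime("15:57", FMT)
console_restored = datetime.strptime("16:50", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two same-day timestamps."""
    return int((end - start).total_seconds() // 60)


minutes_to_root_cause = minutes_between(detected, root_cause_found)
outage_minutes = minutes_between(detected, console_restored)

print(minutes_to_root_cause)  # 34
print(outage_minutes)         # 87
```

Under that assumption the timeline agrees with the stated 87 minutes, and roughly the first third of the outage was spent on rollback and diagnosis before the root cause was found.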

The root cause was twofold: the new API version had not been tested thoroughly enough for forward compatibility, and the gray-release mechanism was inadequate. As a result, the release generated incorrect configuration data that propagated quickly to all regions.

During rollback, a circular dependency between the API service and its container platform prevented automatic restart, requiring manual intervention to bring the API back online.
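
Circular dependencies of this kind can be caught before deployment by checking the startup dependency graph for cycles. The following is a minimal sketch, not Tencent Cloud's tooling: the function `find_cycle`, the graph `startup_deps`, and the service names in it are all hypothetical, and the direction of each edge is assumed for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    deps maps each service to the services it needs in order to start.
    Uses depth-first search with three-state marking.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(deps):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical graph: names and edges are illustrative assumptions only.
startup_deps = {
    "console": ["cloud-api"],
    "cloud-api": ["container-platform"],
    "container-platform": ["cloud-api"],
}

cycle = find_cycle(startup_deps)
print(cycle)  # ['cloud-api', 'container-platform', 'cloud-api']
```

A check like this could run in a deployment pipeline and fail the build whenever a new dependency closes a loop, so that a cold restart never requires manual intervention to break the cycle.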

Improvement Measures

1. Enhance system resilience by regularly conducting change‑simulation drills, optimizing deployment architecture to avoid circular dependencies, and providing an API escape route for rapid failover.

2. Strengthen change management with comprehensive automated test suites, phased gray‑release strategies per cluster/zone/region, and automatic circuit‑breaker mechanisms to halt problematic changes.

3. Improve fault response and communication by upgrading incident handling workflows, delivering clear external notifications about impact scope and root cause, and redesigning the status page to reduce reliance on the affected services.
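
The phased gray release with an automatic circuit breaker described in measure 2 can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual release system: `Stage`, `staged_rollout`, the 1% error threshold, and the target names are all hypothetical, and the error-rate lookup stands in for a real monitoring query.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str            # e.g. "cluster", "zone", "region"
    targets: List[str]   # deployment targets changed in this stage


def staged_rollout(
    stages: List[Stage],
    apply_change: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # assumed threshold, tune per service
) -> Dict[str, object]:
    """Apply a change stage by stage and halt when a stage looks unhealthy."""
    applied: List[str] = []
    for stage in stages:
        for target in stage.targets:
            apply_change(target)
            applied.append(target)
        # Circuit breaker: inspect the worst target before widening the blast radius.
        worst = max((error_rate(t) for t in stage.targets), default=0.0)
        if worst > max_error_rate:
            return {"status": "halted", "failed_stage": stage.name, "applied": applied}
    return {"status": "completed", "failed_stage": None, "applied": applied}


# Hypothetical rollout plan and simulated monitoring data.
stages = [
    Stage("cluster", ["cluster-a"]),
    Stage("zone", ["zone-1"]),
    Stage("region", ["region-x", "region-y"]),
]
simulated_error_rates = {"cluster-a": 0.001, "zone-1": 0.2}

touched: List[str] = []
result = staged_rollout(
    stages,
    apply_change=touched.append,
    error_rate=lambda target: simulated_error_rates.get(target, 0.0),
)
print(result["status"], result["failed_stage"])  # halted zone
```

In this simulated run the change stops at the zone stage, so neither region is ever touched. A production version would also wait for a soak period between stages and trigger a rollback of the targets listed in `applied`.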

Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.