What Caused the 87‑Minute Tencent Cloud API Outage and How It Was Fixed?
On April 8, 2024, a cloud API failure disrupted Tencent Cloud's console for 87 minutes and affected 1,957 customers. The incident review that followed traced the outage to version-compatibility and configuration-data problems and produced a set of operational improvements.
At 15:23 on April 8, 2024, Tencent Cloud received an alarm that the cloud API service was behaving abnormally; many customers were unable to log in to the console. The incident lasted nearly 87 minutes, during which 1,957 customers reported problems.
The cloud API is a unified set of open interfaces that allows programmatic management of cloud resources; the console relies on these APIs for its interactive functions. Because the API was down, control‑plane services such as cloud functions, OCR, microservice platforms, audio content security, and verification codes became unavailable, while data‑plane resources (e.g., already deployed IaaS servers) remained unaffected.
From a customer perspective, cloud services consist of a data plane that carries business workloads and a control plane that manages those resources. The outage impacted only the control plane.
Problem Review
15:23 – Fault detected, immediate recovery actions started.
15:47 – Rolling back the version did not fully restore service; further investigation began.
15:57 – Root cause identified as erroneous configuration data; an emergency data‑repair plan was designed.
16:02 – Data repair executed across all regions; API services began recovering region by region.
16:05 – All regions except Shanghai recovered; focus shifted to Shanghai.
16:25 – Discovered a circular dependency in Shanghai’s technical components; traffic was redirected to other regions.
16:45 – Shanghai region recovered; API and dependent PaaS services were fully restored, but console traffic surged, prompting a nine‑fold capacity expansion.
16:50 – Request volume returned to normal, business stabilized, and console services were fully restored.
17:45 – After one hour of observation with no issues, the incident was closed.
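The durations implied by the timeline can be checked with a few lines of Python. This is only an arithmetic sketch: the timestamps come from the timeline above, while the variable names and the choice of milestones to compare are ours.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the timeline (April 8, 2024, local time).
fault_detected = datetime(2024, 4, 8, 15, 23)
root_cause_found = datetime(2024, 4, 8, 15, 57)
console_restored = datetime(2024, 4, 8, 16, 50)

time_to_root_cause = minutes_between(fault_detected, root_cause_found)
outage_minutes = minutes_between(fault_detected, console_restored)

print(time_to_root_cause)  # 34
print(outage_minutes)      # 87
```

Measured from detection at 15:23 to full console recovery at 16:50, the outage comes to the 87 minutes quoted in the summary.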
The root cause was twofold: the new cloud API version was not sufficiently compatible with data written by the previous version, and the gray-release (staged rollout) mechanism was inadequate, which allowed erroneous configuration data to propagate quickly across all regions.
During the upgrade, changes to the interface protocol broke the handling of legacy data, generating faulty configuration entries. The lack of a robust gray‑release process let these errors spread system‑wide, causing the API outage.
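To illustrate what handling legacy data safely can look like, here is a minimal sketch of a parser that accepts both an old and a new layout of a configuration entry and refuses anything it does not recognize. The review does not describe Tencent Cloud's actual data format, so the `Route` type, the `schema_version` field, and both layouts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    host: str
    port: int


def parse_route(entry: dict) -> Route:
    """Accept both the hypothetical legacy and new layouts of a route entry."""
    # Assumption: legacy entries carry no version field at all.
    version = entry.get("schema_version", 1)
    if version == 1:
        # Legacy layout (assumed): {"endpoint": "host:port"}
        host, _, port = entry["endpoint"].rpartition(":")
    elif version == 2:
        # New layout (assumed): {"schema_version": 2, "host": ..., "port": ...}
        host, port = entry["host"], entry["port"]
    else:
        raise ValueError(f"unsupported schema_version: {version!r}")
    if not host:
        # Fail loudly instead of writing a half-parsed entry back to storage.
        raise ValueError(f"malformed entry: {entry!r}")
    return Route(host=host, port=int(port))


legacy = parse_route({"endpoint": "api.internal:8080"})
current = parse_route({"schema_version": 2, "host": "api.internal", "port": 8080})
print(legacy == current)  # True
```

The point of the design is that an unreadable entry raises an error rather than being silently converted into a faulty one, which is the failure mode the review describes.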
Rolling back to the previous version was not enough on its own. The container platform that schedules the API service itself depended on the API, and this circular dependency prevented the API backend from recovering automatically, so operators had to restart it manually.
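A circular dependency of this kind can be caught ahead of time by checking the service dependency graph for cycles. The sketch below uses a depth-first search; the service names and the shape of the graph are illustrative assumptions, not Tencent Cloud's real topology.

```python
from typing import Optional


def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in deps}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: the loop is closed.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical graph: each service maps to the services it depends on.
services = {
    "console": ["cloud_api"],
    "cloud_api": ["container_platform"],
    "container_platform": ["cloud_api"],
}
print(find_cycle(services))
# ['cloud_api', 'container_platform', 'cloud_api']
```

A check like this could run in code review or CI against a declared dependency manifest, so that a new edge closing a loop is flagged before it reaches production.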
Improvement Measures
Enhance system resilience by regularly conducting simulated change‑strategy drills and adopting layered architectures with code reviews and monitoring to avoid circular dependencies.
Provide an API escape route for rapid failover when faults occur.
Strengthen change management with extensive automated test suites, sandbox validation, and staged gray‑release strategies per cluster, zone, and region.
Introduce automatic circuit‑breaker mechanisms to halt problematic changes immediately.
Upgrade fault-response processes to ensure real-time progress updates and transparent communication of impact scope, root cause, and expected recovery time.
Improve the Tencent Cloud status dashboard by reducing its reliance on the cloud API and adding caching and disaster-recovery mechanisms, so that incident information stays accurate and timely.
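The staged gray release and the automatic circuit breaker from the list above can be combined in one control loop: apply the change to one stage at a time, check health, and stop and undo as soon as a stage looks unhealthy. The following is a minimal sketch under stated assumptions; the stage names, the error-rate threshold, and the callback signatures are hypothetical, and it does not cover repairing data that a bad change has already written.

```python
from typing import Callable, Iterable


def staged_rollout(
    stages: Iterable[str],
    apply_change: Callable[[str], None],
    error_rate: Callable[[str], float],
    rollback: Callable[[str], None],
    max_error_rate: float = 0.01,
) -> list[str]:
    """Apply a change stage by stage; halt and roll back on the first bad stage.

    Returns the stages that still carry the change when the function exits.
    """
    done: list[str] = []
    for stage in stages:
        apply_change(stage)
        done.append(stage)
        if error_rate(stage) > max_error_rate:
            # Circuit breaker: stop widening the blast radius and undo the
            # change everywhere it was applied, newest stage first.
            for applied in reversed(done):
                rollback(applied)
            return []
    return done


# Simulated run: the second stage reports a 20% error rate.
applied: list[str] = []
rolled_back: list[str] = []
rates = {"cluster-a": 0.0, "zone-1": 0.2, "region-x": 0.0}

result = staged_rollout(
    ["cluster-a", "zone-1", "region-x"],
    apply_change=applied.append,
    error_rate=rates.__getitem__,
    rollback=rolled_back.append,
)
print(result)       # []
print(applied)      # ['cluster-a', 'zone-1']
print(rolled_back)  # ['zone-1', 'cluster-a']
```

In the simulated run the change never reaches `region-x`, which is the property the incident lacked: the faulty configuration spread to every region before anything stopped it.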
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
