Tagged articles

incident analysis

25 articles · Page 1 of 1

Dec 31, 2025 · Operations

12 Major 2025 Internet Outages: What Every Ops Team Can Learn

This article compiles twelve high‑profile internet service failures from 2025, detailing each incident’s description, micro‑scenario, technical root cause, and risk perspective, and extracts actionable lessons on infrastructure resilience, change management, and security‑aware operations.

Internet Outages · Operations · Reliability

0 likes · 20 min read

Architect

Nov 24, 2025 · Operations

What Caused the Massive Cloudflare Outage on Nov 18 2025? A Deep Technical Breakdown

On the night of November 18, 2025, Cloudflare suffered a three‑hour core failure that crippled roughly half of the internet. This article details the timeline, the global impact, the root cause (a ClickHouse permission change), and the remediation steps taken to restore service.

Bot Management · CDN · Cloudflare

0 likes · 10 min read

dbaplus Community

Nov 19, 2025 · Operations

Why Did Cloudflare’s Global Outage Happen on Nov 18 2025? Inside the Bot Management Bug

On the night of November 18 2025, Cloudflare suffered a worldwide outage that crippled services like ChatGPT, X, Spotify, and major gaming platforms, and a detailed post‑mortem reveals that a ClickHouse permission change caused an oversized bot‑management configuration file to crash edge nodes.

Bot Management · CDN · ClickHouse

0 likes · 9 min read

Tech Freedom Circle

Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Java · Microservices · Monitoring

0 likes · 34 min read

IT Services Circle

Jun 20, 2025 · Cloud Computing

Why Did Google Cloud Crash and What It Means for Multi‑Cloud Strategies

A massive outage on June 12, 2025 crippled Google Cloud, AWS, and Azure, exposing the hidden risks of multi‑cloud architectures as a simple NullPointerException cascaded into a global digital infrastructure failure.

AWS · Azure · Google Cloud

0 likes · 4 min read

Open Source Linux

Jan 18, 2025 · Operations

What Caused Alipay’s 5‑Minute P0 Outage and How Much Was Lost?

The article dissects Alipay’s rare P0 incident on January 16 2025, explaining how a misconfigured marketing template triggered a 20% discount for all transactions, detailing the rapid five‑minute fix, estimating the financial loss at roughly 14 million yuan, and outlining operational lessons and accountability.

Operations · deployment risk · financial loss

0 likes · 11 min read

System Architect Go

Dec 19, 2024 · Operations

Why Did OpenAI’s New Telemetry Crash Their Kubernetes Cluster?

On December 11, 2024 OpenAI’s Kubernetes cluster suffered a four‑hour outage after a newly deployed telemetry service generated massive API traffic from every node, overwhelming the kube‑apiserver, breaking DNS‑based service discovery, and exposing gaps in control‑plane monitoring and break‑glass mechanisms, prompting critical questions about component behavior and configuration.

API overload · Control Plane · DNS

0 likes · 8 min read

FunTester

Nov 13, 2024 · Industry Insights

What Caused Alipay’s Double‑11 Outage? Inside the System Message Library Failure

On Double 11, Alipay suffered a multi‑hour outage that blocked payments, Yu'e Bao withdrawals, and Huabei repayments. Its official apology blamed a partial failure of the system message library, a critical database that stores and routes system messages; the article highlights the hardware, software, network, and data factors behind the incident.

Alipay · System Message Library · Technology Failures

0 likes · 8 min read

Programmer DD

Aug 16, 2024 · Backend Development

How a Hidden Uint Overflow Triggered Massive Traffic Spikes and the Memory‑Leak Mystery I Solved

This article recounts a developer's journey from a fresh graduate to a senior backend engineer, detailing two real‑world incidents—a pseudo‑memory‑leak in a C++ service and a uint overflow that caused traffic bursts—showing the analysis steps, code fixes, and lessons learned for reliable backend development.

C++ · Performance Optimization · incident analysis

0 likes · 19 min read

DevOps Operations Practice

May 20, 2024 · Cloud Computing

Google Cloud Data Deletion Incident at UniSuper: Causes, Impact, and Lessons Learned

Google Cloud mistakenly deleted data and backups for Australian pension fund UniSuper, causing over 600,000 members to lose access for more than a week, and the incident highlights the risks of single‑provider reliance, the importance of robust backup strategies, and the growing relevance of hybrid and multi‑cloud architectures.

Cloud Computing · Data loss · Google Cloud

0 likes · 5 min read

dbaplus Community

Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced event intelligent analysis system that automates fault identification, analysis, and remediation in data‑center operations, detailing its architecture, integration with monitoring, CMDB, ITSM, big‑data platforms, and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Automation · Big Data

0 likes · 18 min read

Su San Talks Tech

Dec 6, 2023 · Operations

What Went Wrong in Didi’s 12‑Hour Outage? Lessons on Kubernetes Upgrades and Cost‑Cutting

An in‑depth review of Didi’s 12‑hour P0 outage reveals how a mistaken Kubernetes version downgrade during an in‑place upgrade caused master node failure, discusses cluster isolation, upgrade strategies, and the role of cost‑cutting pressures, offering practical lessons for large‑scale operations.

Operations · cluster upgrade · cost management

0 likes · 7 min read

Code Ape Tech Column

Dec 4, 2023 · Cloud Native

Analysis of Didi’s Kubernetes Outage and General Mitigation Strategies

The article reviews Didi’s 12‑hour P0 outage caused by a Kubernetes upgrade failure in a massive cluster, discusses the root causes, and proposes general solutions such as federation, careful upgrade planning, and multi‑master designs to avoid similar incidents.

cluster scaling · incident analysis · kubernetes

0 likes · 8 min read

Java Captain

Nov 30, 2023 · Operations

Analysis of Didi's November 2023 System Outage and Potential Technical Causes

The article reviews Didi's late‑November 2023 service disruption, detailing the timeline of failures, official apologies, and expert analyses of six possible technical causes—including software bugs, server issues, third‑party failures, DDoS, other attacks, and ransomware—while highlighting the role of a Kubernetes upgrade and cost‑cutting pressures.

Cloud Native · Didi · Operations

0 likes · 7 min read

Ops Development Stories

Jun 6, 2023 · Operations

When Fancy PPTs Meet Real Outages: Lessons from a Major E‑commerce Crash

The article examines Vipshop's massive March 2023 outage caused by an IDC cooling failure, critiques superficial PPT‑driven reliability claims, and offers practical SRE insights on fault drills, true multi‑active architectures, and how ops teams can gain influence despite budget constraints.

Operations · SRE · fault tolerance

0 likes · 7 min read

dbaplus Community

Oct 25, 2022 · Operations

How a Government System’s Week‑Long Outage Exposed Critical Backend and Load‑Balancing Flaws

A government information system suffered a week of instability, including service deadlocks, Tomcat memory overflows, and load‑balancing failures, prompting a deep forensic analysis that uncovered database lock‑ups, faulty front‑end loops, inadequate monitoring, and misconfigured logging, leading to concrete remediation steps and lessons for future reliability.

Operations · incident analysis · load balancing

0 likes · 21 min read

Efficient Ops

Feb 10, 2022 · Operations

Why Did a Metaspace Misconfiguration Crash Our Elastic Cloud Service?

A production incident on an elastic‑cloud deployment revealed that setting the JVM Metaspace limit to 64 MiB, while the application required around 76 MiB, triggered continuous Full GC, causing stop‑the‑world pauses, service‑wide timeouts, and a costly rollback.

Elastic Cloud · GC · JVM

0 likes · 9 min read

ITPUB

Jan 19, 2022 · Databases

How a Startup Solved Midnight MySQL Timeouts: Slow‑SQL Diagnosis & Caching

During nightly peaks, a social‑e‑commerce startup experienced hour‑long service outages due to MySQL timeouts; by analyzing traffic spikes, CPU usage, and slow‑SQL logs, the team identified un‑cached ranking queries and a 20‑minute cache refresh bottleneck, then implemented targeted caching, monitoring scripts, and fallback static pages to eliminate the issue.

Caching · incident analysis · mysql

0 likes · 14 min read

Programmer DD

Dec 22, 2021 · Operations

Why Did Xi’an’s Health‑Code App Crash? A Deep Dive into the Failure

The article analyzes the Xi’an “Yima Tong” health‑code system outage, detailing the symptoms, root‑cause factors such as rate‑limiting gaps, server overload, architectural coupling, and ISP differences, and then offers short‑term, long‑term, design, high‑availability, and testing recommendations to prevent future crashes.

Cloud Native · incident analysis · performance

0 likes · 13 min read

Efficient Ops

Sep 23, 2021 · Operations

Why Did Our New Deployment Crash? Uncovering Metaspace‑Induced Full‑GC

The article recounts a staged rollout of the Maybach service on elastic cloud, details the timeline of successful and failing deployments, analyzes JVM metrics revealing excessive Metaspace usage that triggered continuous full garbage collections, and explains how this caused system‑wide timeouts and a half‑hour outage.

JVM · Metaspace · Operations

0 likes · 10 min read

Code Ape Tech Column

Jul 15, 2021 · Operations

What Really Caused Bilibili’s Sudden Outage? A Deep Dive into the Technical Failure

The article analyzes Bilibili's recent half‑hour service disruption, explores technical rumors such as an etcd crash, examines Kubernetes‑based cloud‑native infrastructure, reviews similar historic outages, and offers expert recommendations for improving high‑availability and disaster‑recovery in large‑scale internet services.

Bilibili · Cloud Native · Etcd

0 likes · 8 min read

Byte Quality Assurance Team

Dec 16, 2020 · Backend Development

Live Streaming Service Overload Incident Caused by Self-Referencing Push Configuration

A sudden surge in live‑stream traffic overloaded the core streaming service because a push configuration mistakenly pointed to the same stream URL, creating a self‑referencing loop that repeatedly generated duplicate streams until the service capacity was exhausted.

Live Streaming · backend bug · incident analysis

0 likes · 4 min read

dbaplus Community

Jun 24, 2019 · Operations

Why Did Our Payment System Auto‑Recover? A Deep Dive into Queue Backlog and Transaction Locks

A new employee at an OTA company faced a mysterious outage where thousands of payment‑related messages piled up in the queue, the system auto‑recovered, and a detailed investigation revealed a stuck MySQL transaction caused by missing response timeout settings, leading to lock contention and message backlog.

HttpClient · Message Queue · incident analysis

0 likes · 7 min read

Baidu Intelligent Testing

Apr 5, 2016 · Operations

Hot Reload: Common Pitfalls and How to Avoid Them

This article examines the hidden risks of hot‑reload mechanisms in web services, illustrates real incidents caused by careless configuration updates, analyzes root causes, and offers practical steps for detecting and fixing such pitfalls to improve operational reliability.

Software Operations · configuration management · hot-reload

0 likes · 7 min read

ITPUB

May 29, 2015 · Information Security

What Really Happened to Ctrip’s Database? A Technical Deep‑Dive into the Attack and Backup Risks

The article examines the Ctrip outage by analyzing observed symptoms, evaluating the likelihood of a large‑scale database attack versus node failures, and discussing how backup strategies and private‑cloud storage could affect data recovery in such a severe security incident.

Ctrip · backup strategy · cloud storage

0 likes · 7 min read
