Tagged articles
25 articles
Ops Development Stories
Dec 31, 2025 · Operations

12 Major 2025 Internet Outages: What Every Ops Team Can Learn

This article compiles twelve high‑profile internet service failures from 2025, detailing each incident’s description, micro‑scenario, technical root cause, and risk perspective, and extracts actionable lessons on infrastructure resilience, change management, and security‑aware operations.

Internet Outages · Operations · Reliability
0 likes · 20 min read
Tech Freedom Circle
Aug 20, 2025 · Backend Development

P0 Eureka Service Discovery Collapse Cost a Top E‑commerce $120M During Double‑11

During the Double‑11 shopping festival, a leading e‑commerce platform suffered a P0 outage when its Eureka service‑discovery cluster overloaded, triggering a full‑chain failure that lasted 2 hours 42 minutes and caused losses exceeding 1.2 billion yuan; the article dissects the timeline, root causes, capacity mis‑planning, monitoring gaps, and remediation strategies.

Java · Microservices · capacity planning
0 likes · 34 min read
Open Source Linux
Jan 18, 2025 · Operations

What Caused Alipay’s 5‑Minute P0 Outage and How Much Was Lost?

The article dissects Alipay’s rare P0 incident on January 16, 2025, explaining how a misconfigured marketing template triggered a 20% discount for all transactions, detailing the rapid five‑minute fix, estimating the financial loss at roughly 14 million yuan, and outlining operational lessons and accountability.

Operations · deployment risk · financial loss
0 likes · 11 min read
System Architect Go
Dec 19, 2024 · Operations

Why Did OpenAI’s New Telemetry Crash Their Kubernetes Cluster?

On December 11, 2024, OpenAI’s Kubernetes cluster suffered a four‑hour outage after a newly deployed telemetry service generated massive API traffic from every node, overwhelming the kube‑apiserver and breaking DNS‑based service discovery. The incident exposed gaps in control‑plane monitoring and break‑glass mechanisms and prompted critical questions about component behavior and configuration.

API overload · Control Plane · DNS
0 likes · 8 min read
FunTester
Nov 13, 2024 · Industry Insights

What Caused Alipay’s Double‑11 Outage? Inside the System Message Library Failure

On Double 11, Alipay suffered a multi‑hour outage that prevented payments, Yu'e Bao withdrawals, and Huabei repayments, prompting an official apology that blamed a partial failure of its system message library, a critical database for storing and routing system messages. The article examines the hardware, software, network, and data factors behind the incident.

Alipay · System Message Library · Technology Failures
0 likes · 8 min read
Programmer DD
Aug 16, 2024 · Backend Development

How a Hidden Uint Overflow Triggered Massive Traffic Spikes and the Memory‑Leak Mystery I Solved

This article recounts a developer's journey from a fresh graduate to a senior backend engineer, detailing two real‑world incidents—a pseudo‑memory‑leak in a C++ service and a uint overflow that caused traffic bursts—showing the analysis steps, code fixes, and lessons learned for reliable backend development.

C++ · Performance Optimization · incident analysis
0 likes · 19 min read
DevOps Operations Practice
May 20, 2024 · Cloud Computing

Google Cloud Data Deletion Incident at UniSuper: Causes, Impact, and Lessons Learned

Google Cloud mistakenly deleted data and backups for Australian pension fund UniSuper, causing over 600,000 members to lose access for more than a week, and the incident highlights the risks of single‑provider reliance, the importance of robust backup strategies, and the growing relevance of hybrid and multi‑cloud architectures.

Backup · Data loss · Google Cloud
0 likes · 5 min read
dbaplus Community
Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced event intelligent analysis system that automates fault identification, analysis, and remediation in data‑center operations, detailing its architecture, integration with monitoring, CMDB, ITSM, big‑data platforms, and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Automation · Big Data
0 likes · 18 min read
Su San Talks Tech
Dec 6, 2023 · Operations

What Went Wrong in Didi’s 12‑Hour Outage? Lessons on Kubernetes Upgrades and Cost‑Cutting

An in‑depth review of Didi’s 12‑hour P0 outage reveals how a mistaken Kubernetes version downgrade during an in‑place upgrade caused master node failure, discusses cluster isolation, upgrade strategies, and the role of cost‑cutting pressures, offering practical lessons for large‑scale operations.

Cluster Upgrade · Cost Management · Kubernetes
0 likes · 7 min read
Java Captain
Nov 30, 2023 · Operations

Analysis of Didi's November 2023 System Outage and Potential Technical Causes

The article reviews Didi's late‑November 2023 service disruption, detailing the timeline of failures, official apologies, and expert analyses of six possible technical causes—including software bugs, server issues, third‑party failures, DDoS, other attacks, and ransomware—while highlighting the role of a Kubernetes upgrade and cost‑cutting pressures.

Cloud Native · Didi · Operations
0 likes · 7 min read
dbaplus Community
Oct 25, 2022 · Operations

How a Government System’s Week‑Long Outage Exposed Critical Backend and Load‑Balancing Flaws

A government information system suffered a week of instability, including service deadlocks, Tomcat memory overflows, and load‑balancing failures, prompting a deep forensic analysis that uncovered database lock‑ups, faulty front‑end loops, inadequate monitoring, and misconfigured logging, leading to concrete remediation steps and lessons for future reliability.

Operations · Tomcat · incident analysis
0 likes · 21 min read
Efficient Ops
Feb 10, 2022 · Operations

Why Did a Metaspace Misconfiguration Crash Our Elastic Cloud Service?

A production incident on an elastic‑cloud deployment revealed that setting the JVM Metaspace limit to 64 MiB, while the application required around 76 MiB, triggered continuous Full GC, causing stop‑the‑world pauses, system‑wide timeouts, and a costly rollback.

Elastic Cloud · JVM · Metaspace
0 likes · 9 min read
ITPUB
Jan 19, 2022 · Databases

How a Startup Solved Midnight MySQL Timeouts: Slow‑SQL Diagnosis & Caching

During nightly peaks, a social‑e‑commerce startup experienced hour‑long service outages due to MySQL timeouts; by analyzing traffic spikes, CPU usage, and slow‑SQL logs, the team identified un‑cached ranking queries and a 20‑minute cache refresh bottleneck, then implemented targeted caching, monitoring scripts, and fallback static pages to eliminate the issue.

caching · incident analysis · mysql
0 likes · 14 min read
Programmer DD
Dec 22, 2021 · Operations

Why Did Xi’an’s Health‑Code App Crash? A Deep Dive into the Failure

The article analyzes the Xi’an “Yima Tong” health‑code system outage, detailing the symptoms, root‑cause factors such as rate‑limiting gaps, server overload, architectural coupling, and ISP differences, and then offers short‑term, long‑term, design, high‑availability, and testing recommendations to prevent future crashes.

Cloud Native · incident analysis · performance
0 likes · 13 min read
Efficient Ops
Sep 23, 2021 · Operations

Why Did Our New Deployment Crash? Uncovering Metaspace‑Induced Full‑GC

The article recounts a staged rollout of the Maybach service on elastic cloud, details the timeline of successful and failing deployments, analyzes JVM metrics revealing excessive Metaspace usage that triggered continuous full garbage collections, and explains how this caused system‑wide timeouts and a half‑hour outage.

Full GC · JVM · Metaspace
0 likes · 10 min read
Code Ape Tech Column
Jul 15, 2021 · Operations

What Really Caused Bilibili’s Sudden Outage? A Deep Dive into the Technical Failure

The article analyzes Bilibili's recent half‑hour service disruption, explores technical rumors such as an etcd crash, examines Kubernetes‑based cloud‑native infrastructure, reviews similar historic outages, and offers expert recommendations for improving high‑availability and disaster‑recovery in large‑scale internet services.

Bilibili · Cloud Native · Kubernetes
0 likes · 8 min read
dbaplus Community
Jun 24, 2019 · Operations

Why Did Our Payment System Auto‑Recover? A Deep Dive into Queue Backlog and Transaction Locks

A new employee at an OTA company faced a mysterious outage in which thousands of payment‑related messages piled up in the queue before the system recovered on its own. A detailed investigation traced the problem to a stuck MySQL transaction caused by missing response timeout settings, which led to lock contention and the message backlog.

HttpClient · Message Queue · incident analysis
0 likes · 7 min read
Baidu Intelligent Testing
Apr 5, 2016 · Operations

Hot Reload: Common Pitfalls and How to Avoid Them

This article examines the hidden risks of hot‑reload mechanisms in web services, illustrates real incidents caused by careless configuration updates, analyzes root causes, and offers practical steps for detecting and fixing such pitfalls to improve operational reliability.

Configuration Management · Software Operations · hot-reload
0 likes · 7 min read