Tagged articles

Outage

30 articles · Page 1 of 1

Jan 4, 2026 · Operations

How a Missed Domain Renewal Crashed Our Site for 2 Hours – Full DNS Outage Postmortem

At 3:07 AM on August 15, 2025, a critical alert reported that the entire site was inaccessible. The resulting 2‑hour outage affected 500k users and was caused by an expired domain that entered serverHold status. This postmortem details the detection, root‑cause analysis, emergency recovery steps, and long‑term remediation measures.

DNS · Outage · domain renewal

0 likes · 19 min read

Architect

Nov 24, 2025 · Operations

What Caused the Massive Cloudflare Outage on Nov 18 2025? A Deep Technical Breakdown

On the night of November 18 2025, Cloudflare suffered a three‑hour core failure that crippled roughly half of the internet, and this article details the timeline, global impact, root cause in a ClickHouse permission change, and the remediation steps taken to restore service.

Bot Management · CDN · Cloudflare

0 likes · 10 min read

Architect's Guide

Nov 20, 2025 · Operations

What Caused Cloudflare’s Half‑Internet Outage? A Deep Dive into the Technical Failure

Cloudflare suffered a massive multi‑hour outage that knocked offline popular sites and AI services, traced to a sudden traffic spike, a mis‑configured Rust‑based bot‑management module, and a database permission change that doubled a feature file size, overwhelming its routing software.

CDN · ClickHouse · Cloudflare

0 likes · 12 min read

dbaplus Community

Nov 19, 2025 · Operations

Why Did Cloudflare’s Global Outage Happen on Nov 18 2025? Inside the Bot Management Bug

On the night of November 18 2025, Cloudflare suffered a worldwide outage that crippled services like ChatGPT, X, Spotify, and major gaming platforms, and a detailed post‑mortem reveals that a ClickHouse permission change caused an oversized bot‑management configuration file to crash edge nodes.

Bot Management · CDN · ClickHouse

0 likes · 9 min read

Java Architect Essentials

Nov 18, 2025 · Operations

What Triggered the Massive Cloudflare Outage on Nov 18 2025?

On November 18 2025 Cloudflare suffered a widespread network failure that crippled its CDN, DNS, and reverse‑proxy services, causing global sites like ChatGPT and X to return 502‑504 errors before engineers applied temporary mitigations and gradually restored service.

CDN · Cloudflare · Outage

0 likes · 5 min read

DevOps Coach

Nov 11, 2025 · Cloud Computing

Why the US‑East‑1 AWS Outage Happened and How to Guard Against It

On October 19‑20 a massive AWS failure in the US‑East‑1 region crippled a large portion of the internet, exposing how a faulty internal monitoring tool, DynamoDB’s lack of cross‑region replication, and unchecked retry storms can cascade into a widespread outage, and offering concrete operational lessons for cloud teams.

AWS · Cloud Computing · DynamoDB

0 likes · 7 min read

Efficient Ops

Oct 29, 2025 · Operations

What Triggered the Massive AWS Outage and Its Global Ripple Effect?

In late October 2025, a DNS failure in AWS’s DynamoDB service triggered a cascade of outages across EC2, load balancers, and Lambda, causing a 14‑hour global disruption that impacted over 3,500 applications, while a simultaneous Taobao overload highlighted the challenges of scaling during traffic spikes.

AWS · Cloud · DynamoDB

0 likes · 8 min read

Java Tech Enthusiast

Apr 15, 2025 · Industry Insights

Why GitHub Suddenly Blocked Unauthenticated Users in China – What Developers Need to Know

GitHub’s recent outage, caused by an unexpected configuration change, prevented unauthenticated users in China from accessing the site from April 12 to 13 UTC, prompting widespread discussion and urging developers to consider the risks of relying solely on foreign code‑hosting platforms.

China · GitHub · Outage

0 likes · 3 min read

Efficient Ops

Dec 22, 2024 · Operations

What Caused OpenAI’s Massive Outage? Inside the Kubernetes Failure and Recovery

On December 11, OpenAI suffered a severe outage across ChatGPT, its API, and Sora due to a misconfigured telemetry service that overloaded Kubernetes control planes worldwide, prompting a cascade of failures and a coordinated recovery effort.

Incident Management · Kubernetes · OpenAI

0 likes · 8 min read

DevOps Operations Practice

Dec 16, 2024 · Cloud Native

Analysis of OpenAI's December 2024 Outage: Kubernetes Control Plane Overload and Mitigation

The December 11, 2024 OpenAI outage, caused by a misconfigured monitoring service that overloaded the Kubernetes control plane, led to a four‑hour service disruption and was resolved through cluster scaling, API blocking, and resource expansion, highlighting critical infrastructure risks for large‑scale cloud‑native operations.

Control Plane · Kubernetes · OpenAI

0 likes · 7 min read

IT Services Circle

Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search, prompting a two‑hour recovery effort, a brief free‑membership compensation, and highlighting the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

Disaster Recovery · NetEase Cloud Music · Operations

0 likes · 6 min read

Top Architecture Tech Stack

Aug 19, 2024 · Operations

Analysis of NetEase Cloud Music Outage: Causes and Data‑Center Migration Challenges

On August 19 the NetEase Cloud Music service suffered a major outage that was traced to a complex migration of its Hangzhou data center to Guizhou, highlighting large‑scale operational risks, technical debt, and strict continuity constraints for high‑traffic internet platforms.

Cloud Computing · Data Center Migration · NetEase Cloud Music

0 likes · 6 min read

Cognitive Technology Team

Aug 17, 2024 · Operations

GitHub Outage on August 14, 2024: Causes, Impact, and Recovery

On August 14, 2024, GitHub experienced a massive site-wide outage caused by a database infrastructure configuration change that disrupted traffic routing, leading to loss of database connections and affecting core services such as Pull Requests, Pages, Copilot, and the API, with full restoration confirmed later that evening.

Database · GitHub · Incident Management

0 likes · 2 min read

21CTO

Aug 15, 2024 · Operations

Why GitHub’s Massive Outage Happened: Database Infrastructure Rollback Explained

A detailed account of GitHub’s recent worldwide outage reveals that a rollback of database infrastructure changes caused widespread service failures across GitHub.com, Pages, Copilot, and the API, highlighting the challenges of stateful database reliability in large platforms.

GitHub · Incident Management · Operations

0 likes · 4 min read

ITPUB

May 7, 2024 · Operations

How Tiny Mistakes Turned Into Massive Outages: Real‑World Ops Incident Stories

A collection of firsthand accounts reveals how seemingly harmless actions—changing system time, mistyping a script name, accidental deletions, and reckless debugging—triggered large‑scale service disruptions, forced emergency rollbacks, and costly penalties, highlighting the high stakes of operational negligence.

Outage · incident response · system-administration

0 likes · 10 min read

Architecture Digest

Nov 14, 2023 · Cloud Computing

Alibaba Cloud Outage in November 2023 Affects Numerous Services and Regions

In November 2023, Alibaba Cloud suffered a massive outage that disrupted client, web, and mobile access for nearly eight hours, affecting a wide range of Alibaba products, over 300,000 enterprise customers, and numerous global data centers, highlighting the fragility of large-scale cloud services.

2023 incident · Alibaba · Alibaba Cloud

0 likes · 9 min read

21CTO

Nov 13, 2023 · Cloud Computing

Inside Alibaba Cloud’s Massive Singles’ Day Outage: What Went Wrong?

On November 12, Alibaba Cloud suffered a global-scale failure that disrupted dozens of Alibaba‑affiliated apps—including Taobao, DingTalk, and Aliyun Drive—affecting services from e‑commerce to campus laundry, prompting widespread frustration and sparking discussions about cloud reliability and potential migration to other providers.

Alibaba Cloud · Case Study · Outage

0 likes · 6 min read

IT Services Circle

Sep 12, 2023 · Operations

Microsoft Azure Sydney Data Center Outage: Causes, Impact, and Operational Lessons

A sudden power drop disabled cooling systems at Azure's Sydney data center, and insufficient on‑site staff compounded the problem. The resulting service disruptions lasted over 24 hours and highlighted critical operational lessons for cloud data‑center management.

Azure · Cloud Computing · Data Center

0 likes · 9 min read

Wukong Talks Architecture

Dec 26, 2022 · Operations

Alibaba Cloud Hong Kong Region Outage Postmortem – December 18, 2022

On December 18, 2022, Alibaba Cloud's Hong Kong Region experienced a severe cooling‑system failure that caused a 14‑hour outage of ECS, OSS, EBS, RDS and other services, prompting extensive emergency procedures, service impact analysis, and a detailed post‑mortem with improvement actions.

Alibaba Cloud · Cloud Computing · Incident Management

0 likes · 14 min read

Programmer DD

Dec 26, 2022 · Operations

Inside Alibaba Cloud Hong Kong Region C Outage: Timeline, Impact, and Lessons Learned

On December 18, 2022, Alibaba Cloud's Hong Kong Region Zone C suffered a massive service interruption—the longest in its operational history—prompting a detailed incident response, extensive service impact across compute, storage, and networking, and a thorough analysis that led to concrete infrastructure and communication improvements.

Alibaba Cloud · Incident Report · Operations

0 likes · 13 min read

21CTO

Feb 9, 2022 · Operations

Why Roblox’s Three‑Day Outage Happened: Consul Streaming Bug and BoltDB Design Flaw

Roblox’s detailed post‑mortem reveals that a three‑day outage was caused by a Consul streaming bug and a design flaw in BoltDB’s freelist, which together created CPU contention and latency spikes on its massive on‑premises infrastructure, leading the team to disable streaming, add a second data‑center, and redesign their architecture.

BoltDB · Consul · Outage

0 likes · 9 min read

Java Backend Technology

Feb 7, 2022 · Operations

Why Did the Internet Crash in 2021? 10 Major Outage Lessons

The article reviews ten significant 2021 internet outages—both domestic and international—analyzing their root causes, from server room power failures to configuration bugs, and highlights the operational lessons engineers can learn to improve system resilience.

Case Study · Cloud Computing · Operations

0 likes · 17 min read

21CTO

Oct 6, 2021 · Operations

Why Did Facebook’s Global Outage Happen? Inside the BGP and DNS Failures

Facebook experienced a six‑hour worldwide outage that knocked out its main site and services like Instagram, WhatsApp, Messenger and Oculus, and engineers later traced the incident to a misconfigured backbone router that broke BGP routing and DNS resolution, sparking conspiracy rumors about data leaks.

BGP · DNS · Facebook

0 likes · 7 min read

macrozheng

Jul 18, 2021 · Operations

Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures

In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.

Bilibili · CDN · High Availability

0 likes · 10 min read

21CTO

Jul 13, 2020 · Operations

Why Did GitHub Crash? Inside the July 2020 Outage and Its Root Causes

The July 13, 2020 GitHub outage, triggered by load‑balancer misconfiguration, a database connection error during partitioning, and a network‑config mistake, sparked worldwide developer panic, highlighted reliability concerns, and revealed challenges in scaling cloud infrastructure amid the pandemic.

Cloud Computing · GitHub · Outage

0 likes · 6 min read

Efficient Ops

Apr 16, 2020 · Operations

What Caused Cloudflare’s 4‑Hour Outage? Lessons on Cable Management and Process Clarity

A four‑hour Cloudflare outage was triggered by an unauthorized cable removal during a planned maintenance, compounded by unclear instructions and unlabeled wiring, highlighting the need for better cable management, clear operational procedures, and robust single‑point‑of‑failure mitigation.

Cloudflare · Network Reliability · Outage

0 likes · 3 min read

21CTO

Mar 30, 2020 · Cloud Computing

What Triggered the Massive Google Cloud Outage on March 26 2020?

On March 26, 2020, Google’s core services, including Search, Gmail, YouTube and G Suite, experienced a worldwide outage caused by a router failure in an Atlanta data center and a third‑party software bug that disrupted traffic across multiple regions, prompting detailed analysis from Google, DownDetector, ThousandEyes and other observers.

Google Cloud · Network Reliability · Outage

0 likes · 8 min read

ITPUB

Mar 27, 2020 · Information Security

Was GitHub Hacked? Inside the Suspected MITM Attack on GitHub

In late March, users in China reported errors accessing GitHub Pages and the main site. Investigations suggest a possible man‑in‑the‑middle attack on GitHub’s services, citing evidence such as a suspicious certificate issued to a QQ email address, network hijacking on port 443, and similar disruptions across major Chinese ISPs. The issue was resolved by mid‑afternoon.

China · Cyberattack · GitHub

0 likes · 5 min read

Java Backend Technology

Mar 15, 2020 · Databases

Why Did the Redis Official Site Crash? Inside the OOM Incident and Cheap Hosting Secrets

The Redis website experienced an unexpected outage caused by an OOM error due to insufficient memory on a low‑cost DigitalOcean droplet, and the maintainer quickly fixed it by upgrading the instance and limiting allkeys‑lru usage, revealing surprising details about the site’s infrastructure.

Database Hosting · DigitalOcean · Outage

0 likes · 3 min read

ITPUB

Feb 21, 2017 · Operations

How We Resolved a Sudden DNS Outage That Took Down Our Website and App

When a Saturday early-morning outage left the company’s website and mobile app inaccessible for many users, the team traced the issue to an unpaid domain causing DNS resolution failures, detailed the investigation steps, temporary fixes, and lessons learned about DNS processes and operational readiness.

DNS · Incident Management · Outage

0 likes · 13 min read
