Tagged articles
DevOps Coach
Nov 11, 2025 · Cloud Computing

Why the US‑East‑1 AWS Outage Happened and How to Guard Against It

On October 19‑20, a massive AWS failure in the US‑East‑1 region crippled a large portion of the internet. The article explains how a faulty internal monitoring tool, DynamoDB’s lack of cross‑region replication, and unchecked retry storms can cascade into a widespread outage, and draws concrete operational lessons for cloud teams.

AWS · DynamoDB · Outage
0 likes · 7 min read
Efficient Ops
Oct 29, 2025 · Operations

What Triggered the Massive AWS Outage and Its Global Ripple Effect?

In late October 2025, a DNS failure in AWS’s DynamoDB service triggered a cascade of outages across EC2, load balancers, and Lambda, causing a 14‑hour global disruption that impacted over 3,500 applications, while a simultaneous Taobao overload highlighted the challenges of scaling during traffic spikes.

AWS · DynamoDB · Outage
0 likes · 8 min read
DevOps Operations Practice
Dec 16, 2024 · Cloud Native

Analysis of OpenAI's December 2024 Outage: Kubernetes Control Plane Overload and Mitigation

The December 11, 2024 OpenAI outage, caused by a misconfigured monitoring service that overloaded the Kubernetes control plane, led to a four‑hour service disruption and was resolved through cluster scaling, API blocking, and resource expansion, highlighting critical infrastructure risks for large‑scale cloud‑native operations.

Control Plane · Kubernetes · OpenAI
0 likes · 7 min read
IT Services Circle
Aug 21, 2024 · Operations

Analysis of NetEase Cloud Music Outage on August 19: Infrastructure Failure and Operational Lessons

On August 19, NetEase Cloud Music suffered a severe infrastructure‑related outage that prevented user login, playlist loading, and song search, prompting a two‑hour recovery effort, a brief free‑membership compensation, and highlighting the critical role of proper change management, gray releases, disaster recovery, and cross‑functional coordination in large‑scale services.

Infrastructure · NetEase Cloud Music · Operations
0 likes · 6 min read
Cognitive Technology Team
Aug 17, 2024 · Operations

GitHub Outage on August 14, 2024: Causes, Impact, and Recovery

On August 14, 2024, GitHub experienced a massive site-wide outage caused by a database infrastructure configuration change that disrupted traffic routing, leading to loss of database connections and affecting core services such as Pull Requests, Pages, Copilot, and the API, with full restoration confirmed later that evening.

GitHub · Outage · database
0 likes · 2 min read
21CTO
Aug 15, 2024 · Operations

Why GitHub’s Massive Outage Happened: Database Infrastructure Rollback Explained

A detailed account of GitHub’s recent worldwide outage reveals that a rollback of database infrastructure changes caused widespread service failures across GitHub.com, Pages, Copilot, and the API, highlighting the challenges of stateful database reliability in large platforms.

GitHub · Operations · Outage
0 likes · 4 min read
ITPUB
May 7, 2024 · Operations

How Tiny Mistakes Turned Into Massive Outages: Real‑World Ops Incident Stories

A collection of firsthand accounts reveals how seemingly harmless actions—changing system time, mistyping a script name, accidental deletions, and reckless debugging—triggered large‑scale service disruptions, forced emergency rollbacks, and costly penalties, highlighting the high stakes of operational negligence.

Outage · System Administration · incident response
0 likes · 10 min read
Architecture Digest
Nov 14, 2023 · Cloud Computing

Alibaba Cloud Outage in November 2023 Affects Numerous Services and Regions

In November 2023, Alibaba Cloud suffered a massive outage that disrupted client, web, and mobile access for nearly eight hours, affecting a wide range of Alibaba products, over 300,000 enterprise customers, and numerous global data centers, highlighting the fragility of large-scale cloud services.

2023 incident · Alibaba · Alibaba Cloud
0 likes · 9 min read
21CTO
Nov 13, 2023 · Cloud Computing

Inside Alibaba Cloud’s Massive Singles’ Day Outage: What Went Wrong?

On November 12, Alibaba Cloud suffered a global-scale failure that disrupted dozens of Alibaba‑affiliated apps—including Taobao, DingTalk, and Aliyun Drive—affecting services from e‑commerce to campus laundry, prompting widespread frustration and sparking discussions about cloud reliability and potential migration to other providers.

Alibaba Cloud · Outage · case study
0 likes · 6 min read
Programmer DD
Dec 26, 2022 · Operations

Inside Alibaba Cloud Hong Kong Region C Outage: Timeline, Impact, and Lessons Learned

On December 18, 2022, Alibaba Cloud's Hong Kong Region Zone C suffered a massive service interruption—the longest in its operational history—prompting a detailed incident response, extensive service impact across compute, storage, and networking, and a thorough analysis that led to concrete infrastructure and communication improvements.

Alibaba Cloud · Incident Report · Infrastructure
0 likes · 13 min read
21CTO
Feb 9, 2022 · Operations

Why Roblox’s Three‑Day Outage Happened: Consul Streaming Bug and BoltDB Design Flaw

Roblox’s detailed post‑mortem reveals that a three‑day outage was caused by a Consul streaming bug and a design flaw in BoltDB’s freelist, which together created CPU contention and latency spikes on its massive on‑premises infrastructure, leading the team to disable streaming, add a second data‑center, and redesign their architecture.

BoltDB · Consul · Infrastructure
0 likes · 9 min read
Java Backend Technology
Feb 7, 2022 · Operations

Why Did the Internet Crash in 2021? 10 Major Outage Lessons

The article reviews ten significant 2021 internet outages—both domestic and international—analyzing their root causes, from server room power failures to configuration bugs, and highlights the operational lessons engineers can learn to improve system resilience.

Operations · Outage · case study
0 likes · 17 min read
21CTO
Oct 6, 2021 · Operations

Why Did Facebook’s Global Outage Happen? Inside the BGP and DNS Failures

Facebook experienced a six‑hour worldwide outage that knocked out its main site and services like Instagram, WhatsApp, Messenger and Oculus. Engineers later traced the incident to a misconfigured backbone router that broke BGP routing and DNS resolution. The outage also sparked conspiracy rumors about data leaks.

BGP · DNS · Facebook
0 likes · 7 min read
macrozheng
Jul 18, 2021 · Operations

Why Did Bilibili Crash? A Developer’s Deep Dive into High‑Availability Failures

In this article, a programmer recounts the recent Bilibili outage, analyzes its timeline, proposes technical root‑cause hypotheses such as CDN failure and service‑chain avalanche, shares insights from the platform’s high‑availability architecture, and outlines preventive techniques for building more resilient backend systems.

Bilibili · CDN · Operations
0 likes · 10 min read
21CTO
Jul 13, 2020 · Operations

Why Did GitHub Crash? Inside the July 2020 Outage and Its Root Causes

The July 13, 2020 GitHub outage, triggered by load‑balancer misconfiguration, a database connection error during partitioning, and a network‑config mistake, sparked worldwide developer panic, highlighted reliability concerns, and revealed challenges in scaling cloud infrastructure amid the pandemic.

GitHub · Infrastructure · Outage
0 likes · 6 min read
21CTO
Mar 30, 2020 · Cloud Computing

What Triggered the Massive Google Cloud Outage on March 26 2020?

On March 26, 2020, Google’s core services, including Search, Gmail, YouTube and G Suite, experienced a worldwide outage attributed to a router failure in an Atlanta data center and a third‑party software bug that disrupted traffic across multiple regions, prompting detailed analysis from Google, DownDetector, ThousandEyes and other observers.

Google Cloud · Network Reliability · Outage
0 likes · 8 min read
ITPUB
Mar 27, 2020 · Information Security

Was GitHub Hacked? Inside the Suspected MITM Attack on GitHub

In late March, users in China reported errors accessing GitHub Pages and the main site. Investigations suggest a possible man‑in‑the‑middle attack on GitHub’s services, with evidence including a suspicious certificate issued to a QQ email address, network hijacking on port 443, and similar disruptions across major Chinese ISPs. The issue was resolved by mid‑afternoon.

China · Cyberattack · GitHub
0 likes · 5 min read
ITPUB
Feb 21, 2017 · Operations

How We Resolved a Sudden DNS Outage That Took Down Our Website and App

When an early-morning outage on a Saturday left the company’s website and mobile app inaccessible for many users, the team traced the issue to an unpaid domain that caused DNS resolution failures. The article details the investigation steps, the temporary fixes, and the lessons learned about DNS processes and operational readiness.

DNS · Outage · incident management
0 likes · 13 min read