
Why Did the Internet Crash in 2021? 10 Major Outage Lessons

The article reviews ten significant 2021 internet outages—both domestic and international—analyzing their root causes, from server room power failures to configuration bugs, and highlights the operational lessons engineers can learn to improve system resilience.

Java Backend Technology

Feb 7, 2022


2021 Major Internet Outages Overview

In 2021, despite expectations of "never‑down" services, major internet outages persisted, exposing the growing risk of complex, large‑scale systems. The following ten incidents illustrate common failure patterns and the operational safeguards that were lacking.

Domestic Outages

Bilibili Crash

On July 13, a server-room failure at the video platform Bilibili prevented users from logging in. Several other sites reported problems around the same time, and social media filled with alarmed posts. Bilibili's brief statement cited "partial server room faults" without further technical detail.

Futu Securities Service Interruption

On October 9, a momentary power loss at a carrier's data center caused network failures across several data centers used by the fintech app Futu. Founder Li Hua later published a roughly 2,000-word technical post on the redundancy design, acknowledging that the high-performance redundancy scheme the team had chosen left one IDC as a single point of failure.
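A hidden single point of this kind can be checked for mechanically. The sketch below is a hypothetical illustration, not Futu's actual tooling: the service names, IDC names, and the `single_idc_risks` helper are all assumptions. Given a mapping from each service to the data centers that host its replicas, it reports which services would go down entirely if one IDC failed.

```python
from collections import defaultdict


def single_idc_risks(placement):
    """Return {idc: [services]} for services whose replicas all sit in one IDC.

    `placement` maps a service name to the list of data centers (IDCs)
    hosting its replicas. All names here are hypothetical.
    """
    risks = defaultdict(list)
    for service, idcs in placement.items():
        distinct = set(idcs)
        # A service is exposed when every replica lives in the same IDC.
        if len(distinct) == 1:
            risks[next(iter(distinct))].append(service)
    return {idc: sorted(services) for idc, services in risks.items()}


# Hypothetical layout: two replicas do not help if they share one IDC.
placement = {
    "quotes": ["idc-a", "idc-a"],
    "login": ["idc-a", "idc-b"],
    "orders": ["idc-b", "idc-c"],
}
print(single_idc_risks(placement))
```

In this made-up layout only "quotes" is flagged: it has two replicas, but both depend on the same facility.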

Xi'an “One Code” System Failures

In December 2021 and January 2022, Xi'an’s pandemic health‑code system "One Code" crashed twice due to massive traffic spikes from mandatory QR‑code checks and mass testing, overwhelming the platform’s capacity and prompting officials to advise citizens to avoid unnecessary scans.

Yuekang Code Issue

On January 10, 2022, the Guangdong health-code app "Yuekang" saw a traffic surge that triggered its protection mechanism, causing slow or failed access. Platform monitoring detected abnormal traffic at 8:31 am, peaking at 1.4 million requests per minute; the issue was partially mitigated by 9:04 am and service was fully restored by 9:56 am, about 85 minutes after detection.
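To put the peak in perspective, 1.4 million requests per minute is roughly 23,000 requests per second. The sketch below shows one common form of protection mechanism, a token-bucket rate limiter that sheds excess requests instead of queueing them. It is an illustration only: the article does not say how Yuekang's protection works, and the 10,000 requests-per-second limit and the `TokenBucket` class are assumptions.

```python
class TokenBucket:
    """Admit requests while tokens remain; refill at a fixed rate per second."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request instead of queueing it


PEAK_PER_MINUTE = 1_400_000          # figure quoted in the article
peak_per_sec = PEAK_PER_MINUTE / 60  # about 23,333 requests per second

# Hypothetical limit: admit 10,000 requests/s with a burst of 10,000.
bucket = TokenBucket(rate_per_sec=10_000, capacity=10_000)

# Simulate one second of evenly spaced peak traffic.
n = round(peak_per_sec)
admitted = sum(bucket.allow(now=i / n) for i in range(n))
print(f"peak is about {peak_per_sec:,.0f} req/s; admitted {admitted} of {n}")
```

The time is passed in explicitly so the simulation is deterministic; a production limiter would read a monotonic clock instead.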

International Outages

Facebook, Instagram, WhatsApp Outage

On October 4, a routine network-capacity test inadvertently cut the backbone connections between Facebook's global data centers, leaving over 3 billion users without service for nearly seven hours and wiping roughly $47.3 billion from Facebook's market value.

Roblox Long‑Running Outage

From October 28, Roblox suffered a 73‑hour outage caused by a bug in Consul’s streaming‑transport feature, which degraded performance and forced the team to disable the feature before services could be restored.

Salesforce Bug‑Induced Outage

On May 11, a script applying a DNS configuration change timed out partway through, and the faulty change propagated across Salesforce's data centers, causing a roughly five-hour service disruption until it was rolled back.
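A common safeguard against a bad change spreading everywhere at once is a staged rollout: apply the change to one site, verify it, and only then move on. The sketch below is a minimal illustration under stated assumptions. The site names, the `staged_rollout` helper, and the health check are hypothetical and do not reflect Salesforce's real tooling.

```python
def staged_rollout(sites, apply_change, is_healthy, rollback):
    """Apply a change one site at a time; stop and undo on the first failure.

    Returns (succeeded, sites_changed). All callables are supplied by the
    caller; nothing here reflects any vendor's real tooling.
    """
    changed = []
    for site in sites:
        apply_change(site)
        changed.append(site)
        if not is_healthy(site):
            # Undo in reverse order so the last-touched site recovers first.
            for done in reversed(changed):
                rollback(done)
            return False, []
    return True, changed


# Hypothetical demo: the change breaks DNS at the second site.
state = {"dc1": "old", "dc2": "old", "dc3": "old"}


def apply_change(site):
    state[site] = "new"


def rollback(site):
    state[site] = "old"


def is_healthy(site):
    return site != "dc2"  # pretend dc2 fails its post-change check


ok, changed = staged_rollout(["dc1", "dc2", "dc3"], apply_change, is_healthy, rollback)
print(ok, changed, state)
```

In the demo the rollout stops at the second site, both touched sites are restored, and the third site is never changed.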

OVH Data Center Fire

In March, a fire destroyed the SBG2 data center of French cloud provider OVH, taking down approximately 3.6 million websites across 464,000 domains and affecting customers such as the European Space Agency, a cryptocurrency exchange, and various government portals.

Fastly CDN Incident

On June 8, a customer configuration change at CDN provider Fastly triggered a latent software bug, setting off a global wave of 503 errors that hit major sites such as Amazon, Twitter, and the New York Times for about an hour until the configuration was disabled.

Google Cloud Outage

On November 16, an erroneous external proxy load‑balancer configuration caused Google Cloud services to fail for two hours, affecting customers such as Home Depot, Spotify, and Google’s own products like YouTube and Gmail.

AWS Multiple Outages in December

In the last month of 2021, AWS experienced three separate incidents: network congestion caused by unexpected client behavior, an automated traffic-routing error, and a data-center power failure. Together they disrupted major platforms such as Netflix, Slack, and Coinbase.

These cases collectively demonstrate that even the most robust architectures can succumb to human error, insufficient redundancy, or unexpected traffic spikes, underscoring the importance of rigorous operational practices and disaster‑recovery planning.


