
What Triggered the Biggest Internet Outages of 2021? Lessons from 10 Major Incidents

A review of ten major internet outages in 2021, from Chinese platforms such as Bilibili and Futu to global services such as Facebook, Roblox, and AWS. It examines their root causes, the role of infrastructure design, and the operational lessons for improving system resilience.

Programmer DD

In 2021, despite expectations that modern internet services could achieve "never‑down" reliability, a series of high‑profile outages demonstrated that system failures remain common and often stem from human error, infrastructure design flaws, or external disruptions.

Domestic Outages in China: Transparency as a Skill

Bilibili crash leaves young users sleepless

On July 13, a server failure at Bilibili left users unable to log in. Many moved to other platforms, and the outage trended on social media. Bilibili's brief statement that "some server rooms failed" offered little insight into the cause.

Futu Securities service interruption and a 2,000‑word technical apology

On October 9, Futu's trading app went down after a power outage in an operator's data center caused network failures across multiple data centers. Founder Li Hua later published a detailed 2,000-word post explaining the redundancy design, the trade-off between performance and fault tolerance, and how a power issue at one IDC became a single point of failure.
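The lesson about a hidden single point of failure can be illustrated with a small check. This is a minimal sketch, not Futu's tooling: the function name `shared_dependencies`, the site names, and the dependency labels are all hypothetical. The idea is that a set of replicas is only as redundant as the dependencies they do not share, so any dependency common to every replica deserves scrutiny.

```python
from __future__ import annotations


def shared_dependencies(replicas: dict[str, set[str]]) -> set[str]:
    """Return the dependencies that every replica relies on.

    Anything in the result is a candidate single point of failure:
    losing it affects all replicas at once, however many there are.
    """
    if not replicas:
        return set()
    return set.intersection(*replicas.values())


# Hypothetical topology: three sites with separate power and switching
# that all reach the network through the same operator facility.
topology = {
    "site-a": {"power-a", "switch-a", "operator-idc-1"},
    "site-b": {"power-b", "switch-b", "operator-idc-1"},
    "site-c": {"power-c", "switch-c", "operator-idc-1"},
}

print(shared_dependencies(topology))  # {'operator-idc-1'}
```

In practice the hard part is building an accurate dependency map. The intersection itself is trivial once that map exists.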

Xi'an "One‑Code‑Pass" collapses twice in half a month

Heavy pandemic-related traffic overwhelmed Xi'an's One-Code-Pass health-code platform in December 2021 and again on January 4, 2022. The service was unavailable both times, and authorities called for capacity expansion.

International Outages: Small Bugs, Big Trouble

Facebook's worst outage ever, wiping tens of billions of dollars off its market value

On October 4, a routine check of backbone network capacity inadvertently cut all of Facebook's backbone connections, leaving more than 3 billion users offline for nearly seven hours and triggering a steep drop in market value.

Roblox suffers a 73‑hour outage due to a Consul bug

Roblox's self-hosted data centers use Consul for service discovery. Enabling Consul's streaming feature exposed a bug that degraded performance and took the platform down. About 54 hours passed before the feature was disabled, and the outage lasted 73 hours in total.

Salesforce engineer’s shortcut triggers a global outage

On May 11, an engineer pushed a DNS configuration change with a script that timed out, and the faulty change propagated across data centers, causing a five-hour service disruption for millions of users.
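One common guard against a bad change spreading everywhere at once is a staged rollout that stops at the first unhealthy site. The sketch below is illustrative only and does not describe Salesforce's actual tooling: `staged_rollout`, `apply_change`, `is_healthy`, and the site names are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Iterable


def staged_rollout(
    sites: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Apply a change one site at a time and stop at the first unhealthy site.

    Returns the sites that were changed, including the one that failed
    its health check, so the caller knows what to roll back.
    """
    changed: list[str] = []
    for site in sites:
        apply_change(site)
        changed.append(site)
        if not is_healthy(site):
            break  # contain the damage: later sites are never touched
    return changed


# Hypothetical run: the change breaks the first data center it reaches.
applied: list[str] = []
touched = staged_rollout(
    ["dc-1", "dc-2", "dc-3"],
    apply_change=applied.append,
    is_healthy=lambda site: site != "dc-1",
)
print(touched)  # ['dc-1']
```

The blast radius here is one data center instead of all three. A shortcut that skips the per-site health check gives up exactly that protection.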

Cloud‑Provider Failures: Massive Blast Radii

OVH data‑center fire disables 3.6 million websites

On March 10, a fire at OVH's Strasbourg campus destroyed the SBG2 data center and damaged another, taking down 3.6 million websites across 464,000 domains, including government and cryptocurrency services.

Fastly misconfiguration causes a global CDN outage

On June 8, a service-configuration change triggered 503 errors worldwide, affecting major sites such as Amazon, Twitter, and the New York Times for about an hour.

Google Cloud outage due to GCLB configuration bug

On November 16, an incorrectly configured external load balancer caused a two‑hour outage that impacted services like YouTube, Gmail, and many enterprise customers.

AWS experiences three separate outages in December

Network overload, automated traffic‑shifting errors, and a data‑center power issue led to multiple service disruptions affecting Netflix, Slack, Coinbase, and many other platforms.

Tags: system reliability, incident response, cloud infrastructure, outage analysis
Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
