
What Caused the Biggest 2021 Outages? Lessons from Bilibili, Facebook, AWS, and More

This article reviews ten major 2021 service outages, from Chinese platforms such as Bilibili and Futu Securities to global giants including Facebook, Roblox, and AWS, analyzing their root causes, redundancy failures, and the operational lessons needed to prevent future black-swan failures.

In 2021, numerous high‑profile internet services suffered severe outages, highlighting the growing risk of “black‑swan” failures in increasingly complex systems.

Domestic incidents

Bilibili crash: On July 13, the video platform suffered a server-room fault that prevented users from logging in, sparking widespread panic and even a fire-department inquiry. The company's brief statement, "partial server-room failure," offered little insight.

Futu Securities outage: On October 9, a momentary power interruption in an ISP data center disrupted networking across multiple server rooms. Founder Li Hua published a 2,000-word technical post detailing multi-region redundancy, dual-IDC designs, and the trade-off between performance and fault tolerance.
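
The dual-IDC idea is straightforward at the client level: prefer the nearest room for latency, but keep a remote room warm so a single-room failure degrades performance instead of availability. Below is a minimal Python sketch of that failover order; the endpoints, timeouts, and retry counts are hypothetical illustrations, not Futu's actual design.

```python
import socket
import time

# Hypothetical dual-IDC endpoints; the ordering encodes the trade-off:
# the nearby room is tried first for latency, the remote room is a
# warm standby that survives a room-level fault.
ENDPOINTS = [
    ("idc-primary.example.com", 443),
    ("idc-backup.example.com", 443),
]

def connect_with_failover(endpoints, timeout=2.0, sweeps=2):
    """Try each IDC in preference order; sweep the list again on failure."""
    last_error = None
    for attempt in range(sweeps):
        for host, port in endpoints:
            try:
                return socket.create_connection((host, port), timeout=timeout)
            except OSError as exc:
                last_error = exc
        time.sleep(0.5 * (attempt + 1))  # brief pause before the next sweep
    raise ConnectionError(f"all IDCs unreachable: {last_error}")
```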

Xi'an "One-Code" platform: Overloaded by pandemic-related scans, the health-code system crashed twice (December 2021 and January 2022), prompting authorities to ask residents to display their codes only when necessary and to expand network capacity.
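
A standard defense against this kind of scan surge is load shedding: reject excess requests quickly rather than queuing until the backend collapses. Below is a minimal token-bucket limiter in Python; the rates are hypothetical and the pattern is generic, not the platform's actual fix.

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: shed excess verification requests
    instead of letting a surge saturate the backend."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec     # steady-state requests per second
        self.capacity = burst        # spike size we are willing to absorb
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller answers "busy, retry shortly" instead of queuing

limiter = TokenBucket(rate_per_sec=5000, burst=10000)  # hypothetical capacity
if not limiter.allow():
    print("shed: ask the client to retry in a moment")
```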

International incidents

Facebook, Instagram, WhatsApp outage: A routine maintenance command unintentionally cut all backbone links between Facebook's data centers, resulting in a seven-hour outage that erased roughly $47 billion in market value.
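
Facebook's postmortem notes that an audit tool was supposed to block the command but contained a bug. The intended invariant is easy to state: never approve a change that would leave zero healthy backbone links. A minimal illustrative sketch in Python follows; the link names and function are hypothetical, not Facebook's tooling.

```python
def safe_to_apply(current_links, links_to_disable):
    """Pre-flight guard for network maintenance (illustrative sketch).

    Refuse any change that would disconnect every backbone link,
    which is the check the buggy audit tool failed to enforce.
    """
    remaining = set(current_links) - set(links_to_disable)
    if not remaining:
        raise RuntimeError(
            "refusing change: it would disconnect every backbone link"
        )
    return True

# Example: a command that disables all links should be rejected, not run.
links = {"bb-link-1", "bb-link-2", "bb-link-3"}
try:
    safe_to_apply(links, links)
except RuntimeError as err:
    print(err)
```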

Roblox long-duration outage: A bug in Consul's streaming transport caused a 73-hour service disruption; the company later disabled the feature and discussed why it keeps critical workloads on-premises rather than fully migrating to public cloud.
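
The mitigation pattern here is a runtime kill switch: serve traffic through the new code path by default, but let operators fall back to the older, proven path without a redeploy. The Python sketch below shows the generic pattern only; it is not Consul's actual API or Roblox's code.

```python
import threading

class TransportSelector:
    """Generic runtime kill switch (illustrative, not Consul's API):
    reads go over the new streaming path by default, but operators can
    flip back to plain long polling without redeploying clients."""

    def __init__(self):
        self._streaming_enabled = True
        self._lock = threading.Lock()

    def disable_streaming(self):
        # Flipped by an operator when the new code path misbehaves.
        with self._lock:
            self._streaming_enabled = False

    def fetch(self, key):
        with self._lock:
            streaming = self._streaming_enabled
        return self._fetch_streaming(key) if streaming else self._fetch_long_poll(key)

    def _fetch_streaming(self, key):
        return f"{key} via streaming"   # placeholder for the new transport

    def _fetch_long_poll(self, key):
        return f"{key} via long poll"   # placeholder for the proven fallback

selector = TransportSelector()
selector.disable_streaming()            # the emergency lever
print(selector.fetch("service/health"))
```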

Salesforce incident: A mis-executed DNS configuration script timed out, propagating across data centers and causing a five-hour outage; the responsible engineer was disciplined.
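
The safeguard reportedly bypassed in this incident was a staggered rollout: apply the change to one site, verify it, and halt on the first failure or timeout instead of pushing everywhere at once. A minimal Python sketch, with hypothetical site names and commands:

```python
import subprocess

SITES = ["dc-use1", "dc-usw2", "dc-euc1"]   # hypothetical site names

def staged_rollout(sites, apply_cmd, verify_cmd, timeout=60):
    """Apply a change one site at a time; halt at the first failure.

    A timeout mid-change is treated as a failure rather than ignored:
    per the postmortem, the DNS script timed out partway through and
    the bad change still propagated.
    """
    for site in sites:
        try:
            subprocess.run(apply_cmd + [site], check=True, timeout=timeout)
            subprocess.run(verify_cmd + [site], check=True, timeout=timeout)
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
            raise SystemExit(f"halting rollout at {site}: {exc}")

# staged_rollout(SITES, ["./apply_dns.sh"], ["./verify_dns.sh"])  # hypothetical
```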

OVH data-center fire: A blaze destroyed the Strasbourg SBG2 facility, taking down 3.6 million websites across 464,000 domains, including ESA's ONDA service, French government portals, and the crypto exchange Deribit.

Fastly CDN failure: A configuration change triggered a global wave of 503 errors, affecting major sites such as Amazon, Twitter, and the New York Times for about an hour.

Google Cloud outage: A misconfigured external load balancer (GCLB) caused a two-hour disruption for services such as Home Depot, Spotify, YouTube, and Gmail.

AWS multiple outages: In December, three separate incidents (network overload triggered by an internal client, automated traffic-shifting software, and a data-center power issue) impacted Disney, Netflix, Twitch, Slack, and many others.
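
One amplifier in incidents like the December 7 event is synchronized retries: thousands of clients retrying in lockstep can turn a transient network problem into sustained overload. Capped exponential backoff with full jitter is the usual countermeasure; below is a minimal Python sketch of that generic pattern, not AWS's internal code.

```python
import random
import time

def call_with_backoff(op, max_attempts=5, base=0.1, cap=10.0):
    """Retry with capped exponential backoff and full jitter, so client
    retries spread out instead of arriving in synchronized waves."""
    for attempt in range(max_attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise   # give up; let the caller surface the error
            # Full jitter: sleep a random amount up to the exponential cap.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```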

These cases demonstrate that most failures stem from human error, insufficient redundancy, or inadequate capacity planning, underscoring the need for robust incident response, multi‑region design, and continuous testing.

Tags: high availability, system reliability, incident response, outage analysis
Written by 21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go-to learning and service platform.