Tech Architecture Stories
Jun 14, 2025 · Operations

What Caused Google Cloud’s Massive June 2025 Outage and What We Can Learn

On June 12, 2025, a faulty policy update in Google’s Service Control triggered null-pointer crashes across regions, causing a global outage that also affected Cloudflare, Twitch, Discord, and other services. The incident exposed missing feature-flag protection, inadequate error handling, and the lack of exponential backoff, and it prompted rapid SRE remediation.

Google Cloud · SRE · cloud operations
7 min read
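One gap named in the summary is the lack of exponential backoff. As a minimal sketch (not Google's code; the function `retry_with_backoff` and all of its parameters are hypothetical), the Python below retries a failing call with capped exponential backoff plus "full jitter", so that many clients recovering at the same moment do not retry in lockstep against a shared backend. It assumes transient failures surface as `ConnectionError` or `TimeoutError`; any other exception is treated as non-retryable and propagates immediately.

```python
import random
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 10.0,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `operation`, retrying retryable errors with capped exponential backoff and full jitter.

    Hypothetical helper for illustration only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The upper bound doubles each attempt (base, 2*base, 4*base, ...)
            # but never exceeds max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a uniform random time in [0, cap] so that many
            # clients recovering at once spread out their retries.
            sleep(rng.uniform(0, cap))
    raise AssertionError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_call() -> str:
        # Hypothetical backend that fails twice before succeeding.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("backend overloaded")
        return "ok"

    print(retry_with_backoff(flaky_call, base_delay=0.01))  # prints "ok"
```

Injecting `sleep` and `rng` keeps the helper testable without real waiting, and restricting retries to an explicit tuple of exception types avoids hammering a backend with requests that will fail the same way every time.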
MaGe Linux Operations
Jan 6, 2025 · Operations

What 2024 Outages Teach Us About Building Resilient Systems

A review of major 2024 service disruptions—from Alibaba Cloud to OpenAI—highlights key lessons such as early disaster‑recovery planning, regular backups, load balancing, real‑time monitoring, performance tuning, and capacity planning to improve system reliability and reduce future downtime.

disaster recovery · outage analysis · system reliability
5 min read
FunTester
Nov 21, 2023 · Industry Insights

What Alibaba’s Recent Outages Reveal About Testing and Team Safety

The article examines three major Alibaba service disruptions, analyzes how insufficient testing and a lack of psychological safety among engineers may have contributed to the failures, and suggests ways to improve testing practices and workplace transparency.

Alibaba · Psychological Safety · cloud services
7 min read
21CTO
Mar 31, 2022 · Operations

What Caused the Biggest 2021 Outages? Lessons from Bilibili, Facebook, AWS, and More

The article reviews ten major 2021 service outages, from Chinese platforms such as Bilibili and Futu to global giants such as Facebook, Roblox, and AWS, analyzing their root causes and redundancy failures and the operational lessons needed to prevent future black-swan events.

high availability · incident response · outage analysis
15 min read
Programmer DD
Feb 8, 2022 · Operations

What Triggered the Biggest Internet Outages of 2021? Lessons from 10 Major Incidents

A comprehensive review of ten major 2021 internet outages, from Chinese platforms such as Bilibili and Futu to global services such as Facebook, Roblox, and AWS, examines their root causes, the role of infrastructure design, and the operational lessons needed to improve system resilience.

cloud infrastructure · incident response · outage analysis
16 min read
Efficient Ops
Dec 19, 2016 · Operations

What 16 Major 2016 Outages Teach Us About Disaster Recovery

This article reviews sixteen notable 2016 service outages across finance, cloud, and entertainment, analyzes their causes—ranging from power failures to DDoS attacks—and highlights the critical need for robust disaster‑recovery and information‑security practices.

Information Security · incident management · operations
11 min read