
What Triggered the Massive AWS Outage and Its Global Ripple Effect?

In late October 2025, a DNS failure in AWS’s DynamoDB service triggered a cascade of outages across EC2, load balancers, and Lambda, causing a roughly 14‑hour global disruption that affected more than 3,500 applications. On the same night, an overload at Taobao during its Double‑11 pre‑sale highlighted the challenges of scaling for sudden traffic spikes.


AWS Outage Overview

On October 19‑20, 2025, Amazon Web Services (AWS) suffered a large‑scale failure that lasted about 14‑15 hours, affecting more than 60 countries and roughly 3,500 internet applications.

Failure Process Summary

Phase 1: From 23:48 Oct 19 to 02:40 Oct 20 (UTC‑8), DynamoDB in the us‑east‑1 region showed a rising API error rate.

Phase 2: From 05:30 Oct 20 to 14:09 Oct 20, several Network Load Balancers (NLB) experienced a surge in connection errors caused by health‑check failures.

Phase 3: From 02:25 Oct 20 to 10:36 Oct 20, new EC2 instances failed to start; the issue was resolved by 13:50 Oct 20.

Root Cause

The AWS incident originated from a DNS resolution failure. A defect in DynamoDB’s automated DNS management system left the service endpoint without a resolvable IP address, and once the record was repaired roughly three hours later, the backlog of queued and retried requests overwhelmed DynamoDB.

The trigger was a latent defect in DynamoDB’s DNS management automation, which produced an empty DNS record for the endpoint dynamodb.us-east-1.amazonaws.com. The automation then failed to repair the record on its own.
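An empty record like this is straightforward to detect from outside the DNS automation itself. Below is a minimal watchdog sketch in Python; the interval, threshold, and alerting behavior are illustrative assumptions for this article, not part of AWS’s tooling.

```python
import socket
import time

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def resolve(hostname: str) -> list[str]:
    """Return the resolved addresses for hostname, or an empty list on failure."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return []

def watchdog(interval_s: int = 30, failures_before_alert: int = 3) -> None:
    """Alert once the endpoint has failed to resolve several times in a row."""
    consecutive_failures = 0
    while True:
        if resolve(ENDPOINT):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_alert:
                # A real system would page on-call here; printing stands in for that.
                print(f"ALERT: {ENDPOINT} has not resolved for about "
                      f"{consecutive_failures * interval_s} seconds")
        time.sleep(interval_s)
```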

DynamoDB DNS Management Components

DNS Planner: Monitors load‑balancer status and generates new DNS allocation plans.

DNS Executor: Applies the plans to Route 53; it runs as independent, redundant instances in three Availability Zones for resiliency.
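The failure mode described above, where an outdated plan gets applied and the record ends up empty, is exactly the kind of race a version guard on the executor side is meant to block. The sketch below shows the planner/executor split with such a guard; the class names, data structures, and checks are illustrative assumptions for this article, not AWS’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class DnsPlan:
    version: int                    # monotonically increasing plan number from the planner
    records: dict[str, list[str]]   # hostname -> list of IP addresses

class DnsExecutor:
    """Applies planner output to a DNS backend, refusing stale or empty plans."""

    def __init__(self) -> None:
        self.applied_version = -1
        self.zone: dict[str, list[str]] = {}

    def apply(self, plan: DnsPlan) -> bool:
        # Guard 1: never apply a plan older than the one already live.
        if plan.version <= self.applied_version:
            return False
        # Guard 2: never publish an empty record set for an endpoint that has one.
        for hostname, addresses in plan.records.items():
            if not addresses and self.zone.get(hostname):
                raise ValueError(f"refusing to empty the record set for {hostname}")
        self.zone.update(plan.records)
        self.applied_version = plan.version
        return True

# A delayed executor holding an old plan cannot overwrite a newer, healthy one.
executor = DnsExecutor()
executor.apply(DnsPlan(version=2, records={"db.example.internal": ["10.0.0.5"]}))
assert not executor.apply(DnsPlan(version=1, records={"db.example.internal": []}))
```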

Propagation and Impact

The DNS issue cascaded through the AWS ecosystem. EC2’s underlying management subsystem (DWFM) depends on DynamoDB for lease management, so the failure propagated to EC2, Lambda, and other services. Major global services—including Snapchat, Reddit, Discord, Roblox, Fortnite, PlayStation Network, Coinbase, Robinhood, Venmo, Canva, Asana, Zoom, Amazon Prime Video, Netflix, and Amazon’s own e‑commerce platform—experienced access problems or complete outages. Even UK customs and McDonald’s ordering systems were affected, and many Docker users could not pull images.
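A common way to keep this kind of dependency failure from stalling every caller is to wrap calls to the critical dependency in a circuit breaker, so requests fail fast and the service can degrade gracefully instead of queueing behind timeouts. The following is a minimal sketch with illustrative names; production systems usually rely on a hardened library rather than hand-rolled logic like this.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency has produced too many consecutive errors."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp of when the breaker opened

    def call(self, fn, *args, **kwargs):
        # While the breaker is open, reject immediately instead of waiting on timeouts.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: dependency presumed unavailable")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Example: lease-renewal calls to the database fail fast while it is unavailable,
# instead of every worker blocking and re-queueing work behind slow timeouts.
dynamodb_breaker = CircuitBreaker(failure_threshold=5, reset_timeout_s=30.0)
```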

[Figure: AWS outage impact diagram]

Taobao Outage

On the same night, Taobao’s Double‑11 pre‑sale promotion drove a traffic surge that exceeded server capacity, leading to frozen payment pages, duplicate charges, and incorrect order statuses. The incident exposed insufficient elastic scaling and the need for better handling of sudden traffic spikes. Service largely recovered before 22:00, and the promotional credits issued during the event were claimed to have saved users ¥8.3 billion.
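The standard defense against this kind of spike is to shed or queue excess load at the entry point rather than let it reach the payment backend. The sketch below uses a simple token bucket; the rate and burst numbers are purely illustrative and are not based on Taobao’s actual capacity.

```python
import time

class TokenBucket:
    """Admit roughly `rate` requests per second with a burst allowance of `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should shed the request (e.g. HTTP 429 with a retry hint)

# Illustrative numbers only: sustain ~1,000 payment submissions per second, burst of 200.
payment_limiter = TokenBucket(rate=1000.0, capacity=200.0)
if not payment_limiter.allow():
    pass  # shed the request instead of letting the payment page hang
```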

Conclusion and Reflections

These events underscore the fragility of large‑scale distributed systems and the importance of robust automation, dependency management, and recovery processes.

Automation is not a panacea: Latent defects in automated systems can become the root cause of major failures.

Dependency management is critical: Failures in core services can trigger domino effects across dependent services.

Recovery pathways must be resilient: Even after the root cause is fixed, the recovery process can collapse under massive request loads.
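The third point is exactly what played out once DynamoDB’s DNS record was restored: clients that had been retrying aggressively reconnected almost simultaneously and swamped the recovering service. Exponential backoff with jitter on the client side spreads that reconnection wave out. Below is a minimal sketch assuming a caller-supplied operation; it is not code from any AWS SDK.

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 8,
                      base_delay_s: float = 0.2, max_delay_s: float = 30.0):
    """Retry `operation` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap, so
            # thousands of clients do not retry in lockstep once the outage ends.
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```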

Tags: AWS, Cloud, DynamoDB, Outage
Written by Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career as we grow together.
