What Caused Cloudflare’s Half‑Internet Outage? A Deep Dive into the Technical Failure
Cloudflare suffered a massive multi-hour outage that knocked popular sites and AI services offline. First flagged as an abnormal traffic spike, the failure was traced to a database permission change that doubled the size of a feature file consumed by a Rust-based bot-management module, which panicked and brought down the routing software.
Incident Overview
On 18 November 2025, Cloudflare experienced a prolonged outage lasting roughly five and a half hours that affected ChatGPT, Claude, Shopify, and many other websites and services.
Timeline
05:20 UTC – abnormal traffic detected.
~06:20 UTC – status update posted indicating internal service failures.
08:09 UTC – engineers identified the problem and began remediation.
08:13 UTC – London WARP VPN service re‑enabled.
09:34 UTC – control‑panel services recovered.
11:44 UTC – full outage declared resolved.
Root‑Cause Analysis
The post‑mortem attributes the failure to a chain of low‑probability events:
A database user-permission change altered what a ClickHouse metadata query could see, so the query began returning rows for both the default schema and the underlying r0 schema.
Every feature row therefore appeared twice, roughly doubling the size of the feature file generated for the bot-management module.
The oversized file exceeded the module's hard limit on the number of features, so loading it returned an error instead of a feature set.
The bot-management module, written in Rust, called unwrap() on that fallible load, turning the error into a panic (reproduced in miniature in the sketch below).
The panic crashed the routing software, producing HTTP 5xx errors for downstream services (Workers KV, Access, etc.).
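The failure pattern is easy to reproduce in miniature. The sketch below is a hypothetical reconstruction, not Cloudflare's actual code: the FEATURE_LIMIT value, the load_features helper, and the one-feature-per-line format are all invented for illustration. What it shares with the chain above is the shape of the bug: a hard limit whose violation surfaces as an Err, which an unconditional unwrap() then escalates into a process-killing panic.

```rust
// Hypothetical reconstruction of the failure pattern; the names, the
// limit value, and the file format are illustrative, not Cloudflare's.
const FEATURE_LIMIT: usize = 200; // hard cap on entries in the file

/// Parse one feature per line, enforcing the hard limit.
/// Returns Err rather than silently truncating when the cap is exceeded.
fn load_features(file_contents: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = file_contents.lines().map(str::to_owned).collect();
    if features.len() > FEATURE_LIMIT {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            features.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(features)
}

fn main() {
    // Simulate the doubled file: every feature row appears twice.
    let doubled: String = (0..150)
        .flat_map(|i| {
            let row = format!("feature_{i}\n");
            [row.clone(), row]
        })
        .collect();

    // The unconditional unwrap() turns the Err into a panic -- in a
    // proxy, this is the line that takes the whole process down.
    let features = load_features(&doubled).unwrap();
    println!("loaded {} features", features.len());
}
```

Run as written, this panics on the "300 entries, limit is 200" error; with 100 or fewer source rows, the duplication would still have fit under the cap and the bug would have stayed latent, which is why the post-mortem describes the chain as low-probability.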
Technical Details
The Rust component is part of Cloudflare's effort to rewrite core proxy modules for memory safety. The problematic snippet performed an unconditional unwrap() on the result of loading the generated configuration file, encoding the assumption that the load could never fail. When duplicate entries pushed the file past its limit, the load did fail, and the unwrap() converted that recoverable error into a panic that propagated up and took the routing process down with it.
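A defensive variant, again only a sketch and reusing the hypothetical load_features from the previous snippet, propagates the error and keeps serving with the last configuration that validated, instead of panicking:

```rust
// Continues the previous sketch (reuses the hypothetical load_features).
// On a bad generated file, log and keep the last-known-good config
// rather than panicking and taking the proxy down.
fn reload_features(file_contents: &str, current: &mut Vec<String>) -> Result<(), String> {
    match load_features(file_contents) {
        Ok(fresh) => {
            *current = fresh; // a real proxy would swap this in atomically
            Ok(())
        }
        Err(e) => {
            // A real system would page an operator here; the key property
            // is that a bad file degrades one reload, not all of traffic.
            eprintln!("rejected feature file, keeping old config: {e}");
            Err(e)
        }
    }
}
```

The design choice worth noting: for a module that enriches requests rather than gating them, failing open on a bad config reload is often preferable to failing the entire data path.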
The ClickHouse metadata query behind the feature file did not filter on database name. Once the permission change made the r0 schema visible, the query returned matching rows from both the default schema and r0, so every feature appeared twice and the generated file roughly doubled in size. The routing software enforces a hard limit on the feature count; exceeding it causes the load to fail, which the unwrap() described above escalated into a crash.
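Restoring the database-name filter is the direct fix, but the generator could also defend in depth by collapsing duplicates before the file is written. A hypothetical sketch (the function name and feature names are invented for illustration):

```rust
use std::collections::BTreeSet;

// Hypothetical generator-side guard: collapse duplicate feature rows
// (e.g., the same column reported from both the default and r0 schemas)
// before the file is written and shipped to the proxies.
fn dedup_feature_rows<'a>(rows: impl IntoIterator<Item = &'a str>) -> Vec<String> {
    let mut seen = BTreeSet::new();
    rows.into_iter()
        .filter(|name| seen.insert(*name)) // keep first occurrence only
        .map(str::to_owned)
        .collect()
}

fn main() {
    // Each feature appears twice, as when both schemas matched the query.
    let rows = ["bot_score", "bot_score", "ja3_hash", "ja3_hash"];
    let unique = dedup_feature_rows(rows);
    assert_eq!(unique.len(), 2); // back to 2 entries, not 4
    println!("{} unique features", unique.len());
}
```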
Mitigation and Preventive Measures
Strengthen validation of internally generated configuration files against unexpected or missing entries.
Introduce additional global emergency-shutdown (kill) switches for critical components; a minimal sketch follows this list.
Limit core dumps and error reports to prevent resource exhaustion.
Conduct a comprehensive review of failure modes across all core proxy modules.
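On the kill-switch point, the essential property is that operators can shed a non-essential module without a redeploy. A minimal hypothetical sketch (the flag, its wiring, and the scoring logic are all invented for illustration; a production switch would be driven from a control plane, not a process-local static):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical global kill switch for the bot-management module.
static BOT_MGMT_DISABLED: AtomicBool = AtomicBool::new(false);

/// Score a request unless operators have disabled the module.
/// `None` means "no verdict": the proxy keeps serving traffic
/// instead of failing closed while bot management is unhealthy.
fn bot_score(request_path: &str) -> Option<u8> {
    if BOT_MGMT_DISABLED.load(Ordering::Relaxed) {
        return None; // module shed: skip scoring entirely
    }
    // Placeholder scoring logic for the sketch.
    Some(if request_path.contains("/login") { 30 } else { 90 })
}

fn main() {
    assert_eq!(bot_score("/login"), Some(30));

    // An operator flips the switch during an incident.
    BOT_MGMT_DISABLED.store(true, Ordering::Relaxed);
    assert_eq!(bot_score("/login"), None); // traffic still flows
    println!("bot management shed; proxy still serving");
}
```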
Implications
The outage demonstrated the systemic risk of concentrating so much of the web behind a single CDN provider. It also highlighted the challenges of rapid language migrations (to Rust, in this case) when disciplines such as keeping unwrap() out of production code paths are not enforced.
References
Cloudflare post‑mortem: https://blog.cloudflare.com/18-november-2025-outage/
Additional coverage: https://siliconangle.com/2025/11/18/cloudflare-outage-briefly-takes-chatgpt-claude-services-offline/
Ars Technica analysis: https://arstechnica.com/tech-policy/2025/11/widespread-cloudflare-outage-blamed-on-mysterious-traffic-spike