What Caused Cloudflare’s Half‑Internet Outage? A Deep Dive into the Technical Failure
Cloudflare suffered a massive multi-hour outage that knocked popular sites and AI services offline. First flagged as an abnormal traffic spike, the failure was traced to a database permission change that doubled the size of a feature file consumed by a Rust-based bot-management module, which panicked and brought down the routing software.
Incident Overview
On 18 November 2025, Cloudflare experienced a prolonged outage lasting roughly five and a half hours that affected ChatGPT, Claude, Shopify, and many other websites and services.
Timeline
05:20 UTC – abnormal traffic detected.
~06:20 UTC – status update posted indicating internal service failures.
08:09 UTC – engineers identified the problem and began remediation.
08:13 UTC – London WARP VPN service re‑enabled.
09:34 UTC – control‑panel services recovered.
11:44 UTC – full outage declared resolved.
Root‑Cause Analysis
The post‑mortem attributes the failure to a chain of low‑probability events:
A database user-permission change altered what a ClickHouse metadata query could see, so the query began returning rows for both the default schema and the underlying r0 schema.
Every feature row therefore appeared twice, roughly doubling the size of the feature file generated for the bot-management module.
The oversized file exceeded the module's hard limit on the number of features, so loading it returned an error instead of a feature set.
The bot-management module, written in Rust, called unwrap() on that fallible load, turning the error into a panic (reproduced in miniature in the sketch below).
The panic crashed the routing software, producing HTTP 5xx errors for downstream services (Workers KV, Access, etc.).
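The failure pattern is easy to reproduce in miniature. The sketch below is a hypothetical reconstruction, not Cloudflare's actual code: the FEATURE_LIMIT value, the load_features helper, and the one-feature-per-line format are all invented for illustration. What it shares with the chain above is the shape of the bug: a hard limit whose violation surfaces as an Err, which an unconditional unwrap() then escalates into a process-killing panic.

```rust
// Hypothetical reconstruction of the failure pattern; the names, the
// limit value, and the file format are illustrative, not Cloudflare's.
const FEATURE_LIMIT: usize = 200; // hard cap on entries in the file

/// Parse one feature per line, enforcing the hard limit.
/// Returns Err rather than silently truncating when the cap is exceeded.
fn load_features(file_contents: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = file_contents.lines().map(str::to_owned).collect();
    if features.len() > FEATURE_LIMIT {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            features.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(features)
}

fn main() {
    // Simulate the doubled file: every feature row appears twice.
    let doubled: String = (0..150)
        .flat_map(|i| {
            let row = format!("feature_{i}\n");
            [row.clone(), row]
        })
        .collect();

    // The unconditional unwrap() turns the Err into a panic -- in a
    // proxy, this is the line that takes the whole process down.
    let features = load_features(&doubled).unwrap();
    println!("loaded {} features", features.len());
}
```

Run as written, this panics on the "300 entries, limit is 200" error; with 100 or fewer source rows, the duplication would still have fit under the cap and the bug would have stayed latent, which is why the post-mortem describes the chain as low-probability.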
Technical Details
The Rust component is part of Cloudflare's effort to rewrite core proxy modules for memory safety. The problematic snippet performed an unconditional unwrap() on the result of loading the generated configuration file, encoding the assumption that the load could never fail. When duplicate entries pushed the file past its limit, the load did fail, and the unwrap() converted that recoverable error into a panic that propagated up and took the routing process down with it.
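A defensive variant, again only a sketch and reusing the hypothetical load_features from the previous snippet, propagates the error and keeps serving with the last configuration that validated, instead of panicking:

```rust
// Continues the previous sketch (reuses the hypothetical load_features).
// On a bad generated file, log and keep the last-known-good config
// rather than panicking and taking the proxy down.
fn reload_features(file_contents: &str, current: &mut Vec<String>) -> Result<(), String> {
    match load_features(file_contents) {
        Ok(fresh) => {
            *current = fresh; // a real proxy would swap this in atomically
            Ok(())
        }
        Err(e) => {
            // A real system would page an operator here; the key property
            // is that a bad file degrades one reload, not all of traffic.
            eprintln!("rejected feature file, keeping old config: {e}");
            Err(e)
        }
    }
}
```

The design choice worth noting: for a module that enriches requests rather than gating them, failing open on a bad config reload is often preferable to failing the entire data path.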
The ClickHouse metadata query behind the feature file did not filter on database name. Once the permission change made the r0 schema visible, the query returned matching rows from both the default schema and r0, so every feature appeared twice and the generated file roughly doubled in size. The routing software enforces a hard limit on the feature count; exceeding it causes the load to fail, which the unwrap() described above escalated into a crash.
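Restoring the database-name filter is the direct fix, but the generator could also defend in depth by collapsing duplicates before the file is written. A hypothetical sketch (the function name and feature names are invented for illustration):

```rust
use std::collections::BTreeSet;

// Hypothetical generator-side guard: collapse duplicate feature rows
// (e.g., the same column reported from both the default and r0 schemas)
// before the file is written and shipped to the proxies.
fn dedup_feature_rows<'a>(rows: impl IntoIterator<Item = &'a str>) -> Vec<String> {
    let mut seen = BTreeSet::new();
    rows.into_iter()
        .filter(|name| seen.insert(*name)) // keep first occurrence only
        .map(str::to_owned)
        .collect()
}

fn main() {
    // Each feature appears twice, as when both schemas matched the query.
    let rows = ["bot_score", "bot_score", "ja3_hash", "ja3_hash"];
    let unique = dedup_feature_rows(rows);
    assert_eq!(unique.len(), 2); // back to 2 entries, not 4
    println!("{} unique features", unique.len());
}
```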
Mitigation and Preventive Measures
Strengthen validation of internally generated configuration files against unexpected or missing entries.
Introduce additional global emergency-shutdown (kill) switches for critical components; a minimal sketch follows this list.
Limit core dumps and error reports to prevent resource exhaustion.
Conduct a comprehensive review of failure modes across all core proxy modules.
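On the kill-switch point, the essential property is that operators can shed a non-essential module without a redeploy. A minimal hypothetical sketch (the flag, its wiring, and the scoring logic are all invented for illustration; a production switch would be driven from a control plane, not a process-local static):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical global kill switch for the bot-management module.
static BOT_MGMT_DISABLED: AtomicBool = AtomicBool::new(false);

/// Score a request unless operators have disabled the module.
/// `None` means "no verdict": the proxy keeps serving traffic
/// instead of failing closed while bot management is unhealthy.
fn bot_score(request_path: &str) -> Option<u8> {
    if BOT_MGMT_DISABLED.load(Ordering::Relaxed) {
        return None; // module shed: skip scoring entirely
    }
    // Placeholder scoring logic for the sketch.
    Some(if request_path.contains("/login") { 30 } else { 90 })
}

fn main() {
    assert_eq!(bot_score("/login"), Some(30));

    // An operator flips the switch during an incident.
    BOT_MGMT_DISABLED.store(true, Ordering::Relaxed);
    assert_eq!(bot_score("/login"), None); // traffic still flows
    println!("bot management shed; proxy still serving");
}
```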
Implications
The outage demonstrated the systemic risk of concentrating so much of the web behind a single CDN provider. It also highlighted the challenges of rapid language migrations (to Rust, in this case) when disciplines such as keeping unwrap() out of production code paths are not enforced.
References
Cloudflare post‑mortem: https://blog.cloudflare.com/18-november-2025-outage/
Additional coverage: https://siliconangle.com/2025/11/18/cloudflare-outage-briefly-takes-chatgpt-claude-services-offline/
Ars Technica analysis: https://arstechnica.com/tech-policy/2025/11/widespread-cloudflare-outage-blamed-on-mysterious-traffic-spike