What Caused the Massive Cloudflare Outage on Nov 18 2025? A Deep Technical Breakdown
On the night of November 18, 2025, Cloudflare suffered a roughly three-hour failure of its core services that disrupted a large share of internet traffic. This article walks through the timeline, the global impact, the root cause in a ClickHouse permission change, and the remediation steps taken to restore service.
Incident Overview
At 19:20 Beijing time (11:20 UTC) on 18 Nov 2025, Cloudflare experienced a major outage that lasted about three hours for core services and roughly six hours until full restoration. Affected services worldwide returned HTTP 5xx errors or were stuck on human-verification pages.
Timeline (times in UTC+8)
19:05 – Engineers deployed a ClickHouse access‑control change.
19:28 – Change went live; outage began.
19:32‑21:05 – Investigation.
21:05 – First mitigation applied; core issue persisted.
21:37 – Root cause identified.
22:24 – Generation of malformed configuration files stopped; nodes rolled back to previous version.
22:30 – Core services recovered.
01:06 (next day) – All systems fully restored.
Impact
The outage affected a large share of global internet traffic. Services such as AI chat platforms (ChatGPT, Claude, Perplexity), social media (X, Discord, Grindr), streaming (Spotify), and online games (League of Legends, Minecraft) returned HTTP 500 Internal Server Error responses or were stuck on verification pages.
Root Cause Analysis
Cloudflare identified a bug triggered by a permission change in the ClickHouse database cluster. The change made tables in the underlying r0 database visible to the account whose query generates the "feature" configuration file used by the Bot Management module. Because that query filtered only on table name and not on database name, every column now appeared twice in the result, roughly doubling the size of the generated file. The enlarged file exceeded a hard limit in the Bot Management module, causing the module to fail on every edge node, which in turn produced HTTP 5xx responses for downstream services including Workers KV and Access.
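The failure mode described above can be sketched as a metadata query that filters on table name but not database name, so widening the set of visible databases silently widens the result. This is a toy model; the table and feature names are illustrative, not Cloudflare's actual schema.

```python
# Toy stand-in for ClickHouse's system.columns metadata table.
# Before the permission change the account could only see the "default"
# database; afterwards it could also see "r0", a replica holding the
# same tables, so every column appeared twice in the query result.
system_columns = [
    {"database": "default", "table": "http_request_features", "name": f"feature_{i}"}
    for i in range(60)
] + [
    {"database": "r0", "table": "http_request_features", "name": f"feature_{i}"}
    for i in range(60)
]

def visible_columns(visible_databases):
    # The generating query filtered only on table name, not on database,
    # so enlarging the set of visible databases enlarges the result.
    return [
        row["name"]
        for row in system_columns
        if row["database"] in visible_databases
        and row["table"] == "http_request_features"
    ]

before = visible_columns({"default"})        # 60 rows
after = visible_columns({"default", "r0"})   # 120 rows: every feature duplicated
```

The fix on the query side is simply to add the database predicate (e.g. `database = 'default'`), so the result no longer depends on which databases the account happens to be able to see.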
The Bot Management module relies on a machine-learning model that scores each incoming request. The model consumes the feature file, which is regenerated every few minutes and propagated across the entire network, so the duplicate rows introduced by the ClickHouse permission change reached the whole fleet within minutes, turning a single bad query result into a cascading global failure.
Remediation
Cloudflare stopped generation of new malformed configuration files, rolled nodes back to the last known-good configuration, and restored core services by 22:30 (14:30 UTC). Post-mortem actions include:
Review and tighten ClickHouse permission‑change procedures.
Introduce validation checks on the size of generated Bot Management feature files before distribution.
Enhance monitoring to detect abnormal file‑size growth early.
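The size-validation item above might look like the following pre-distribution guardrail: refuse to ship a newly generated feature file whose row count exceeds a hard cap, grows sharply versus the previous good version, or contains duplicates. This is a sketch of the idea only; the thresholds and function names are hypothetical, not Cloudflare's implementation.

```python
def validate_feature_file(new_rows, previous_rows, max_growth=1.5, hard_cap=200):
    # Reject outright if the file exceeds what consumers can hold.
    if len(new_rows) > hard_cap:
        return False, f"row count {len(new_rows)} exceeds hard cap {hard_cap}"
    # Reject sudden growth relative to the last known-good file.
    if previous_rows and len(new_rows) > max_growth * len(previous_rows):
        return False, "row count grew more than allowed vs previous version"
    # Duplicate feature names would have caught this specific incident.
    if len(set(new_rows)) != len(new_rows):
        return False, "duplicate feature names detected"
    return True, "ok"

prev = [f"feature_{i}" for i in range(150)]
ok, _ = validate_feature_file(prev, prev)            # unchanged file passes
bad, reason = validate_feature_file(prev + prev, prev)  # doubled file rejected
```

Running such a check before distribution converts a fleet-wide crash into a rejected deployment and an alert, which is exactly the failure containment the post-mortem calls for.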
References
Official outage report: https://blog.cloudflare.com/18-november-2025-outage/
Additional coverage: https://mp.weixin.qq.com/s/XmM9pjejZcMfH3gtO5DyZg and https://mp.weixin.qq.com/s/Lx2BiBiQPgsA5gbpJlNl3Q