
What Caused the Massive Cloudflare Outage on Nov 18 2025? A Deep Technical Breakdown

On the night of November 18, 2025, Cloudflare suffered a three‑hour core failure that crippled roughly half of the internet. This article details the timeline, the global impact, the root cause in a ClickHouse permission change, and the remediation steps taken to restore service.


Incident Overview

At 19:20 (UTC+8) on 18 Nov 2025, Cloudflare experienced a major outage that lasted about three hours for core services and roughly six hours until full restoration. The disruption caused HTTP 5xx errors or human‑verification pages for many global services.

Timeline (all times UTC+8)

19:05 – Engineers deployed a ClickHouse access‑control change.

19:28 – Change went live; outage began.

19:32‑21:05 – Investigation.

21:05 – First mitigation applied; the core issue persisted.

21:37 – Root cause identified.

22:24 – Generation of malformed configuration files stopped; nodes rolled back to previous version.

22:30 – Core services recovered.

01:06 (next day) – All systems fully restored.

Impact

The outage affected roughly half of global internet traffic. Services such as AI chat platforms (ChatGPT, Claude, Perplexity), social media (X, Discord, Grindr), streaming (Spotify), and online games (League of Legends, Minecraft) returned 500 Internal Server Error or were stuck on verification pages.

Root Cause Analysis

Cloudflare identified a bug triggered by a permission change in the ClickHouse database. The change made metadata in the underlying r0 database visible to users, so a metadata query that did not filter by database began returning each row twice, doubling the result set. The enlarged result inflated the “feature” configuration file used by the Bot Management module past the module’s preset limit, causing the Bot Management software on every edge node to crash. Those crashes surfaced as HTTP 5xx responses from downstream services, including Workers KV and Access.
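The doubling is easier to see in miniature. The sketch below uses hypothetical table and column names, not Cloudflare's actual query or schema: a metadata lookup that does not scope itself to one database returns one row per column while only default is readable, and every column twice once r0 becomes visible.

```rust
// Hypothetical sketch (names and schema are illustrative, not Cloudflare's):
// a metadata lookup that does not filter by database.

struct ColumnMeta {
    database: String,
    name: String,
}

fn visible_columns(readable_dbs: &[&str]) -> Vec<ColumnMeta> {
    // Stand-in for a system-catalog query such as ClickHouse's system.columns:
    // every readable database contributes one row per column of the table.
    let columns = ["feature_a", "feature_b", "feature_c"];
    readable_dbs
        .iter()
        .flat_map(|db| {
            columns.iter().map(move |c| ColumnMeta {
                database: db.to_string(),
                name: c.to_string(),
            })
        })
        .collect()
}

fn main() {
    let before = visible_columns(&["default"]);
    let after = visible_columns(&["default", "r0"]);
    assert_eq!(after.len(), 2 * before.len()); // the result set doubles
    for row in &after {
        println!("{}.{}", row.database, row.name); // each column now appears twice
    }
}
```

The query-side fix is equally small: scope the lookup to a single database, or de-duplicate by column name, so that a permission grant cannot change the shape of the result.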

The Bot Management module relies on a machine‑learning model that scores each request. The model consumes the feature file that is regenerated every few minutes and distributed across the network. Duplicate rows introduced by the ClickHouse permission change caused a sudden, massive increase in the file size, leading to a cascade of failures.
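The failure shape is worth a few lines of code. The sketch below is a simplified stand-in, not Cloudflare's source: a consumer with a fixed upper bound on the number of features treats an oversized file as an error, and if that error path is assumed to be unreachable, the process panics instead of degrading gracefully.

```rust
// Simplified stand-in for the described crash, not Cloudflare's actual code.
const MAX_FEATURES: usize = 200; // illustrative cap, not necessarily the real value

struct FeatureFile {
    names: Vec<String>,
}

fn load_features(file: &FeatureFile) -> Result<Vec<String>, String> {
    // The consumer preallocates room for a bounded number of features,
    // so an oversized file is a hard error, not a gradual slowdown.
    if file.names.len() > MAX_FEATURES {
        return Err(format!(
            "feature count {} exceeds limit {}",
            file.names.len(),
            MAX_FEATURES
        ));
    }
    Ok(file.names.clone())
}

fn main() {
    // Duplicate rows push the generated file to twice its normal size...
    let doubled = FeatureFile {
        names: (0..2 * MAX_FEATURES).map(|i| format!("f{i}")).collect(),
    };
    // ...and treating the oversized case as unreachable aborts the process,
    // which the edge then surfaces as HTTP 5xx.
    let _features = load_features(&doubled).unwrap(); // panics here
}
```

Because the same oversized file was distributed to every edge node within minutes, the crash was global rather than isolated to a single machine.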

Remediation

Cloudflare halted creation of new malformed configuration files, forced a rollback to the previous stable configuration, and restored core services by 22:30 (UTC+8). Post‑mortem actions include:

Review and tighten ClickHouse permission‑change procedures.

Introduce validation checks on the size of generated Bot Management feature files before distribution (a minimal sketch of such a check follows this list).

Enhance monitoring to detect abnormal file‑size growth early.
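To make the size-validation item concrete, here is a minimal sketch under stated assumptions: a hypothetical validate_feature_file gate run by the generator before pushing a new file, with an arbitrary growth threshold. None of these names come from Cloudflare's actual pipeline.

```rust
// Hypothetical pre-distribution gate; names and thresholds are illustrative.
fn validate_feature_file(new_bytes: u64, last_good_bytes: u64) -> Result<(), String> {
    const MAX_GROWTH: f64 = 1.5; // refuse files that grow more than 50% in one cycle
    let ratio = new_bytes as f64 / last_good_bytes as f64;
    if ratio > MAX_GROWTH {
        return Err(format!(
            "feature file grew {ratio:.2}x ({last_good_bytes} -> {new_bytes} bytes); holding distribution"
        ));
    }
    Ok(())
}

fn main() {
    // Normal drift between generations passes.
    assert!(validate_feature_file(1_050_000, 1_000_000).is_ok());
    // The Nov 18 failure mode (a doubled file) fails closed instead of shipping.
    assert!(validate_feature_file(2_000_000, 1_000_000).is_err());
}
```

Failing closed here keeps the last known-good file in circulation, the same state the manual rollback at 22:24 had to restore by hand.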

References

Official outage report: https://blog.cloudflare.com/18-november-2025-outage/

Additional coverage: https://mp.weixin.qq.com/s/XmM9pjejZcMfH3gtO5DyZg and https://mp.weixin.qq.com/s/Lx2BiBiQPgsA5gbpJlNl3Q

Tags: operations, CDN, incident analysis, outage, Cloudflare, Bot Management
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
