What Triggered Cloudflare’s Massive November 2025 Outage? Inside the Bot Management Failure
On November 18, 2025, Cloudflare suffered a multi‑hour network outage that crippled major services worldwide. A ClickHouse permission change generated oversized Bot Management feature files, leading to 5xx errors across CDN, security, and authentication layers and prompting a complex, step‑by‑step remediation effort.
Overview
On 2025‑11‑18 Cloudflare experienced a severe network failure that disrupted global internet traffic for several hours. The incident affected a wide range of downstream applications, including Twitter, ChatGPT, Spotify, Canva, Uber, Zoom, and many others, by returning HTTP 5xx errors and causing service unavailability.
Root Cause
The outage originated from a routine security hardening of the ClickHouse cluster that feeds Cloudflare’s Bot Management system. The permission change made an underlying storage schema visible to existing users, so a metadata query against system tables returned duplicate rows, doubling the size of the feature file that is regenerated every five minutes. The oversized file exceeded the module’s hard‑coded limit of 200 features, triggering a panic in the FL2 proxy and causing it to return 5xx responses, while the older FL proxy did not crash but produced incorrect bot scores.
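To make the mechanism concrete, here is a minimal sketch of how a hard‑coded feature limit plus an unwrap() on the loader’s result turns an oversized file into a thread panic. The constant, type, and function names are assumptions for illustration, not Cloudflare’s actual code:

// Hypothetical sketch of the failure mode: a hard-coded feature limit
// plus an unwrap() on the loader's Result turns an oversized file into
// a thread panic. Names and types are illustrative, not Cloudflare's.

const MAX_FEATURES: usize = 200; // hard-coded capacity limit

#[derive(Debug)]
struct FeatureFile {
    names: Vec<String>,
}

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { got: usize, max: usize },
}

fn load_features(lines: &[&str]) -> Result<FeatureFile, LoadError> {
    if lines.len() > MAX_FEATURES {
        // Exceeding the preallocated limit is treated as unrecoverable.
        return Err(LoadError::TooManyFeatures { got: lines.len(), max: MAX_FEATURES });
    }
    Ok(FeatureFile { names: lines.iter().map(|s| s.to_string()).collect() })
}

fn main() {
    // Duplicate rows from the extra schema double the feature count: 120 -> 240.
    let doubled: Vec<String> = (0..240).map(|i| format!("feature_{}", i % 120)).collect();
    let refs: Vec<&str> = doubled.iter().map(String::as_str).collect();

    // unwrap() on the Err value panics the worker thread, as in the incident:
    // "thread ... panicked: called Result::unwrap() on an Err value"
    let _file = load_features(&refs).unwrap();
}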
Impact on Services
Because the Bot Management module is integrated into the core proxy (FL), the failure propagated to every service that relies on it (a simplified sketch of this propagation follows the list), including:
Core CDN and security services (HTTP 5xx errors)
Turnstile (failed to load)
Workers KV (authentication failures)
Dashboard (login issues due to Turnstile outage)
Email security (reduced spam detection accuracy)
Access (authentication failures)
Additional symptoms included a downed status page, increased latency from debugging subsystems, and temporary spikes in internal API errors.
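To see why the blast radius is so wide, consider a minimal sketch of a proxy whose request path runs every module in sequence: a hard failure in any mandatory module turns every request into a 5xx. The trait and module names below are assumptions, not Cloudflare’s architecture:

// Hypothetical sketch of why one module's failure becomes a fleet-wide 5xx:
// every request runs through the same module chain in the core proxy, so an
// error from any mandatory module aborts the request with a server error.

struct Request { path: String }
struct Response { status: u16, body: String }

trait Module {
    fn handle(&self, req: &Request) -> Result<(), String>;
}

struct BotManagement { config_ok: bool }

impl Module for BotManagement {
    fn handle(&self, _req: &Request) -> Result<(), String> {
        if self.config_ok { Ok(()) } else { Err("bad feature file".into()) }
    }
}

fn serve(req: &Request, chain: &[&dyn Module]) -> Response {
    for module in chain {
        if let Err(e) = module.handle(req) {
            // A failure in any module on the hot path surfaces as a 5xx.
            return Response { status: 500, body: format!("internal error: {e}") };
        }
    }
    Response { status: 200, body: format!("served {}", req.path) }
}

fn main() {
    let broken = BotManagement { config_ok: false };
    let resp = serve(&Request { path: "/".into() }, &[&broken as &dyn Module]);
    println!("{} {}", resp.status, resp.body); // 500 internal error: bad feature file
}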
Timeline and Mitigation Steps
Key moments in the incident response were:
11:05 UTC – Database access‑control change deployed.
11:28 UTC – First HTTP errors observed by customers.
11:32‑13:05 UTC – Workers KV traffic spikes; team applied traffic‑control and account‑limit mitigations.
13:05 UTC – Workers KV and Access were switched to bypass the core proxy, falling back to an earlier proxy version and reducing impact.
13:37 UTC – Decision to roll back the Bot Management feature file to the last known good version (see the sketch after this timeline).
14:24 UTC – Automatic generation of new Bot Management files stopped.
14:30 UTC – Correct feature file deployed globally; core traffic began to recover.
17:06 UTC – All services fully restored.
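The 13:37 UTC decision reflects a common pattern: retain the last configuration that passed validation and fall back to it when a candidate fails. Here is a minimal sketch of that pattern, with assumed names and validation rules rather than Cloudflare’s actual pipeline:

// Hypothetical sketch of a "last known good" rollback: keep the previous
// validated config and keep serving it when a new candidate fails validation.

const MAX_FEATURES: usize = 200;

#[derive(Clone, Debug)]
struct Config { features: Vec<String>, version: u64 }

fn validate(cfg: &Config) -> Result<(), String> {
    if cfg.features.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit {}", cfg.features.len(), MAX_FEATURES));
    }
    Ok(())
}

struct Deployer { last_known_good: Option<Config> }

impl Deployer {
    fn deploy(&mut self, candidate: Config) -> &Config {
        match validate(&candidate) {
            Ok(()) => {
                self.last_known_good = Some(candidate);
            }
            Err(e) => {
                // Reject the bad file and keep serving the previous version.
                eprintln!("rejecting config v{}: {}", candidate.version, e);
            }
        }
        self.last_known_good.as_ref().expect("no good config ever deployed")
    }
}

fn main() {
    let mut d = Deployer { last_known_good: None };
    d.deploy(Config { features: vec!["f1".into()], version: 1 });
    // An oversized candidate is rejected; v1 stays active.
    let active = d.deploy(Config { features: vec!["f".into(); 240], version: 2 });
    println!("active version: {}", active.version); // active version: 1
}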
Technical Details
The problematic ClickHouse query was similar to:
SELECT name, type FROM system.columns WHERE table='http_requests_features' ORDER BY name;

Because the query did not filter on the database name, the permission change let it see the underlying r0 schema in addition to the default schema, and it returned rows from both, effectively doubling the number of feature rows. An explicit database = 'default' filter would have kept the result stable.
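The duplication effect is easy to simulate: reading column metadata across two visible databases instead of one returns each feature twice, while deduplicating on the column name (or filtering on the database) keeps the count stable. The databases and feature names below are invented for illustration:

// Hypothetical sketch of the duplication effect: once the permission change
// exposed the r0 schema, an unfiltered system.columns query returned each
// column once per database, doubling the feature list.

use std::collections::BTreeSet;

// (database, column_name) rows as the metadata query might return them.
fn system_columns(visible_dbs: &[&str]) -> Vec<(String, String)> {
    let columns = ["bot_score", "ja4_fingerprint", "user_agent_hash"];
    visible_dbs
        .iter()
        .flat_map(|db| columns.iter().map(move |c| (db.to_string(), c.to_string())))
        .collect()
}

fn main() {
    let before = system_columns(&["default"]);
    let after = system_columns(&["default", "r0"]); // permission change: r0 now visible

    println!("rows before: {}", before.len()); // 3
    println!("rows after:  {}", after.len());  // 6 -- feature count doubled

    // Deduplicating on column name restores the intended feature set.
    let unique: BTreeSet<&str> = after.iter().map(|(_, c)| c.as_str()).collect();
    println!("unique features: {}", unique.len()); // 3
}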
The Rust panic that halted the FL2 worker thread was captured as:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

Remediation and Future Prevention
After restoring service, Cloudflare outlined several long‑term actions:
Improve ingestion validation for generated configuration files.
Introduce more global kill‑switches for risky features (see the sketch after this list).
Prevent core dumps and other error‑reporting mechanisms from consuming excessive system resources.
Review failure modes for all core proxy modules.
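As a sketch of the kill‑switch item, a runtime flag can let operators skip a misbehaving module and fail open with a neutral default instead of returning errors. The flag, scorer, and fallback behavior below are assumptions, not Cloudflare’s control plane:

// Hypothetical sketch of a per-module kill-switch: when the switch is
// flipped, the proxy skips the module and serves traffic with a safe
// default instead of erroring. Names are illustrative.

use std::sync::atomic::{AtomicBool, Ordering};

// Global kill-switch, flippable at runtime from a control plane.
static BOT_MANAGEMENT_ENABLED: AtomicBool = AtomicBool::new(true);

fn bot_score(_request_path: &str) -> Result<u8, String> {
    // Stand-in for the real scorer; fails when its config is bad.
    Err("feature file exceeds limit".to_string())
}

fn handle_request(path: &str) -> u16 {
    if BOT_MANAGEMENT_ENABLED.load(Ordering::Relaxed) {
        match bot_score(path) {
            Ok(_score) => return 200,
            Err(_e) => return 500, // module failure takes the request down
        }
    }
    // Kill-switch active: skip scoring, fail open with a neutral default.
    200
}

fn main() {
    println!("with module: HTTP {}", handle_request("/")); // 500
    BOT_MANAGEMENT_ENABLED.store(false, Ordering::Relaxed); // operator flips switch
    println!("switch off:  HTTP {}", handle_request("/")); // 200
}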
Conclusion
The November 2025 Cloudflare outage was the company’s most severe incident since 2019, highlighting the cascading risk of a single misconfigured feature file in a globally distributed edge network. The post‑mortem demonstrates the importance of rigorous change validation, observability, and rapid rollback mechanisms in large‑scale cloud operations.