
Why a Credential‑Rotation Mistake Took Down Cloudflare R2 and Its Dependent Services

On 21 March 2025, a credential deployed to the wrong environment during a key rotation for the R2 Gateway caused an outage lasting 1 hour 7 minutes. All write operations and about 35% of reads failed across R2 and several dependent Cloudflare services. The post-mortem describes the failure in detail and lists corrective actions to make credential changes more visible and safer.

ITPUB

On 21 March 2025, Cloudflare experienced a prolonged incident affecting R2 object storage and several dependent services. From 21:38 UTC to 22:45 UTC (a total of 1 hour 7 minutes), all write requests to R2 failed (100 % error rate) and roughly 35 % of read requests failed, while metadata‑only operations remained functional.

Root Cause

The outage was triggered during a scheduled credential rotation for the R2 Gateway Worker, which authenticates with the storage backend. The new ID/key pair was mistakenly deployed to the development environment instead of the production Worker. When the old credentials were removed from the storage infrastructure, the production Gateway could no longer authenticate, which produced the write and read failures described above. No data was lost: every upload or modification that returned a successful HTTP response was persisted.

Timeline of Key Events

19:49 UTC – Credential rotation process started; old credentials retained for continuity.

20:19 UTC – wrangler secret put and wrangler deploy executed without the required --env flag, deploying the new credentials to the default (development) Worker.

20:20 UTC – The default‑environment Worker began using the new credentials, while the production Worker still held the old ones.

20:37 UTC – Old credentials removed from the storage backend.

21:38 UTC – R2 availability metrics began to decline; the impact started.

21:45 UTC – Global R2 alert triggered (error‑budget consumption 2 %).

21:51 UTC – Engineers observed a gradual drop in read/write availability; metadata operations were unaffected.

22:05 UTC – Public status page published.

22:15 UTC – New credentials generated to force re‑propagation.

22:30 UTC – Another deployment attempted, again missing --env, still targeting the default Worker.

22:36 UTC – Root cause confirmed: credentials were deployed to the non‑production Worker.

22:45 UTC – Correct credentials deployed to the production Worker (--env production), restoring R2 availability.

22:54 UTC – Incident marked resolved.
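The intervals implied by the timeline can be checked with a few lines of arithmetic. The following is a minimal sketch that copies the timestamps listed above; the dictionary keys and the helper name are illustrative labels, not terms from the post-mortem.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
# The key names are labels chosen for this sketch only.
EVENTS = {
    "rotation_started": "19:49",
    "wrong_env_deploy": "20:19",
    "old_credentials_removed": "20:37",
    "impact_started": "21:38",
    "alert_triggered": "21:45",
    "root_cause_confirmed": "22:36",
    "production_fixed": "22:45",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from EVENTS[start] to EVENTS[end]."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    pairs = [
        ("wrong_env_deploy", "old_credentials_removed"),
        ("old_credentials_removed", "impact_started"),
        ("impact_started", "alert_triggered"),
        ("impact_started", "root_cause_confirmed"),
        ("impact_started", "production_fixed"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {minutes_between(start, end)} min")
```

The last pair reproduces the stated impact window of 1 hour 7 minutes (67 minutes). The sketch also makes visible that the old credentials were removed only 18 minutes after the misdirected deployment, and that roughly an hour passed between that removal and the first visible impact.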

Service‑Specific Impact

R2: 100 % write failures; ~35 % read failures; metadata (head/list) operations unaffected.

Billing: Errors when customers attempted to download historical invoices stored in R2.

Cache Reserve: Increased origin requests due to R2 read failures, though end-user resource requests did not fail because of cache fallback.

Email Security: Customer-facing metrics that rely on R2 were not updated.

Images: All upload operations failed; successful image delivery dropped to ~25 %.

Key Transparency Auditor: All read/write operations failed.

Log Delivery: Significant delays (up to 70 minutes) in log processing; all logs eventually delivered after resolution.

Stream: 100 % upload failures; video delivery success fell to 94 % with occasional stuttering.

Vectorize: High query error rates; all insert and update operations failed (100 %).

R2 Architecture Overview

R2 consists of three main components: the production Gateway Worker (handling S3, REST, and Workers API requests), a metadata service, and the encrypted object storage backend.
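To make the failure mode concrete, here is a toy model of the three components. It is only a sketch under stated assumptions: the class names, method names, and key IDs are hypothetical and do not reflect R2's real interfaces. It models writes and metadata lookups only, and it does not attempt to explain why most reads kept succeeding during the incident.

```python
class StorageBackend:
    """Hypothetical stand-in for the object storage backend."""

    def __init__(self, valid_key_ids):
        self.valid_key_ids = set(valid_key_ids)
        self.objects = {}

    def put(self, key_id, name, data):
        # Reject any request made with a credential the backend no longer knows.
        if key_id not in self.valid_key_ids:
            raise PermissionError(f"credential {key_id!r} is not valid")
        self.objects[name] = data


class MetadataService:
    """Hypothetical metadata store; it does not use the storage credentials."""

    def __init__(self):
        self.sizes = {}

    def record(self, name, size):
        self.sizes[name] = size

    def head(self, name):
        return self.sizes.get(name)


class Gateway:
    """Hypothetical gateway holding one credential ID for the backend."""

    def __init__(self, key_id, backend, metadata):
        self.key_id = key_id
        self.backend = backend
        self.metadata = metadata

    def put_object(self, name, data):
        """Return True only if the backend accepted and stored the object."""
        try:
            self.backend.put(self.key_id, name, data)
        except PermissionError:
            return False
        self.metadata.record(name, len(data))
        return True

    def head_object(self, name):
        return self.metadata.head(name)


def simulate():
    # During rotation the backend accepts both the old and the new key.
    backend = StorageBackend({"old-key", "new-key"})
    # The production gateway never received the new key.
    gateway = Gateway("old-key", backend, MetadataService())
    write_before = gateway.put_object("a.txt", b"hello")
    backend.valid_key_ids.discard("old-key")  # old credentials removed
    write_after = gateway.put_object("b.txt", b"world")
    head_after = gateway.head_object("a.txt")
    return write_before, write_after, head_after, "a.txt" in backend.objects
```

In this model, writes fail as soon as the old key is removed, metadata lookups keep working, and the object written earlier is still present, which mirrors the reported symptoms.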

Credential‑Rotation Procedure

Create a new ID/key pair for the storage infrastructure while retaining the old credentials.

Run wrangler secret put to store the new credentials in the R2 production Gateway Worker.

Execute wrangler deploy to push the new credentials as environment variables to the production Worker.

Remove the old credentials from the storage backend to complete the rotation.

Monitor dashboards and logs to verify the transition.

Crucially, both wrangler secret put and wrangler deploy target the default environment when the --env flag is omitted. In this incident, the missing flag caused the new credentials to be applied to the default (development) Worker instead of the production Worker.
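One way to guard against a missing flag is to build the two wrangler invocations through a helper that refuses to proceed without an explicit environment. The sketch below is illustrative rather than Cloudflare's tooling: the helper names, the secret name STORAGE_KEY_ID, and the allowed-environment set are assumptions. It only constructs argument lists and does not run wrangler.

```python
from typing import List, Optional

# Assumed set of environments that a rotation is allowed to target.
ALLOWED_ENVS = {"production"}


def wrangler_command(action: List[str], env: Optional[str]) -> List[str]:
    """Build a wrangler argv list, always with an explicit --env value."""
    if not env:
        raise ValueError("refusing to build a wrangler command without --env")
    if env not in ALLOWED_ENVS:
        raise ValueError(f"environment {env!r} is not an allowed rotation target")
    return ["wrangler", *action, "--env", env]


def rotation_commands(secret_name: str, env: Optional[str]) -> List[List[str]]:
    """Return the secret-put and deploy commands for one environment."""
    return [
        wrangler_command(["secret", "put", secret_name], env),
        wrangler_command(["deploy"], env),
    ]


if __name__ == "__main__":
    # STORAGE_KEY_ID is a hypothetical secret name used for illustration.
    for argv in rotation_commands("STORAGE_KEY_ID", "production"):
        print(" ".join(argv))
```

Calling rotation_commands("STORAGE_KEY_ID", None) raises ValueError instead of silently falling back to the default environment, which is the behavior that went unnoticed at 20:19 UTC and again at 22:30 UTC.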

Post‑Mortem Actions

Added logging tags that include the credential ID suffix, enabling explicit verification of which token is in use.

Updated internal processes to require a check that the new token ID suffix matches storage logs before deleting the old token.

Mandated that key rotations be performed via the hot‑patch deployment tool, which enforces environment flags and additional safety checks.

Revised the SOP to require at least two engineers to approve any credential change.

Ongoing work: extending the closed‑loop health‑check system to automatically report credential propagation status, and enhancing observability by adding upstream success‑rate views that bypass caches.
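The suffix check and the two-engineer approval can be combined into a single gate that runs before the old token is deleted. The following is a minimal sketch, not the actual internal process: the function names, the four-character suffix length, and the idea of passing in suffixes already extracted from recent storage logs are assumptions. It also interprets "matches" strictly, requiring every recent production request to carry the new suffix.

```python
from typing import Iterable, Set


def suffix(credential_id: str, length: int = 4) -> str:
    """Return the trailing characters used to identify a credential in logs."""
    return credential_id[-length:]


def safe_to_delete_old(
    new_credential_id: str,
    storage_log_suffixes: Iterable[str],
    approvers: Set[str],
) -> bool:
    """Decide whether the old credential may be removed.

    storage_log_suffixes: credential ID suffixes observed in recent storage
    logs for production traffic (hypothetical input, extracted elsewhere).
    approvers: names of the engineers who approved the change.
    """
    observed = list(storage_log_suffixes)
    expected = suffix(new_credential_id)
    # Require evidence: at least one request, and all of them on the new key.
    production_uses_new_key = bool(observed) and all(s == expected for s in observed)
    # Require at least two distinct approvers.
    enough_approvers = len(set(approvers)) >= 2
    return production_uses_new_key and enough_approvers
```

With the data from this incident, the gate would have returned False at 20:37 UTC, because production traffic was still being signed with the old credential.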

These measures aim to prevent similar credential‑rotation failures and improve overall resilience of the R2 service.
