What Caused Google Cloud’s Massive June 2025 Outage and What We Can Learn
On June 12, 2025, a faulty policy update in Google’s Service Control triggered null-pointer crashes in every region, causing a global outage that also took down Cloudflare, Twitch, Discord, and others. The incident exposed a missing feature flag, inadequate error handling, and the absence of randomized exponential backoff, and it prompted a rapid SRE remediation.
On June 12, 2025, Google experienced one of its most severe global outages since 2009, affecting Google Cloud, Google APIs, and downstream services such as Cloudflare, Twitch, Discord, Spotify, Claude, OpenAI, and Microsoft 365.
The outage originated in Google’s Service Control, the component that performs policy checks and quota management for API requests. An update at the end of May introduced new quota-checking logic that was never exercised during its regional rollout, leaving a code path with neither proper error handling nor a feature-flag safeguard.
At 10:45 a.m. Pacific Time, a policy update containing blank fields propagated globally within seconds. When Service Control read those empty fields during quota checks, it hit a null-pointer exception and began crash-looping in every region.
Because the new code path had no error handling, the blank fields surfaced as an unhandled null-pointer crash rather than a rejected policy.
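To make that failure mode concrete, here is a minimal Go sketch, using hypothetical names such as `Policy` and `checkQuota` (the real Service Control code is not public), of the difference between dereferencing an optional field directly and validating it first:

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for a replicated policy record.
// QuotaField is a pointer because the field may be absent (blank) in the data.
type Policy struct {
	Name       string
	QuotaField *int
}

// checkQuotaUnsafe mirrors the failure mode: it dereferences the optional
// field without checking, so a blank field panics (a nil-pointer crash).
func checkQuotaUnsafe(p *Policy) int {
	return *p.QuotaField // panics when QuotaField is nil
}

// checkQuota handles unexpected data defensively: a blank field becomes an
// error the caller can reject or log, not a process-wide crash.
func checkQuota(p *Policy) (int, error) {
	if p == nil || p.QuotaField == nil {
		return 0, errors.New("policy has blank quota field; rejecting")
	}
	return *p.QuotaField, nil
}

func main() {
	bad := &Policy{Name: "blank-policy"} // QuotaField left nil
	if _, err := checkQuota(bad); err != nil {
		fmt.Println("handled gracefully:", err)
	}
}
```

The point of the defensive version is that a malformed policy becomes a rejected request and a log line, not a crash loop across the fleet.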
As the crashed Service Control tasks all restarted at once, the resulting “herd effect” of simultaneous startup reads overloaded the Spanner databases that back policy and quota data.
The SRE team identified the root cause within 10 minutes, had an emergency “red button” rollback ready in 25 minutes, and deployed it within 40 minutes, allowing smaller regions to recover quickly. Larger regions such as us-central1 took up to 2 hours 40 minutes because of the herd effect and the missing exponential backoff.
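Randomized exponential backoff is the standard cure for that restart stampede. The following Go sketch, with a hypothetical `reconnect` operation standing in for the startup read against the policy store, doubles the delay after each failure and adds full jitter so that simultaneously restarting tasks spread their retries instead of hitting the backend in lockstep:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with randomized exponential backoff:
// the delay window doubles on each failure, and full jitter spreads
// simultaneous restarts so they do not hammer the backend together.
func retryWithBackoff(op func() error, maxAttempts int) error {
	base := 100 * time.Millisecond
	ceiling := 30 * time.Second
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		// Full jitter: sleep a uniformly random duration in [0, base).
		sleep := time.Duration(rand.Int63n(int64(base)))
		fmt.Printf("attempt %d failed (%v); retrying in %v\n", attempt+1, err, sleep)
		time.Sleep(sleep)
		if base *= 2; base > ceiling {
			base = ceiling
		}
	}
	return errors.New("giving up after max attempts")
}

func main() {
	attempts := 0
	// reconnect is a hypothetical stand-in for re-reading policy data
	// from the backing store on startup.
	reconnect := func() error {
		attempts++
		if attempts < 4 {
			return errors.New("backend overloaded")
		}
		return nil
	}
	if err := retryWithBackoff(reconnect, 10); err != nil {
		fmt.Println(err)
	}
}
```

With jitter, a fleet of tasks that crash at the same moment comes back spread across the retry window, which is exactly the relief the overloaded Spanner backend needed.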
Google’s post‑mortem highlights four key problems:
1. Missing feature flag
The new quota-checking path lacked a feature flag, preventing staged validation and early detection (a minimal sketch of flag-gating follows this list).
2. Insufficient error handling
The code did not defensively handle unexpected data (blank fields), leading to unhandled null‑pointer exceptions.
3. No randomized exponential backoff
Restart tasks did not implement backoff, causing a herd effect that overloaded Spanner.
4. Global synchronization risk
Policy data is replicated worldwide in real time, so a problem in one region instantly spreads globally.
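For the first problem, here is a minimal Go sketch of flag-gating, with hypothetical names such as `flagEnabled` and `checkQuotaNew`: the new path is enabled in a single canary region first and can be turned off everywhere without a binary release, which also limits the blast radius the fourth problem describes.

```go
package main

import "fmt"

// flags is a hypothetical in-memory flag store; in production this would be
// dynamic configuration that can be flipped per region without a new release.
var flags = map[string]map[string]bool{
	// Enable the new quota path only in a canary region first.
	"new-quota-check": {"us-east5": true},
}

func flagEnabled(name, region string) bool {
	return flags[name][region]
}

func checkQuotaLegacy(req string) string { return "legacy quota check: " + req }
func checkQuotaNew(req string) string    { return "new quota check: " + req }

// handle routes a request through the new path only where the flag is on,
// so a bug in the new logic surfaces in one region during staged rollout
// instead of crashing every region at once.
func handle(req, region string) string {
	if flagEnabled("new-quota-check", region) {
		return checkQuotaNew(req)
	}
	return checkQuotaLegacy(req)
}

func main() {
	fmt.Println(handle("req-1", "us-east5"))    // canary region: new path
	fmt.Println(handle("req-2", "us-central1")) // everywhere else: legacy path
}
```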
The incident reinforces two stability principles: prevent problems before they occur and address small issues before they grow.
References:
- Google Cloud incident report: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW
- Cloudflare outage analysis: https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/