
Google Cloud 2025 Outage: Lessons Learned and Nacos Gray Release Solutions

A massive Google Cloud outage on June 12, 2025 was caused by an untested Service Control feature that threw a null‑pointer exception, and the failure cascaded globally. This article explains how configuration gray‑release techniques, especially Nacos IP‑based and label‑based canary releases, can prevent similar disasters.

Alibaba Cloud Native

Introduction

On 12 June 2025, Google Cloud experienced a major failure that disrupted Gmail, YouTube, Google Search, Cloud APIs and countless downstream applications. The incident began at 10:51 PT and was fully resolved at 18:18 PT, lasting roughly 7 hours 27 minutes.

Root Cause Analysis

Google Cloud added a new quota‑policy check feature to the Service Control system. The new code path reached production without being exercised in testing or a staged rollout, and it was not protected by a feature flag. It also lacked error handling for blank fields, so when a policy change containing blank fields was pushed, the code threw a null‑pointer exception. Service Control instances became unresponsive and entered a crash loop, which propagated the failure to dependent services.
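To make the failure mode concrete, here is a minimal sketch of a quota check that validates a policy before reading its fields. Every name in it (QuotaPolicyCheck, QuotaPolicy, Decision) is hypothetical and does not come from Service Control. Skipping a malformed policy instead of rejecting the request is also an assumption chosen for illustration.

```java
import java.util.Optional;

// Hypothetical sketch: none of these names come from Google's Service Control.
final class QuotaPolicyCheck {

    /** Minimal policy shape; either field may arrive as null or blank. */
    static final class QuotaPolicy {
        final String metric;
        final Long limit;

        QuotaPolicy(String metric, Long limit) {
            this.metric = metric;
            this.limit = limit;
        }
    }

    enum Decision { ALLOW, DENY, SKIPPED_INVALID_POLICY }

    /** Returns the policy only if every required field is present. */
    static Optional<QuotaPolicy> validate(QuotaPolicy policy) {
        if (policy == null || policy.limit == null
                || policy.metric == null || policy.metric.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(policy);
    }

    /** Checks usage against the policy; a malformed policy is skipped, never dereferenced. */
    static Decision check(QuotaPolicy policy, long usage) {
        Optional<QuotaPolicy> valid = validate(policy);
        if (!valid.isPresent()) {
            return Decision.SKIPPED_INVALID_POLICY; // assumption: fail open on bad data
        }
        return usage <= valid.get().limit ? Decision.ALLOW : Decision.DENY;
    }

    public static void main(String[] args) {
        System.out.println(check(new QuotaPolicy("requests", 100L), 42));
        System.out.println(check(new QuotaPolicy("", null), 42));
    }
}
```

With a guard like this, a policy containing blank fields produces an explicit SKIPPED_INVALID_POLICY result that can be logged and alerted on, rather than an exception that takes the process down.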

Impact

Service Control is the core component for API management and quota enforcement. Its crash triggered a chain reaction that affected Gmail, YouTube, Search, Google Cloud services and any internet applications relying on those APIs, resulting in a worldwide outage.

What Is Configuration Gray Release?

Configuration gray release (also known as canary deployment) is a strategy that gradually rolls out new configuration parameters to a subset of service instances or users, reducing risk and allowing validation before a full rollout.

Typical scenarios: feature‑flag toggles, threshold adjustments, critical configuration changes in high‑availability systems.

Common Implementation Paths

Gray release can be implemented along several dimensions:

Identifier‑based: user ID, IP address, device type.

Rule‑based: whitelist/blacklist, tags, business rules.

Traffic‑based: percentage of traffic, geographic region, time window.

Architecture‑based: service mesh, API gateway, configuration center.
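As an illustration of the traffic‑based dimension, the sketch below assigns each identifier (for example a user ID) to a stable bucket from 0 to 99 and treats it as part of the gray group when the bucket falls below the rollout percentage. The class PercentageGraySelector and the use of CRC32 as the hash are assumptions made for this example, not part of any configuration center.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of percentage-based gray selection.
final class PercentageGraySelector {

    /** Maps an identifier to a stable bucket in the range [0, 100). */
    static int bucketOf(String identifier) {
        CRC32 crc = new CRC32();
        crc.update(identifier.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** True when the identifier belongs to the first {@code percent} buckets. */
    static boolean inGray(String identifier, int percent) {
        return bucketOf(identifier) < percent;
    }

    public static void main(String[] args) {
        String userId = "user-42";
        System.out.println("bucket=" + bucketOf(userId)
                + " inGrayAt10Percent=" + inGray(userId, 10));
    }
}
```

Because the bucket depends only on the identifier, raising the percentage from 10 to 50 keeps every identifier that was already in the gray group and only adds new ones, which keeps a gradual rollout consistent between steps.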

Configuration Centers Supporting Gray Release

Among mainstream configuration centers, Nacos and Apollo provide gray‑release capabilities. Nacos supports both IP‑based and label‑based gray releases; Apollo currently offers only IP‑based gray release but can be extended through custom development.

How to Perform Gray Release with Nacos

Core Features

IP‑based gray release – target specific instances by IP address.

Label‑based gray release – assign key‑value tags to instances and release configurations to matching tags.

Namespace isolation, dataId/group distinction, and other advanced controls.

Step‑by‑Step IP Gray Release

Log in to the MSE console (https://mse.console.aliyun.com/) and select the target Nacos instance.

Edit the desired configuration and choose “IP‑based gray release”.

Select or manually enter the IP addresses of the instances that should receive the new configuration.

Click “Publish Gray”, confirm the comparison between the current production version and the gray version, then confirm the release.
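Conceptually, the steps above amount to keeping two versions of one configuration and choosing between them by client IP. The following is a minimal sketch of that selection logic under stated assumptions: IpGrayConfig is a hypothetical class written for this article and does not reflect how Nacos implements the feature internally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: serves the gray version only to listed IP addresses.
final class IpGrayConfig {
    private final String productionContent;
    private final String grayContent;
    private final Set<String> grayIps;

    IpGrayConfig(String productionContent, String grayContent, String... grayIps) {
        this.productionContent = productionContent;
        this.grayContent = grayContent;
        this.grayIps = new HashSet<>(Arrays.asList(grayIps));
    }

    /** Returns the gray content for targeted IPs and the production content otherwise. */
    String contentFor(String clientIp) {
        return grayIps.contains(clientIp) ? grayContent : productionContent;
    }

    public static void main(String[] args) {
        IpGrayConfig config = new IpGrayConfig("timeout=3000", "timeout=5000", "10.0.0.7");
        System.out.println(config.contentFor("10.0.0.7"));
        System.out.println(config.contentFor("10.0.0.8"));
    }
}
```

Only the listed instance sees the new value, so a bad change is confined to that one instance while the rest of the fleet keeps the production version.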

Step‑by‑Step Label Gray Release

Set the application label on the client, e.g. nacos.config.gray.label=yourgrayname, via properties, JVM arguments, or environment variables.

In the console, edit the configuration and choose “Label‑based gray release”. Select the label key‑value pair that matches the target instances.

Publish the gray configuration and monitor the listener list to verify which instances receive the new version.

Gradually expand the label scope or roll back immediately if anomalies are detected.
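The expand‑or‑roll‑back loop in the last step can be sketched as a small state machine. This is an illustrative model only: LabelGrayRollout, its stage list, and the boolean health signal are assumptions, and in practice the expansion and rollback are performed in the Nacos console as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: widens a label-based gray release stage by stage.
final class LabelGrayRollout {
    private final List<String> plannedLabels; // labels in the order they join the gray group
    private int activeCount = 0;              // how many planned labels currently receive gray config

    LabelGrayRollout(String... plannedLabels) {
        this.plannedLabels =
                Collections.unmodifiableList(new ArrayList<>(Arrays.asList(plannedLabels)));
    }

    /** True when instances carrying this label currently receive the gray configuration. */
    boolean receivesGray(String instanceLabel) {
        return plannedLabels.subList(0, activeCount).contains(instanceLabel);
    }

    /** Adds the next planned label to the gray group, if any remain. */
    void expand() {
        if (activeCount < plannedLabels.size()) {
            activeCount++;
        }
    }

    /** Removes every label from the gray group so all instances see production config. */
    void rollback() {
        activeCount = 0;
    }

    /** Expands after a healthy observation and rolls back after an unhealthy one. */
    void onHealthCheck(boolean healthy) {
        if (healthy) {
            expand();
        } else {
            rollback();
        }
    }

    public static void main(String[] args) {
        LabelGrayRollout rollout = new LabelGrayRollout("canary", "zone-a", "zone-b");
        rollout.expand();            // publish to the first label
        rollout.onHealthCheck(true); // healthy, so widen to the second label
        System.out.println(rollout.receivesGray("zone-a"));
    }
}
```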

Code Example (Java)

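import java.util.Properties;
import java.util.concurrent.Executor;
import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.PropertyKeyConst;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.config.listener.Listener;
import com.alibaba.nacos.api.exception.NacosException;

public class GrayLabelExample {
    public static void main(String[] args) throws NacosException {
        // Properties style: set the gray label on the client properties
        Properties properties = new Properties();
        properties.put(PropertyKeyConst.SERVER_ADDR, "your endpoint");
        properties.put("project.name", "your app name");
        properties.put("nacos.config.gray.label", "yourgrayname");

        // JVM argument (alternative to the property above)
        // -Dnacos.config.gray.label=yourgrayname

        // Environment variable (alternative to the property above)
        // export nacos_config_gray_label=yourgrayname

        // Create the client that the listener is registered on
        ConfigService configService = NacosFactory.createConfigService(properties);

        String dataId = "gray_test_dataid";
        String group = "test-group";
        // Print every configuration version pushed to this instance
        configService.addListener(dataId, group, new Listener() {
            @Override
            public Executor getExecutor() { return null; } // null: run on the client's notification thread
            @Override
            public void receiveConfigInfo(String configInfo) {
                System.out.println("receiveConfig:" + configInfo);
            }
        });
    }
}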
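A listener callback is also a natural place to validate a new configuration before applying it, so that a malformed value is rejected instead of crashing the application. The sketch below shows one way to keep the last known good value. SafeConfigHolder and the sample validator are hypothetical, and the class deliberately has no Nacos dependency; tryApply would be called from inside receiveConfigInfo.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

// Hypothetical sketch: applies a new config only if it passes validation.
final class SafeConfigHolder {
    private final AtomicReference<String> current;
    private final Predicate<String> validator;

    SafeConfigHolder(String initial, Predicate<String> validator) {
        this.current = new AtomicReference<>(initial);
        this.validator = validator;
    }

    /** Returns true and swaps in the candidate when valid; otherwise keeps the old value. */
    boolean tryApply(String candidate) {
        boolean valid;
        try {
            valid = candidate != null && validator.test(candidate);
        } catch (RuntimeException e) {
            valid = false; // a failing validator must not take the process down
        }
        if (valid) {
            current.set(candidate);
        }
        return valid;
    }

    /** The last configuration that passed validation. */
    String get() {
        return current.get();
    }

    public static void main(String[] args) {
        // Assumed rule for the example: the config must be a positive integer.
        SafeConfigHolder holder =
                new SafeConfigHolder("100", s -> Integer.parseInt(s.trim()) > 0);
        System.out.println(holder.tryApply("") + " -> " + holder.get());
    }
}
```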

Conclusion

The 2025 Google Cloud outage demonstrates how a single untested configuration change can trigger a worldwide service collapse. Employing gray‑release techniques—such as Nacos IP and label canary deployments—limits the blast radius of configuration errors, enables rapid rollback, and improves overall system reliability.
