
Gray Release (Canary Deployment) Strategies and Practices

The article explains gray release as a smooth, risk‑mitigating deployment method, outlines why it is needed, describes its limitations, and compares four practical gray‑release solutions—including code‑level flags, pre‑release machines, SET isolation, and dynamic routing—before recommending a combined approach.

Baidu Intelligent Testing

Gray release is a smooth transition between an old and a new version. Like an A/B test, some users stay on version A while others move to version B; if B performs well, it is gradually rolled out to everyone. This keeps the overall system stable and surfaces problems early.

Why Use Gray Release

1. Internet services change frequently with short release cycles, making it hard to balance speed and quality.

2. Gray release reduces release risk and limits the impact scope.

3. It lowers dependence on extensive testing and reduces the cost of constructing offline test data.

4. Logs are easier to monitor in one place: gray traffic lands on a small set of machines, whereas after a full release load balancing scatters requests across the fleet and makes a complete call chain hard to follow.

5. The gray version can be exposed to test accounts first and to real user accounts afterwards, further reducing risk.

6. It simplifies rollback.

Problems Gray Release Cannot Solve

Gray release limits the impact of a bad release, but the impact it tolerates must be recoverable, such as a temporary API outage that can be fixed later. Permanent loss or corruption of user data (e.g., product or order information) is unacceptable; architects must provide backups, write-ahead logs, and other safeguards so that data can be restored from recent snapshots.

Tips

Start the gray release with test accounts to lower the risk of damaging or losing real user data.

Desired Effect

Regardless of the change type, we want specific requests to be routed to the gray version so that we can observe and verify behavior.

Gray Strategies

Typical routing criteria include:

1. Specific users (e.g., test accounts)

2. Specific apps (e.g., test or partner apps)

3. Specific modules or interfaces (only certain APIs are gray‑tested)

4. Specific machines (certain IPs are forwarded to gray servers)
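The four criteria above can be combined into a single routing decision. The following is a minimal sketch, not the article's implementation: the names `Request`, `GrayPolicy`, and `choose_version`, and all sample values, are hypothetical, and "specific machines" is modeled here as a match on the caller's IP address.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Request:
    user_id: int
    app_id: str
    path: str
    client_ip: str


@dataclass
class GrayPolicy:
    # Every field is an illustrative assumption; fill it from your own config.
    users: set = field(default_factory=set)   # 1. specific users
    apps: set = field(default_factory=set)    # 2. specific apps
    path_prefixes: tuple = ()                 # 3. specific interfaces
    networks: tuple = ()                      # 4. specific machines (CIDR blocks)

    def matches(self, req: Request) -> bool:
        """Return True if any one criterion selects this request."""
        if req.user_id in self.users:
            return True
        if req.app_id in self.apps:
            return True
        if any(req.path.startswith(p) for p in self.path_prefixes):
            return True
        addr = ip_address(req.client_ip)
        return any(addr in ip_network(n) for n in self.networks)


def choose_version(policy: GrayPolicy, req: Request) -> str:
    """Route matching requests to the gray version, everything else to stable."""
    return "gray" if policy.matches(req) else "stable"
```

Treating the criteria as alternatives (any match sends the request to gray) is a design choice made for this sketch; a stricter policy could require several criteria to match at once.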

Gray Solution Discussion

Solution 1: Code‑Level Flag (Amazon Approach)

Implementation: embed a switch in the code and branch on it with an if-else check; turn the switch on for gray machines and leave it off everywhere else. Each release therefore carries both the old and the new code path.
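As a rough illustration of the if-else switch, the sketch below keeps both code paths in one build and picks between them per request. `GraySwitch`, `greet_old`, `greet_new`, and `handle` are hypothetical names; in a real system the switch value would presumably come from a configuration source read at runtime, which is what would make rollback possible without a redeploy.

```python
class GraySwitch:
    """In-memory stand-in for a runtime-configurable switch (assumption)."""

    def __init__(self, enabled: bool = False) -> None:
        self._enabled = enabled

    def turn_on(self) -> None:
        self._enabled = True

    def turn_off(self) -> None:
        self._enabled = False

    def is_on(self) -> bool:
        return self._enabled


def greet_old(name: str) -> str:
    # Old behavior, still shipped in the same release.
    return "Hello, " + name


def greet_new(name: str) -> str:
    # New behavior under gray verification.
    return "Welcome back, " + name + "!"


def handle(switch: GraySwitch, name: str) -> str:
    # The switch is checked on every call, so flipping it takes effect
    # immediately and rollback is just turn_off().
    if switch.is_on():
        return greet_new(name)
    return greet_old(name)
```

On a gray machine the switch would be on; rolling back amounts to calling `turn_off()`, after which every request takes the old path again.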

Advantages

Fast rollback without redeploying or restarting the system.

Disadvantages

a. Intrusive to the codebase.

b. Branch logic adds complexity.

Example: At Alibaba, a status variable was used to switch a product database from Oracle to MySQL, achieving a smooth migration.

Solution 2: Pre‑Release Machine (Alibaba Approach)

This is not a true gray release. The pre-release machine has only an internal IP and is not exposed externally, so testers must bind the domain name to that IP to verify a release. It works on fully live data, which means it effectively serves a small group of internal test users. Our API has a similar concept, the Gamma environment.

Advantages

Simple to set up.

Disadvantages

a. Consumes an entire machine (can be repurposed after release with ops support).

b. Lacks flexibility.

c. Only suitable for front-end layers; IDL service gray release needs separate handling.

Solution 3: SET Deployment

1. Business‑Level Isolation

Deploy at the API Container level, e.g.:

a. Weigou (微购物) API Container: api.weigou.qq.com

b. Paipai (拍拍) API Container: api.paipai.com

c. Yixun (易迅) API Container: api.yixun.com

d. Wanggou (网购, online shopping) API Container: api.buy.qq.com

Further granularity can be achieved at the module level, such as a virtual e‑commerce API deployed on dedicated machines and routed via Nginx.

2. User‑Level Isolation

For platforms like QQ, users are divided into sets of 100 million IDs. A release can target a specific set (e.g., SET 10) to minimize impact.
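A minimal sketch of user-level SET selection follows. It assumes that each set is a contiguous block of 100 million IDs and that sets are numbered from 0; the article does not state how IDs are actually assigned to sets, and the function names are hypothetical.

```python
SET_SIZE = 100_000_000  # IDs per set, taken from the text above


def set_index(uin: int) -> int:
    """Map a user ID to its set, assuming contiguous ID ranges per set."""
    if uin < 0:
        raise ValueError("uin must be non-negative")
    return uin // SET_SIZE


def is_gray_user(uin: int, gray_sets: frozenset) -> bool:
    """True if the user's set is one of the sets receiving the new release."""
    return set_index(uin) in gray_sets


# Release only to SET 10: IDs 1,000,000,000 through 1,099,999,999.
GRAY_SETS = frozenset({10})
```

Under these assumptions a user with ID 1,050,000,000 falls in SET 10 and sees the new version, while a user with ID 999,999,999 falls in SET 9 and stays on the old one.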

Advantages

Isolated deployment reduces cross‑business impact and supports automatic gray release.

Disadvantages

a. Granularity depends on deployment isolation, often coarse.

b. May waste resources compared to centralized deployment.

c. Different business lines may run different versions, complicating management.

d. Higher implementation and deployment cost.

Solution 4: Dynamic Routing

Method: Use a configurable gray strategy that influences the load balancer to return specific IPs/ports for gray services. Suitable for backend IDL service gray releases.
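The idea can be sketched as a resolver that hands back a gray instance address when the caller matches a configured rule and a stable address otherwise. This is an illustration only: `GrayRule`, `ServiceRoutes`, and `resolve` are hypothetical names, matching on caller UIN or source IP is an assumption of this sketch, and a real load balancer would have its own API and selection algorithm.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GrayRule:
    # Configurable strategy: which callers are sent to gray instances.
    uins: frozenset = frozenset()
    source_ips: frozenset = frozenset()

    def matches(self, uin: int, source_ip: str) -> bool:
        return uin in self.uins or source_ip in self.source_ips


@dataclass
class ServiceRoutes:
    stable: list  # (ip, port) pairs running the current version
    gray: list    # (ip, port) pairs running the gray version
    rule: GrayRule


def resolve(routes: ServiceRoutes, uin: int, source_ip: str, rng=random):
    """Return one (ip, port) for the call, preferring gray for matching callers."""
    use_gray = bool(routes.gray) and routes.rule.matches(uin, source_ip)
    pool = routes.gray if use_gray else routes.stable
    # Random choice stands in for whatever balancing the real system uses.
    return rng.choice(pool)
```

If the gray pool is empty the resolver falls back to the stable pool, so clearing the gray instance list or the rule acts as a rollback.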

Advantages

Flexible and controllable.

Disadvantages

a. Current configuration centers and L5 do not support custom routing strategies and lack extensibility; external development is required.

b. API metadata is scattered across multiple sources, necessitating an additional gray-routing data source.

Final Solution

1. API Container adopts the pre-release machine model for gray release.

2. IDL services adopt the dynamic routing model (limited to UIN or IP sources, as there is no AppID concept).

Conclusion

Gray release is a mindset as much as a technical strategy. Internet products are upgraded constantly, and every upgrade carries risks of incompatibility, user churn, and system downtime. Many teams adopt gray release to confine the impact of a bad release, roll back quickly, and keep the system evolving stably.

For large‑scale user testing, Baidu’s Mobile Cloud Testing Platform (MTC) leverages a crowdsourced model with 10,000 certified testers to provide massive user testing, rapid recruitment, and low‑cost feedback collection.

Baidu MTC offers bug exploration, compatibility testing, real‑device remote debugging, and security vulnerability scanning for developers.

Note: This article is Baidu exclusive content; please credit “Baidu QA” when reproducing.
