Gray Release (Canary Deployment) Strategies and Practices
The article explains gray release as a smooth, risk-mitigating deployment method, outlines why it is needed, describes its limitations, and compares four practical gray-release solutions (code-level flags, pre-release machines, SET isolation, and dynamic routing) before recommending a combined approach.
Gray release refers to a smooth transition between old and new versions, similar to an A/B test where a portion of users continue using version A while others switch to version B; if B performs well, it is gradually rolled out to all users, ensuring overall system stability and early issue detection.
Why Use Gray Release
1. Internet services change frequently with short release cycles, making it hard to balance speed and quality.
2. Gray release reduces release risk and limits the impact scope.
3. It lowers dependence on extensive testing and reduces the cost of constructing offline test data.
4. Centralized log monitoring becomes easier: in a full release, load balancing spreads requests across all machines and obscures the complete call chain, whereas gray traffic is confined to a small set of machines.
5. Test accounts can be gray‑released first, then real user accounts, further reducing risk.
6. It simplifies rollback.
Problems Gray Release Cannot Solve
Gray release only limits impact that is tolerable, and tolerable impact must be recoverable, such as a temporary API outage that can be fixed later. Permanent loss or corruption of user data (e.g., product or order information) is unacceptable; architects must provide backups, write-ahead logs, and other safeguards so that data can be restored from recent snapshots.
Tips
Gray-release to test accounts first to lower the risk of damaging or losing real user data.
Desired Effect
Regardless of the change type, we want specific requests to be routed to the gray version so that we can observe and verify behavior.
Gray Strategies
Typical routing criteria include:
1. Specific users (e.g., test accounts)
2. Specific apps (e.g., test or partner apps)
3. Specific modules or interfaces (only certain APIs are gray‑tested)
4. Specific machines (certain IPs are forwarded to gray servers)
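The four criteria above can be sketched as a single matching rule. This is only an illustration: the `Request` and `GrayStrategy` types and all of their field names are hypothetical, and the sketch assumes a request is gray as soon as any one criterion matches.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Request:
    # Hypothetical request shape used only for this sketch.
    user_id: int
    app_id: str
    api: str
    client_ip: str


@dataclass
class GrayStrategy:
    # Empty collections mean "no rule of this kind is configured".
    users: set[int] = field(default_factory=set)       # e.g. test accounts
    apps: set[str] = field(default_factory=set)        # e.g. test or partner apps
    apis: set[str] = field(default_factory=set)        # only these interfaces
    networks: list[str] = field(default_factory=list)  # source IPs, CIDR form

    def matches(self, req: Request) -> bool:
        """Return True if any configured criterion selects this request."""
        if req.user_id in self.users:
            return True
        if req.app_id in self.apps:
            return True
        if req.api in self.apis:
            return True
        addr = ip_address(req.client_ip)
        return any(addr in ip_network(net) for net in self.networks)
```

A request for which `matches` returns True would be sent to the gray version; everything else stays on the current version.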
Gray Solution Discussion
Solution 1: Code‑Level Flag (Amazon Approach)
Implementation: embed a switch in the code and branch on it with an if-else check; turn the switch on for gray machines and off everywhere else. Each release therefore carries both the old and the new code path.
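A minimal sketch of such a switch follows. The switch store, the switch name `new_checkout`, and the two checkout functions are hypothetical; a real system would presumably read the switch from a configuration source rather than a module-level dictionary.

```python
# Hypothetical in-memory switch store standing in for a real config source.
SWITCHES: dict[str, bool] = {"new_checkout": False}


def is_on(name: str) -> bool:
    """Look the switch up on every call so a flip takes effect immediately."""
    return SWITCHES.get(name, False)


def checkout_v1(total_cents: int) -> int:
    # Old behavior: charge the full amount.
    return total_cents


def checkout_v2(total_cents: int) -> int:
    # New behavior under gray test: apply a 10% discount.
    return total_cents - total_cents // 10


def checkout(total_cents: int) -> int:
    # Both versions ship in the same build; the switch picks one per call.
    if is_on("new_checkout"):
        return checkout_v2(total_cents)
    return checkout_v1(total_cents)
```

Because the switch is checked per call, turning it off again routes the next request back to the old path, which is what makes rollback possible without a redeploy. The cost is visible too: every gray-tested change adds a branch that must later be cleaned up.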
Advantages
Fast rollback without redeploying or restarting the system.
Disadvantages
a. Intrusive to the codebase.
b. Branch logic adds complexity.
Example: At Alibaba, a status variable was used to switch a product database from Oracle to MySQL, achieving a smooth migration.
Solution 2: Pre‑Release Machine (Alibaba Approach)
This is not a true gray release. The pre-release machine has only an internal IP and is not exposed externally, so verification requires binding the domain to that machine. Its data is fully live, so it effectively serves a subset of internal test users. Our API has a similar concept, the Gamma environment.
Advantages
Simple to set up.
Disadvantages
a. Consumes an entire machine (can be repurposed after release with ops support).
b. Lacks flexibility.
c. Only suitable for front-end layers; gray release of IDL services needs separate handling.
Solution 3: SET Deployment
1. Business‑Level Isolation
Deploy at the API Container level, e.g.:
a. Weigou (micro-shopping) API Container: api.weigou.qq.com
b. Paipai API Container: api.paipai.com
c. Yixun API Container: api.yixun.com
d. Online shopping API Container: api.buy.qq.com
Further granularity can be achieved at the module level, such as a virtual e‑commerce API deployed on dedicated machines and routed via Nginx.
2. User‑Level Isolation
For platforms like QQ, users are divided into sets of 100 million IDs. A release can target a specific set (e.g., SET 10) to minimize impact.
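As a rough illustration of user-level SETs, the sketch below maps a user ID to a set number and checks whether that set is under gray release. It assumes sets are contiguous ID ranges numbered from zero; the function names are hypothetical and the real partitioning scheme may differ.

```python
SET_SIZE = 100_000_000  # IDs per set, as described in the text


def set_of(uin: int) -> int:
    """Return the set number for a user ID, assuming contiguous zero-based ranges."""
    if uin < 0:
        raise ValueError("uin must be non-negative")
    return uin // SET_SIZE


def is_gray(uin: int, gray_sets: frozenset[int]) -> bool:
    """Return True if the user's set is one of the sets receiving the new release."""
    return set_of(uin) in gray_sets
```

Under these assumptions, releasing to SET 10 means calling `is_gray(uin, frozenset({10}))`, which selects only IDs from 1,000,000,000 through 1,099,999,999.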
Advantages
Isolated deployment reduces cross‑business impact and supports automatic gray release.
Disadvantages
a. Granularity depends on deployment isolation and is often coarse.
b. May waste resources compared to centralized deployment.
c. Different business lines may run different versions, complicating management.
d. Higher implementation and deployment cost.
Solution 4: Dynamic Routing
Method: use a configurable gray strategy that makes the load balancer return the IPs and ports of gray instances for matching requests. This suits gray releases of backend IDL services.
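The idea can be sketched as an endpoint lookup that consults the gray strategy before choosing an instance. Everything here is hypothetical (`ServiceRoutes`, `resolve`, and matching on user ID alone); it does not reflect the API of any existing load balancer or configuration center.

```python
import random
from dataclasses import dataclass

Endpoint = tuple[str, int]  # (ip, port)


@dataclass
class ServiceRoutes:
    # Hypothetical routing record for one backend service.
    stable: list[Endpoint]                   # instances running the current version
    gray: list[Endpoint]                     # instances running the gray version
    gray_uins: frozenset[int] = frozenset()  # users selected by the gray strategy


def resolve(routes: ServiceRoutes, uin: int) -> Endpoint:
    """Pick an endpoint for this caller, preferring gray instances on a match."""
    # Fall back to the stable pool if no gray instance is registered, so a
    # misconfigured strategy cannot leave matching users without a backend.
    if uin in routes.gray_uins and routes.gray:
        return random.choice(routes.gray)
    return random.choice(routes.stable)
```

Since the strategy is data rather than code, changing `gray_uins` or emptying the gray pool redirects traffic without touching the services themselves.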
Advantages
Flexible and controllable.
Disadvantages
a. Current configuration centers and L5 do not support custom routing strategies and lack extensibility; external development is required.
b. API metadata is scattered across multiple sources, necessitating an additional gray-routing data source.
Final Solution
1. API Container adopts the pre-release machine model for gray release.
2. IDL services adopt the dynamic routing model (limited to UIN or IP sources, as there is no AppID concept).
Conclusion
Gray release is not only a technical strategy but also a mindset. Internet products are upgraded constantly, and each upgrade brings risks of incompatibility, user churn, and system downtime. Many teams adopt gray release to contain the impact of a bad release, enable quick rollback, and ensure stable evolution.
For large‑scale user testing, Baidu’s Mobile Cloud Testing Platform (MTC) leverages a crowdsourced model with 10,000 certified testers to provide massive user testing, rapid recruitment, and low‑cost feedback collection.
Baidu MTC offers bug exploration, compatibility testing, real‑device remote debugging, and security vulnerability scanning for developers.
Note: This article is Baidu exclusive content; please credit “Baidu QA” when reproducing.