A Simple Gray Release Solution for High‑Concurrency Flight Ticket Systems
This article presents a lightweight gray release approach for complex flight ticket services. It compares the traditional hardware and soft‑routing isolation methods, then describes our traffic‑based gray identification, business‑focused monitoring, implementation details, and the automated safeguards that make incremental deployments safe.
1 Background
Gray release is a common industry practice for reducing release risk. Typical approaches require an isolated gray environment, built with either hardware or software isolation, and both are costly to implement and maintain. Because the flight ticket business is complex, failures caused by releases are frequent, so we urgently needed a gray release solution that is simple and feasible.
2 Common Industry Solutions
For the effect of a gray release to be measurable, the gray environment must be isolated. There are two main ways to do this.
Gray Machine Isolation
Implementation: build a physically isolated, complete gray environment and have it serve online traffic.
Figure 1
Advantages: minimal application changes, supports monitoring isolation, traffic routing, manual verification.
Disadvantages: high implementation and maintenance cost.
Soft Routing Traffic Isolation
Implementation: use soft load‑balancing logic to isolate the gray environment.
Figure 2
Advantages: no extra cost, supports monitoring isolation, traffic routing, manual verification.
Disadvantages: large application changes, high risk.
3 Our Solution
Hardware isolation is hard to maintain in complex systems, so our approach is based on soft routing isolation, although it differs from the usual form in several ways.
3.1 Thought Process
Goal: expose release risk with minimal traffic.
"Small traffic" is usually defined by user, by route, or as a random proportion such as 1%, 5%, or 10%.
Those definitions lead to routing the selected traffic to a designated environment. Our answer is simpler: whatever traffic flows through the gray machines is treated as the small traffic.
Figure 3
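The idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described here: the host names, the `Request` type, and the `handle` function are hypothetical, and making the flag sticky once set is an assumption based on the identifier being passed between systems.

```python
from dataclasses import dataclass

# Hypothetical host names; the article does not say how gray machines are registered.
GRAY_HOSTS = {"flight-app-gray-01"}


@dataclass
class Request:
    trace_id: str
    gray: bool = False  # gray identifier carried along with the request


def handle(request: Request, local_host: str) -> Request:
    """Mark the request as gray if the machine serving it is a gray machine.

    The flag is never cleared here, so a request that has passed through
    one gray machine stays gray on later hops (an assumption of this sketch).
    """
    if local_host in GRAY_HOSTS:
        request.gray = True
    return request
```

No routing rule decides which requests are gray. The load balancer spreads traffic as usual, and the share that lands on gray machines becomes the gray traffic.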
We then use business monitoring to judge whether gray traffic is behaving normally. Because gray traffic is monitored separately, fluctuations in its metrics indicate the health of the release.
Figure 4
Figure 5
Core monitoring includes business volume monitoring (Figure 4) and business result monitoring (Figure 5); gray release should focus on business result monitoring.
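One way to read "business result monitoring" is to compare how business outcomes are distributed in gray traffic against normal traffic. The sketch below assumes each request ends in a named result. The result names and both helper functions are hypothetical and are not taken from the monitoring system used here.

```python
from collections import Counter


def result_share(results):
    """Return each business result's share of all results (fractions summing to 1)."""
    counts = Counter(results)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


def max_share_drift(baseline, gray):
    """Largest absolute difference in share for any result between the two groups."""
    names = set(baseline) | set(gray)
    return max(
        (abs(baseline.get(n, 0.0) - gray.get(n, 0.0)) for n in names),
        default=0.0,
    )


# Made-up sample data: normal traffic versus traffic through gray machines.
normal = result_share(["ok"] * 90 + ["no_seat"] * 10)
gray = result_share(["ok"] * 6 + ["no_seat"] * 1 + ["error"] * 3)
drift = max_share_drift(normal, gray)
```

Comparing shares rather than raw counts matters because gray traffic is only a small part of total volume, so its counts are not comparable with those of normal traffic.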
Two theoretical foundations:
1. Gray traffic is identified by flow through specific machines.
2. Monitoring emphasizes business result metrics.
3.2 Overall Solution Formed
Figure 6
The solution requires only about 0.5 person‑day to enable gray release capability.
Gray release process (Figure 7): release must target gray machines first, and full release proceeds only if gray monitoring is normal.
Figure 7
Automation monitors gray release health; if metrics fail, a warning is issued and full release is blocked.
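The gate can be sketched as a function that compares gray metrics with a baseline and refuses full release when any metric drifts too far. The metric names, the `tolerance` value, and the `GateDecision` type are assumptions for illustration. The article does not state which thresholds the real automation uses.

```python
from dataclasses import dataclass, field


@dataclass
class GateDecision:
    allow_full_release: bool
    warnings: list = field(default_factory=list)


def evaluate_gate(gray_metrics, baseline_metrics, tolerance=0.05):
    """Block full release if any gray metric is missing or drifts beyond tolerance."""
    warnings = []
    for name, baseline in baseline_metrics.items():
        gray = gray_metrics.get(name)
        if gray is None:
            # No gray data means the release cannot be judged, so fail closed.
            warnings.append(f"{name}: no gray data")
        elif abs(gray - baseline) > tolerance:
            warnings.append(f"{name}: gray={gray:.3f} baseline={baseline:.3f}")
    return GateDecision(allow_full_release=not warnings, warnings=warnings)
```

Failing closed when gray data is missing is a design choice of this sketch: a gray machine that receives no traffic tells us nothing about the release.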
3.3 Principle Introduction
How is the gray‑traffic identifier transmitted between systems?
Downward transmission uses a global trace component.
Upward transmission (protocol specific): Dubbo – attachment; HTTP – header; MQ – not needed.
Within a system, the identifier is stored in a global trace memory.
Figure 9
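The propagation pattern can be illustrated with a plain dictionary standing in for the protocol metadata (Dubbo attachments or HTTP headers). The key name `x-gray`, the in-memory `_trace_store`, and all function names are hypothetical. The real trace component is not shown in this article, and eviction from the store is omitted in this sketch.

```python
GRAY_KEY = "x-gray"  # hypothetical metadata key

# Per-process store: trace_id -> True when the trace is known to be gray.
_trace_store = {}


def mark_gray(trace_id):
    _trace_store[trace_id] = True


def is_gray(trace_id):
    return _trace_store.get(trace_id, False)


def inject(trace_id, carrier):
    """Copy the gray flag into outgoing metadata (headers or attachments)."""
    if is_gray(trace_id):
        carrier[GRAY_KEY] = "1"
    return carrier


def extract(trace_id, carrier):
    """Read the gray flag from incoming metadata and remember it locally."""
    if carrier.get(GRAY_KEY) == "1":
        mark_gray(trace_id)
```

The same `inject` and `extract` pair works in either direction, because both a request and a response can carry metadata. A one-way channel such as MQ has no response to carry it back.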
To avoid memory leaks or OOM, the store is bounded in two ways: a cap on the total number of entries (default max concurrent 5124) and timeout-based cleanup (default 60 s, while the RPC tolerance is 30 s).
We cannot tie an entry's lifecycle to request start and end, because a request may be synchronous, asynchronous, callback-based, delivered over MQ, and so on.
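A store with both safeguards might look like the following sketch. The class name, the injectable `clock`, and the choice to evict the oldest entry when the cap is reached are assumptions; the article does not say what happens at the limit. The sketch is also not thread-safe.

```python
import time
from collections import OrderedDict


class BoundedTraceStore:
    """Gray-flag store bounded by entry count and by a fixed time to live."""

    def __init__(self, max_entries=5124, ttl_seconds=60.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        # trace_id -> (expires_at, gray). With a fixed TTL, insertion order
        # is also expiry order, so the oldest entry is always first.
        self._entries = OrderedDict()

    def put(self, trace_id, gray):
        self._evict_expired()
        self._entries.pop(trace_id, None)  # re-insert to keep expiry order
        while len(self._entries) >= self._max:
            self._entries.popitem(last=False)  # assumption: drop the oldest
        self._entries[trace_id] = (self._clock() + self._ttl, gray)

    def get(self, trace_id):
        self._evict_expired()
        entry = self._entries.get(trace_id)
        return entry[1] if entry else False

    def _evict_expired(self):
        now = self._clock()
        while self._entries:
            trace_id, (expires_at, _) = next(iter(self._entries.items()))
            if expires_at > now:
                break
            del self._entries[trace_id]

    def __len__(self):
        return len(self._entries)
```

Because entries expire on their own, nothing has to signal the end of a request, which is why this works for asynchronous and callback flows too.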
Gray traffic monitoring isolation uses the monitoring system’s tag feature; gray traffic is tagged accordingly.
Figure 10
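Tag-based isolation can be illustrated with a small in-memory counter. This is a stand-in for the real monitoring system, whose API is not shown here. The class, the `traffic` tag name, and its `gray` and `normal` values are all hypothetical.

```python
from collections import defaultdict


class TaggedCounter:
    """Counts events per metric name and tag set, as a tag-aware monitor would."""

    def __init__(self):
        self._counts = defaultdict(int)

    def incr(self, metric, gray, **tags):
        # The same metric is reported either way; only the tag differs.
        tags["traffic"] = "gray" if gray else "normal"
        self._counts[(metric, tuple(sorted(tags.items())))] += 1

    def value(self, metric, **tags):
        return self._counts.get((metric, tuple(sorted(tags.items()))), 0)


counter = TaggedCounter()
counter.incr("booking", gray=True, result="success")
counter.incr("booking", gray=False, result="success")
counter.incr("booking", gray=False, result="success")
```

Since gray and normal traffic share metric names and differ only by tag, existing dashboards can be filtered by tag instead of being rebuilt for the gray environment.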
4 Solution Summary
End
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.