How Youzan Implements Traffic Control, Gray and Blue‑Green Deployments with Istio
This article details Youzan's design and implementation of a traffic‑control system built on Istio/Envoy, describing the protocols, architecture, and concrete JSON routing rules for gray releases and blue‑green deployments, along with observability features and future multi‑service release plans.
Background
With the rapid growth of Youzan's user base and services, developers face increasing pressure to keep services stable while iterating quickly. As the number of micro‑service interfaces grows and call chains lengthen, regression testing becomes hard, and validation by testing alone can no longer guarantee stability.
To balance stability and speed, Youzan introduced a new‑version gray‑release strategy: only a few instances of the new version are deployed, traffic is gradually shifted, and full rollout occurs after verification.
Traffic Control System
Protocol Selection
The team weighed several goals: a complete protocol that supports service‑mesh features (circuit breaking, rate limiting, A/B testing), readability, and reuse of mature industry designs. They chose the Istio service‑mesh framework, which uses Envoy as its data plane and supports a JSON‑encoded routing protocol (Envoy v1 API, migrating to the v2 gRPC API).
{
  "name": "java-demo-rule",
  "domains": ["java-demo"],
  "routes": [{
    "headers": [{"name": "userid", "value_match": "123"}],
    "cluster": "java-demo|version=v2"
  }, {
    "weighted_clusters": {
      "clusters": [
        {"name": "java-demo|version=v2", "weight": 10},
        {"name": "java-demo|version=v1", "weight": 90}
      ]
    }
  }]
}
This rule routes requests whose userid header equals 123 to the v2 instances; all other traffic is split 10% to v2 and 90% to v1.
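To make the matching order concrete, here is a minimal Python sketch of how a proxy might evaluate such a rule. It is not Youzan's or Envoy's implementation: the function names header_matches and pick_cluster are hypothetical, and the sketch assumes that routes are tried in order, that the first match wins, and that weights are relative shares.

```python
import random

RULE = {
    "name": "java-demo-rule",
    "domains": ["java-demo"],
    "routes": [
        {
            "headers": [{"name": "userid", "value_match": "123"}],
            "cluster": "java-demo|version=v2",
        },
        {
            "weighted_clusters": {
                "clusters": [
                    {"name": "java-demo|version=v2", "weight": 10},
                    {"name": "java-demo|version=v1", "weight": 90},
                ]
            }
        },
    ],
}


def header_matches(matcher, headers):
    """Return True if the request headers satisfy one header matcher."""
    actual = headers.get(matcher["name"])
    if "value_match" in matcher:
        return actual == matcher["value_match"]
    if "list_match" in matcher:
        return actual in matcher["list_match"]
    # A matcher with only a name just requires the header to be present.
    return actual is not None


def pick_cluster(rule, headers, rng=random):
    """Return the cluster for a request, or None if no route matches.

    Assumption: routes are evaluated top to bottom and the first match wins.
    """
    for route in rule["routes"]:
        matchers = route.get("headers", [])
        if not all(header_matches(m, headers) for m in matchers):
            continue
        if "cluster" in route:
            return route["cluster"]
        clusters = route["weighted_clusters"]["clusters"]
        names = [c["name"] for c in clusters]
        weights = [c["weight"] for c in clusters]
        # Weighted random choice: roughly 10% v2 and 90% v1 for the rule above.
        return rng.choices(names, weights=weights, k=1)[0]
    return None
```

A request carrying userid=123 always lands on v2, while any other request falls through to the weighted route. Passing a seeded random.Random instance as rng makes the split reproducible in tests.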
Architecture
The traffic‑control ecosystem consists of an HTTP gateway (Nginx), a service‑mesh sidecar proxy (Tether), the Dubbo RPC framework, Istio Pilot for rule distribution, and the Ops management system that translates product‑level controls into low‑level routing rules stored as CRDs in Kubernetes.
Gray Release
What Is Gray Release
Gray release deploys a small “canary” cluster alongside the stable cluster and routes a fraction of traffic to it for pre‑production validation. If issues appear, traffic is instantly switched back; otherwise the new version is fully rolled out.
Release Process
Start: User selects “Gray Release” in Ops.
Initialize: Deploy canary instances (10 % of stable capacity) with label canary=true. No traffic is sent until a rule is created.
Validate: Push routing rules. Two rule types are supported:
Shop‑list rule – routes requests from specific shop IDs to the canary.
Percentage rule – routes a configurable percentage (max 10 %) of traffic.
Cancel: If validation fails, delete the rule and take down the canary instantly.
Full Rollout: If validation succeeds, promote the new version to all instances and remove the canary.
End.
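The two rule types in the Validate step can be sketched as small builder functions of the kind an Ops system might use to turn product‑level settings into routing rules. This is an illustration only: the function names are hypothetical, and the percentage rule is assumed to reuse the weighted_clusters form from the protocol example, since the article does not show its exact shape for gray releases.

```python
MAX_CANARY_PERCENT = 10  # cap stated in the release process above


def shop_list_rule(service, shop_ids):
    """Route the given shop IDs to the canary; everything else stays on stable."""
    return {
        "name": f"{service}-rule",
        "domains": [service],
        "routes": [
            {
                "headers": [
                    {"name": "shopid", "list_match": [str(s) for s in shop_ids]}
                ],
                "cluster": f"{service}|canary=true",
            },
            # Fallback route: all other traffic goes to the stable instances.
            {"cluster": f"{service}|canary=false"},
        ],
    }


def percentage_rule(service, canary_percent):
    """Send canary_percent of traffic to the canary (assumed weighted form)."""
    if not 0 < canary_percent <= MAX_CANARY_PERCENT:
        raise ValueError(
            f"canary_percent must be in (0, {MAX_CANARY_PERCENT}], got {canary_percent}"
        )
    return {
        "name": f"{service}-rule",
        "domains": [service],
        "routes": [
            {
                "weighted_clusters": {
                    "clusters": [
                        {"name": f"{service}|canary=true", "weight": canary_percent},
                        {"name": f"{service}|canary=false", "weight": 100 - canary_percent},
                    ]
                }
            }
        ],
    }
```

Rejecting percentages above the cap in the builder keeps an operator from sending more traffic to the canary than its reduced capacity was sized for.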
{
  "name": "java-demo-rule",
  "domains": ["java-demo"],
  "routes": [{
    "headers": [{"name": "shopid", "list_match": ["123", "456"]}],
    "cluster": "java-demo|canary=true"
  }, {
    "cluster": "java-demo|canary=false"
  }]
}
Blue‑Green Release
What Is Blue‑Green Release
Blue‑green release creates a full‑size new cluster (green) in parallel with the existing stable cluster (blue). Traffic is gradually shifted to the green cluster; if problems arise, traffic can be switched back instantly, enabling rapid rollback.
Why Blue‑Green Is Needed
Rolling back is a single switch of all traffic to the old cluster, which is faster than unwinding a gray release step by step.
Blue‑green can handle sudden traffic spikes because both clusters have full capacity.
It exposes issues that only appear under 100 % load (e.g., database deadlocks).
After successful verification, the new cluster becomes the stable one without further changes.
Release Process
Start: User selects “Blue‑Green Release” in Ops.
Initialize: Deploy the green cluster with label BlueGreenVersion=green while routing all traffic to the blue cluster.
Validate: Push routing rules (shop‑list or percentage) to shift part or all traffic to the green cluster.
Cancel: If validation fails, route all traffic back to blue and take down green.
Complete: When all traffic runs on green, decommission the blue cluster.
End.
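The Validate and Cancel steps above amount to pushing a sequence of rules that raise the green share, with a rule that returns everything to blue held in reserve. The sketch below shows one way to generate them. The function names and the default step sizes are hypothetical, and the weighted form is assumed to match the protocol example rather than taken from Youzan's code.

```python
def blue_green_rule(service, green_percent):
    """Build a rule that sends green_percent of traffic to the green cluster."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    green = f"{service}|BlueGreenVersion=green"
    blue = f"{service}|BlueGreenVersion=blue"
    if green_percent == 0:
        route = {"cluster": blue}  # initial state, and the rollback target
    elif green_percent == 100:
        route = {"cluster": green}  # blue can now be decommissioned
    else:
        route = {
            "weighted_clusters": {
                "clusters": [
                    {"name": green, "weight": green_percent},
                    {"name": blue, "weight": 100 - green_percent},
                ]
            }
        }
    return {"name": f"{service}-rule", "domains": [service], "routes": [route]}


def rollout_plan(service, steps=(10, 50, 100)):
    """Return one rule per step; steps must strictly increase and end at 100."""
    if not steps or list(steps) != sorted(set(steps)) or steps[-1] != 100:
        raise ValueError("steps must strictly increase and end at 100")
    return [blue_green_rule(service, percent) for percent in steps]


def rollback_rule(service):
    """Cancel step: route all traffic back to the blue cluster."""
    return blue_green_rule(service, 0)
```

Emitting a plain single‑cluster route at 0% and 100% avoids relying on zero‑weight entries, whose handling may differ between proxy versions.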
{
  "name": "java-demo-rule",
  "domains": ["java-demo"],
  "routes": [{
    "headers": [{"name": "shopid", "list_match": ["123", "456"]}],
    "cluster": "java-demo|BlueGreenVersion=green"
  }, {
    "cluster": "java-demo|BlueGreenVersion=blue"
  }]
}
Observability & Operability
Beyond routing, Youzan built monitoring and alerting for release processes: real‑time QPS, latency, and error‑rate dashboards for both old and new clusters; event notifications via enterprise IM for key milestones; a global release status view; and periodic statistical reports (weekly, monthly, quarterly).
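As a rough illustration of the per‑cluster comparison such dashboards rely on, the sketch below aggregates request samples into QPS, average latency, and error rate for each cluster, then flags a new cluster whose error rate exceeds the old one by more than a threshold. The Sample type, the function names, and the 1% threshold are all assumptions made for this example; the article does not describe how Youzan computes or alerts on these metrics.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    cluster: str       # e.g. "java-demo|canary=true"
    latency_ms: float
    ok: bool           # False if the request failed


def summarize(samples, window_seconds):
    """Aggregate samples from one time window into per-cluster metrics."""
    totals = {}
    for s in samples:
        t = totals.setdefault(s.cluster, {"count": 0, "errors": 0, "latency": 0.0})
        t["count"] += 1
        t["errors"] += 0 if s.ok else 1
        t["latency"] += s.latency_ms
    return {
        cluster: {
            "qps": t["count"] / window_seconds,
            "error_rate": t["errors"] / t["count"],
            "avg_latency_ms": t["latency"] / t["count"],
        }
        for cluster, t in totals.items()
    }


def should_alert(summary, old, new, max_error_rate_gap=0.01):
    """Return True if the new cluster's error rate exceeds the old by the gap."""
    if old not in summary or new not in summary:
        return False  # not enough data to compare
    gap = summary[new]["error_rate"] - summary[old]["error_rate"]
    return gap > max_error_rate_gap
```

Comparing the two clusters over the same window, rather than checking the new cluster against a fixed limit, keeps a release from being blamed for an incident that affects both versions equally.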
Future Plans
Upcoming work includes coordinated multi‑application releases where a single rule controls traffic across several services, and extending traffic control to message‑queue consumption paths, which currently lack fine‑grained routing.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
