Evolution of Zhaozhuan Test Environment Governance: From Physical Isolation to Tag‑Based Traffic Routing
This article describes how Zhaozhuan's testing environment evolved through three versions: physical isolation, automatic IP-tag routing, and manual-tag routing. It details the architectural background, implementation principles, advantages, drawbacks, and supporting tools that dramatically reduced deployment time and resource consumption, while also introducing new operational challenges.
1. Background and Requirements
Zhaozhuan’s system originally used a monolithic architecture with a single web service per node behind Nginx load balancing. As concurrency grew, the architecture shifted to micro‑services, making precise request routing to specific test nodes more complex.
1.1 Evolution of System Architecture
Monolithic architecture could easily direct traffic to a specific node by adjusting Nginx upstream or using direct IP:port. In micro‑service architecture, multiple services (A, B, C) form a longer chain, and simple upstream changes cannot target downstream services individually.
1.2 Testing Environment Requirements
Unlike production where all nodes run identical code, testing involves multiple parallel branches; each node may run different logic, requiring requests to be precisely routed to the intended service instance.
2. Traditional Solution – Physical Isolation
Physical isolation provides a completely separate test environment per requirement, containing all services, a registry, and MQ broker. While simple for a small number of services, it wastes resources when the system scales to hundreds of services.
3. Zhaozhuan Test Environment V1 – Improved Physical Isolation
3.1 Stable Environment
A stable environment mirrors production and runs every service. Test environments do not use a service registry; instead, each service is assigned a unique domain name, and hosts-file entries are edited by hand. For example, the stable service A at 192.168.1.1 is mapped in every test machine's hosts file as 192.168.1.1 A.zhuaninc.com.
3.2 Dynamic Environment
Each requirement gets a dynamic environment on a KVM VM (e.g., IP 192.168.2.1). When service A is deployed in this environment, the hosts entry that pointed at the stable IP is overridden to 127.0.0.1 A.zhuaninc.com, ensuring traffic reaches the dynamic instance.
Request a dynamic environment (e.g., 192.168.4.1) and receive a full hosts file mapping every domain to its stable service.
Deploy service E' and override its domain to 127.0.0.1 in the hosts file.
Deploy services D, C, B, and A' in turn, overriding each domain to 127.0.0.1 in the same way.
Deploy the Entry service.
Deploy Nginx and modify service A's upstream to point only at 127.0.0.1.
This creates a single‑branch pipeline from service E to Nginx, allowing precise routing by host mapping.
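The resulting overrides can be pictured as a hosts file plus an Nginx upstream on the dynamic VM. This is an illustrative sketch: the A.zhuaninc.com domain pattern comes from the article, while the specific port and which services are deployed locally are assumptions.

```
# /etc/hosts on the dynamic environment (e.g., 192.168.4.1), illustrative:
# services still in the stable environment keep their stable IPs,
# while locally deployed services are overridden to 127.0.0.1.
192.168.1.1  C.zhuaninc.com    # stable instance
127.0.0.1    A.zhuaninc.com    # deployed locally (A')
127.0.0.1    E.zhuaninc.com    # deployed locally (E')

# Nginx upstream for service A, pointing only at the local instance
# (port 8080 is an assumed example):
upstream service_a {
    server 127.0.0.1:8080;
}
```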
3.3 Advantages and Disadvantages
Advantages
Strong isolation, comparable to full physical isolation.
Short call chain; all traffic stays on one machine.
Disadvantages
Requires deploying every service from Nginx down to the last service under test, wasting resources.
Deployment order depends on service call relationships, making setup inefficient; debugging an environment could take days.
Complex hosts-file management and error-prone IP-prefixed MQ topics.
A single machine's memory limits how long the deployed service chain can be.
4. Zhaozhuan Test Environment V2 – Automatic IP‑Tag Traffic Routing
To reduce the number of services per dynamic environment (30‑60 → single‑digit) and cut setup time (hours → 30 min‑1 h), an automatic IP‑tag routing solution was introduced. Tags are derived from the VM’s IP, requiring no manual labeling.
Benefits include faster provisioning and fewer services per environment; however, V2 still suffers from VM provisioning latency and KVM memory limits.
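The automatic tagging idea can be sketched as follows. The article only says tags are derived from the VM's IP with no manual labeling; the exact tag format below (dots replaced by underscores) and the method names are assumptions for illustration.

```java
// Sketch of V2's automatic IP-based tagging (format is an assumption).
public class IpTag {
    // Derive the environment's routing tag from its VM IP.
    static String fromIp(String vmIp) {
        return vmIp.replace('.', '_');      // "192.168.4.1" -> "192_168_4_1"
    }

    // A request originating from that VM carries the same derived tag, so
    // an instance deployed on the VM matches it without manual labeling.
    static boolean matches(String requestTag, String instanceVmIp) {
        return requestTag.equals(fromIp(instanceVmIp));
    }
}
```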
5. Zhaozhuan Test Environment V3 – Manual‑Tag Traffic Routing
After dockerizing services to eliminate KVM memory constraints, IP‑based tagging became ineffective because each container has a distinct IP. Manual tags are therefore applied.
5.1 Dockerization
Services run in Docker containers, removing the need for pre‑allocated VM resources and eliminating memory caps.
5.2 Service and Traffic Tagging
When requesting an environment, a tag (e.g., yyy) is assigned. The platform automatically adds the JVM argument -Dtag=yyy to each service. HTTP requests carry the tag via a tag=yyy header, and internal calls inherit the tag automatically.
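A minimal sketch of how a service could learn the two tags involved: its own instance tag (from the -Dtag JVM argument the platform injects) and the traffic tag on an incoming request (from the tag header). The header name and JVM property come from the article; treating the empty string as "untagged/stable" is an assumption.

```java
import java.util.Map;

// Sketch: reading the instance tag and the per-request traffic tag.
public class TagContext {
    // Instance tag: the platform starts dynamic instances with -Dtag=yyy;
    // stable instances have no tag (assumed here to default to "").
    static String instanceTag() {
        return System.getProperty("tag", "");
    }

    // Traffic tag: carried on the HTTP request as a "tag" header.
    static String trafficTag(Map<String, String> headers) {
        return headers.getOrDefault("tag", "");
    }
}
```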
5.3 Target Shape
Only services that need modification are deployed in the dynamic environment; all other services remain in the stable environment.
5.4 RPC Implementation
Service Registration, Discovery, and Invocation
Services register their tag with the registry. When service A calls B, it discovers all B instances (stable, dynamic with tag yyy, dynamic with tag xxx) and selects the one whose tag matches the current request's traffic tag.
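The selection rule above can be sketched as: prefer an instance whose tag equals the traffic tag, otherwise fall back to a stable (untagged) instance. The article's RPC framework is custom, so all names here are illustrative, and representing "stable" as an empty tag is an assumption.

```java
import java.util.List;
import java.util.Optional;

// Sketch of tag-matching instance selection for an RPC call.
public class TagRouter {
    record Instance(String address, String tag) {}  // tag "" = stable

    static Instance select(List<Instance> candidates, String trafficTag) {
        // 1) An exact tag match wins: the request stays inside its
        //    dynamic environment (e.g., tag "yyy").
        Optional<Instance> tagged = candidates.stream()
                .filter(i -> !i.tag().isEmpty() && i.tag().equals(trafficTag))
                .findFirst();
        // 2) Otherwise fall back to a stable instance.
        return tagged.orElseGet(() -> candidates.stream()
                .filter(i -> i.tag().isEmpty())
                .findFirst()
                .orElseThrow());
    }
}
```

With this rule, an untagged request (or a tag with no matching dynamic instance) naturally flows through the stable environment, which is what lets a dynamic environment deploy only the modified services.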
Tag Propagation
The custom RPC framework transmits the tag via an attachment field.
5.5 MQ Message Implementation
Consumption Principle
Both dynamic and stable environments share the same topic but use different consumer groups: dynamic groups are prefixed with ${tag}, stable groups with test_. The MQ client adds the prefix automatically.
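The group-naming rule might look like the sketch below. The test_ prefix for stable groups comes from the article; the underscore separator after the tag and the method name are assumptions.

```java
// Sketch of consumer-group naming: same topic, one group per environment.
public class GroupNaming {
    static String consumerGroup(String baseGroup, String tag) {
        // A non-empty tag identifies a dynamic environment; the stable
        // test environment uses the "test_" prefix instead.
        return (tag == null || tag.isEmpty())
                ? "test_" + baseGroup
                : tag + "_" + baseGroup;
    }
}
```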
Issues
If a dynamic consumer goes offline, messages may be missed because its consumer group's offsets fall behind; the fix is to replay the missed messages when it comes back. Any duplicate consumption this causes is acceptable: RocketMQ only guarantees at-least-once delivery, so consumers must already be idempotent.
Tag Transmission
RocketMQ’s extensible headers carry the routing tag.
5.6 In‑Process Tag Transmission
ThreadLocal
A standard ThreadLocal does not propagate values to child threads at all, and InheritableThreadLocal only copies values at thread-creation time, so neither works with thread pools, whose worker threads are created once and then reused.
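A minimal demonstration of the thread-pool problem (class and method names are illustrative): the pooled worker inherits whatever value existed when the worker thread was created, not the value at task-submission time.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Demo: InheritableThreadLocal does not carry a value across a thread pool.
public class ItlDemo {
    static final InheritableThreadLocal<String> TAG = new InheritableThreadLocal<>();

    // Run a task on the pool and return the tag the worker thread sees.
    static String runInPool(ExecutorService pool) throws Exception {
        return pool.submit(() -> TAG.get()).get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        runInPool(pool);      // worker thread is created here, before any set()
        TAG.set("yyy");       // the traffic tag is set afterwards, on this thread
        System.out.println(runInPool(pool));  // null: the tag never reached the reused worker
        pool.shutdown();
    }
}
```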
TransmittableThreadLocal
Alibaba’s open‑source TransmittableThreadLocal (via Java agent) enables transparent tag propagation across threads and thread pools.
5.7 Auxiliary Facilities
Wildcard Domain Resolution
Instead of configuring hosts entries, domains can embed the tag (e.g., app-${tag}.test.zhuanzhuan.com) and resolve directly to the test Nginx.
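Extracting the tag from such a domain could look like the sketch below, so the test Nginx can stamp the tag onto the request. The app-${tag}.test.zhuanzhuan.com pattern comes from the article; the exact regex and the empty-string fallback for untagged hosts are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pulling the routing tag out of a wildcard test domain.
public class DomainTag {
    private static final Pattern P =
            Pattern.compile("^[^.]+-([^.]+)\\.test\\.zhuanzhuan\\.com$");

    // Returns the embedded tag, or "" if the host carries none.
    static String tagOf(String host) {
        Matcher m = P.matcher(host);
        return m.matches() ? m.group(1) : "";
    }
}
```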
Web Shell
A web‑based shell allows one‑click login to the container’s log directory without manual IP entry.
Debug Plugin
The plugin reads the service name and tag, queries the environment platform for the debug port, and automatically connects to the correct container.
6. Distributed Call Tracing System
To diagnose routing failures (e.g., a request expected to flow D → E' that leaves no logs on E'), a tracing system records the entry and exit points of each module, generating spans with a TraceId and SpanId. Zhaozhuan's system combines a custom client (Radar) that pushes spans to a Collector, which writes them to Kafka; Zipkin consumes the data and provides the UI.
TraceId is injected into MDC via SLF4J, printed in logs, and returned to the front‑end via HTTP headers for quick lookup.
At key routing nodes, both the traffic tag (global.route.context.tag) and the instance tag (global.route.instance.tag) are recorded, allowing verification that requests were routed correctly.
7. Summary
Zhaozhuan’s test environment governance progressed through three versions: physical isolation, automatic‑IP‑tag routing, and manual‑tag routing. Physical isolation required days and 30‑60 services per environment; automatic IP tagging reduced this to 7‑8 services and ~30 min‑1 h setup; manual tagging further cut services to 3‑4 and setup to 2‑5 min, saving ~65% memory.
While tag‑based routing improves efficiency and reduces resource usage, it introduces complexities such as longer link chains, changing IPs, and the need for supporting tools like tracing, wildcard DNS, web shell, and debug plugins. Overall, the project received company awards for cost reduction and operational efficiency.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.