Evolution of Zhuanzhuan's Test Environments: From Monolithic Setups to Docker‑Based Dynamic and Stable Environments
This article details how Zhuanzhuan’s testing environment progressed from a handful of static machines to a Docker‑driven dynamic‑and‑stable architecture, addressing resource shortages, stability issues, and operational inefficiencies through IP routing, tag routing, and extensive automation, ultimately achieving significant reductions in resource usage, deployment time, and user‑reported problems.
1 Test Environment Evolution
Testing environments are a core component for any software company. Zhuanzhuan’s testing environment has evolved from a few static setups to a flexible Docker‑based dynamic and stable environment system, adapting to cluster expansion and new business demands.
1.1 Monolithic Environment
In 2017, Zhuanzhuan started with five 64 GB machines forming five complete test environments, sufficient for daily needs. One machine was allocated to developers and the rest to testers, with conflicts resolved through coordination.
1.2 Dynamic + Stable Environments
As micro‑services expanded, parallel branch development increased, and shared environments caused interference. A new model introduced dynamic environments for modified services and stable environments mirroring production. An environment platform managed the full lifecycle from request to reclamation, partially meeting the needs.
Problem: once a request entered the stable environment, its downstream calls could not be routed back to services in the dynamic environment. Every upstream service, MQ producer, and similar dependency therefore had to be deployed on the test machine as well, and resource consumption rose sharply as the cluster grew.
1.3 Dynamic + Stable Environments (IP Routing)
To send traffic to the dynamic environment first and fall back to the stable environment only when necessary, IP routing was introduced, with the IP serving as the lane identifier. This reduced resource usage by about 30 %.
Despite the improvement, issues persisted as hardware shortages and scaling pressures continued.
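The dynamic-first, stable-fallback lookup described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not Zhuanzhuan's implementation: the registry layout, the resolve_instance function, the STABLE_LANE key, the service names, and every address are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical registry layout: service name -> {lane id -> instance address}.
# The lane id is the IP of a dynamic environment; STABLE_LANE marks the
# stable environment. None of these names come from the article.
STABLE_LANE = "stable"

Registry = Dict[str, Dict[str, str]]

REGISTRY: Registry = {
    # "order" was modified, so it exists in the dynamic lane 10.0.1.5 too.
    "order": {STABLE_LANE: "10.0.0.10:8080", "10.0.1.5": "10.0.1.5:8080"},
    # "user" was not modified, so only the stable copy exists.
    "user": {STABLE_LANE: "10.0.0.11:8080"},
}


def resolve_instance(registry: Registry, service: str, lane_ip: Optional[str]) -> str:
    """Prefer the instance in the caller's lane, else fall back to stable."""
    lanes = registry.get(service, {})
    if lane_ip is not None and lane_ip in lanes:
        return lanes[lane_ip]
    if STABLE_LANE in lanes:
        return lanes[STABLE_LANE]
    raise LookupError(f"no instance registered for service {service!r}")
```

With this shape, only the modified services need a copy in the dynamic lane; every other call resolves to the shared stable instance, which is where the resource saving comes from.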
2 Problems in Environment Usage
Three main trade‑offs emerged: system stability, resource cost, and usage efficiency. Limited procurement prevented retiring old machines, leading to stability problems. Insufficient resources kept test machine utilization high, preventing the stable environment from maintaining a 30 % memory headroom, which in turn hurt stability. Strict reclamation policies also degraded user experience.
2.1 Resource Shortage
Business and cluster growth, combined with procurement delays, left the test pool at 3.8 TB with an 80 % peak usage, and machines with >40 GB memory were hard to obtain.
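As a rough illustration of how tight these numbers are, the snippet below compares the 80 % peak usage with the 30 % memory headroom mentioned in section 2. The article states that target only for the stable environment; applying it to the whole 3.8 TB pool is an assumption made here purely for the arithmetic, and the function name is hypothetical.

```python
POOL_TB = 3.8           # total test pool memory, from the article
PEAK_USAGE = 0.80       # peak usage ratio, from the article
TARGET_HEADROOM = 0.30  # headroom target; pool-wide use is an assumption


def headroom_gap_tb(pool_tb: float, peak_usage: float, target_headroom: float) -> float:
    """Memory (TB) that would have to be freed at peak to reach the target headroom."""
    allowed_usage_tb = pool_tb * (1.0 - target_headroom)
    peak_usage_tb = pool_tb * peak_usage
    return max(0.0, peak_usage_tb - allowed_usage_tb)


GAP_TB = headroom_gap_tb(POOL_TB, PEAK_USAGE, TARGET_HEADROOM)
```

Under that assumption the pool would need roughly 0.38 TB freed at peak to leave 30 % headroom.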
2.2 Resource Waste
Fixed‑size memory allocations prevented automatic scaling. As services were updated, duplicate containers accumulated in both dynamic and stable environments, and reclaimed resources could not be returned to the pool.
2.3 Stability Issues
Hardware reliability: aged, out‑of‑warranty machines often failed, causing direct business impact.
Deployment complexity: a 7‑8 step initialization process could fail at any stage, and configuration replacements for databases, Redis, MQ, ZK were error‑prone.
Manual host and Nginx adjustments increased the chance of mistakes.
Lack of automatic scaling required manual environment recreation, raising time costs.
KVM‑based solutions had high maintenance overhead.
These issues generated roughly 25 environment‑related tickets per week and consumed about 8 hours of ops time. To mitigate them, the team built tools for error analysis, VM restart, resource alerts, health monitoring, and migration.
3 Solution: Dynamic + Stable Environments (Tag Routing)
3.1 Architecture Changes
The platform was redesigned using Docker and stable environments, replacing IP routing with tag routing. An environment now consists of multiple Docker containers and IPs (e.g., environment yyy contains services B and D with IPs 192.168.5.1 and 192.168.6.1).
Image initialization and agent setup were eliminated. Environment size is no longer bound by a single host; a single environment can host all services. Leveraging Kubernetes, a new node is added during deployment and the old one is drained, ensuring zero‑downtime.
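Tag routing as described above can be sketched as a lookup that prefers instances carrying the request's tag and otherwise uses the stable instance. This is a minimal sketch, not the platform's actual router: the Instance type, the route and trace_chain functions, services A and C, and all stable addresses are hypothetical. Only tag yyy with services B and D at 192.168.5.1 and 192.168.6.1 comes from the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    service: str
    ip: str
    tag: Optional[str]  # None marks a stable-environment instance


INSTANCES: List[Instance] = [
    Instance("A", "10.0.0.1", None),        # hypothetical stable instances
    Instance("B", "10.0.0.2", None),
    Instance("C", "10.0.0.3", None),
    Instance("D", "10.0.0.4", None),
    Instance("B", "192.168.5.1", "yyy"),    # environment yyy from the article
    Instance("D", "192.168.6.1", "yyy"),
]


def route(instances: List[Instance], service: str, tag: Optional[str]) -> Instance:
    """Pick the instance with the request's tag, else the stable instance."""
    candidates = [i for i in instances if i.service == service]
    if tag is not None:
        for inst in candidates:
            if inst.tag == tag:
                return inst
    for inst in candidates:
        if inst.tag is None:
            return inst
    raise LookupError(f"no instance for service {service!r}")


def trace_chain(instances: List[Instance], chain: List[str], tag: Optional[str]) -> List[str]:
    """Resolve every hop of a call chain, carrying the same tag along."""
    return [route(instances, service, tag).ip for service in chain]
```

Because the tag travels with the request rather than being tied to one host, a request tagged yyy that passes through stable services A and C still reaches the tagged copies of B and D.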
Engineering Standardization
Developers (RD) upgraded their services so that test configurations follow production‑like settings, which removed the need for platform‑level configuration replacement.
Centralized Nginx
Per‑environment Nginx instances were removed. A centralized Nginx now handles routing, which eliminated the errors that came from generating per‑environment Nginx configuration.
Host Configuration Simplification
Unnecessary public hosts were deleted, RPC calls were migrated to a service‑management platform, and remaining hosts were resolved via internal DNS.
New Challenges and Mitigations
Tag routing introduced new concerns: an environment is now identified by its tag rather than by a single IP, and container IPs change with each deployment, which affected host configuration, login, log access, and unit testing. Solutions included wildcard sub‑domains (e.g., app‑${tag}.zhuanzhuan.com), Whistle routing rules, webshell access, historical log queries, and a tag‑based unit‑test helper. An IDEA plugin later addressed the changing IPs for remote debugging.
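The wildcard sub-domain convention can be illustrated with a small helper that maps between a tag and its host name. This is a sketch under assumptions: only the app-${tag}.zhuanzhuan.com pattern comes from the article, while the function names, the allowed tag characters (lowercase letters and digits), and the handling of ports and letter case are hypothetical choices.

```python
import re
from typing import Optional

BASE_DOMAIN = "zhuanzhuan.com"
APP_PREFIX = "app"

# Assumption: a tag consists of lowercase letters and digits only.
_TAG_HOST = re.compile(
    re.escape(APP_PREFIX) + r"-([a-z0-9]+)\." + re.escape(BASE_DOMAIN)
)


def host_for_tag(tag: str) -> str:
    """Build the wildcard sub-domain for a tag."""
    return f"{APP_PREFIX}-{tag}.{BASE_DOMAIN}"


def tag_from_host(host: str) -> Optional[str]:
    """Extract the tag from a Host header value, or None if it has no tag."""
    hostname = host.split(":", 1)[0].lower()  # drop an optional port
    match = _TAG_HOST.fullmatch(hostname)
    return match.group(1) if match else None
```

A centralized proxy could use a lookup like tag_from_host to decide which tagged environment a request belongs to, so users never need to know the container IPs.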
New Operational Model
The smallest managed unit shifted from a KVM host to a service within a tag. After a test service is promoted, the platform syncs the latest code to the stable environment and removes the test tag, reclaiming resources automatically.
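The promote-then-reclaim flow can be sketched with a toy in-memory model. Everything here is hypothetical (the EnvironmentPlatform class, its fields, the method names, and the version strings); it only mirrors the two steps named above, syncing the promoted code to stable and removing the tag.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EnvironmentPlatform:
    # service -> version running in the stable environment
    stable: Dict[str, str] = field(default_factory=dict)
    # tag -> {service -> version} running in that tagged environment
    tagged: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def deploy(self, tag: str, service: str, version: str) -> None:
        """Deploy a service version into a tagged (dynamic) environment."""
        self.tagged.setdefault(tag, {})[service] = version

    def on_promoted(self, tag: str) -> int:
        """Sync the tag's services to stable, drop the tag, return containers freed."""
        services = self.tagged.pop(tag, {})
        self.stable.update(services)
        return len(services)
```

Tying reclamation to promotion means no one has to remember to release an environment, which is one way to avoid the leftover duplicate containers described in section 2.2.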
Results
User‑reported issues dropped by 95 % and large‑scale tests saw virtually no environment problems.
The time to request an environment fell from 28 minutes to under 5 minutes.
Resource consumption fell from 3200 GB to 1200 GB.
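For reference, the relative size of these improvements follows directly from the figures above. The helper name is hypothetical, and because the new request time is reported only as "under 5 minutes", the time reduction computed here is a lower bound.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100.0


MEMORY_REDUCTION = reduction_pct(3200, 1200)  # GB figures from the article
TIME_REDUCTION_MIN = reduction_pct(28, 5)     # lower bound: "under 5 minutes"
```

This works out to a 62.5 % drop in memory consumption and a reduction of at least about 82 % in request time.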
Conclusion
After one month of design, three months of service upgrades, and a year of full rollout, Zhuanzhuan achieved substantial gains in architecture, operations, and engineering efficiency. Docker‑based environments now provide fast, interruption‑free testing, and the team regards the resulting resource, performance, and efficiency improvements as industry‑leading.
About the author
Chen Qiu, Zhuanzhuan Engineering Efficiency Lead, responsible for configuration management and DevOps ecosystem.