Evolution of Zhuanzhuan's Test Environments: From Monolithic Setups to Docker‑Based Dynamic and Stable Platforms
This article details how Zhuanzhuan transformed its testing infrastructure from a handful of monolithic servers into a Docker‑driven, tag‑routed platform of dynamic and stable environments. The new architecture resolved resource shortages, waste, and stability issues while delivering significant reductions in deployment time, resource consumption, and user‑reported problems.
1 Test Environment Evolution
1.1 Monolithic Environment
When Zhuanzhuan was founded in 2017, it operated five 64 GB machines that hosted five complete test environments, enough for daily development and testing needs. One machine was assigned to developers and the others to testers, and conflicts between parallel branches were resolved through coordination.
1.2 Dynamic Environment + Stable Environment
With rapid micro‑service expansion, the number of services and parallel branches grew, making the shared monolithic environment untenable. A new model was introduced: a dynamic environment that deployed only the modified services and required supporting services, and a stable environment that mirrored production. An environment platform was built to manage the full lifecycle from request to reclamation, partially satisfying the new demands.
Problem: after a request entered the stable environment, it could not reach services in the dynamic environment, forcing all upstream services, MQ producers, etc., to be deployed on the test machine even when unchanged, leading to rapidly increasing resource consumption as the cluster grew.
1.3 Dynamic Environment + Stable Environment (IP Routing)
To prioritize traffic to the dynamic environment and fall back to the stable one only when necessary, IP routing was implemented as a lane identifier. This reduced resource usage by about 30 % after launch. Over the next two years, despite hardware shortages and supply constraints, the solution kept the testing platform functional, though new issues began to surface.
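The fallback behavior described above can be sketched as a small routing function: prefer the dynamic environment's IP when the service was redeployed there, otherwise fall back to the stable environment. This is an illustrative sketch only; the names (`resolve`, `STABLE_POOL`) are assumptions, not Zhuanzhuan's actual implementation.

```python
# Hypothetical sketch of IP lane routing. STABLE_POOL and the addresses
# are made-up examples, not real Zhuanzhuan infrastructure.
STABLE_POOL = {"serviceA": "10.0.0.10", "serviceB": "10.0.0.11"}

def resolve(service, lane_ips):
    """Prefer the dynamic environment's instance; fall back to stable."""
    if service in lane_ips:           # service was redeployed in this lane
        return lane_ips[service]
    return STABLE_POOL[service]       # unchanged service: stable environment

# A dynamic environment only deploys the modified service (serviceB here);
# everything else transparently falls back to the stable environment.
lane = {"serviceB": "192.168.5.1"}
print(resolve("serviceB", lane))  # dynamic instance
print(resolve("serviceA", lane))  # stable instance
```

Because only modified services need a lane entry, unchanged upstream services no longer have to be duplicated into every test environment, which is where the ~30 % resource saving came from.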
2 Problems in Environment Usage
The three main tension points are system stability, resource cost, and usage efficiency. Limited procurement budgets keep old machines in service, causing stability issues. Insufficient resources prevent maintaining a 30 % free‑memory threshold in stable environments, leading to frequent reclamation that hurts user experience. The existing architecture cannot simultaneously satisfy all three goals.
2.1 Insufficient Resources
Business growth and cluster expansion, combined with hardware procurement difficulties, have tightened resources. The test pool holds 3.8 TB of memory, with peak utilization at 80 %. Machines with more than 40 GB are scarce, scattered across about 20 physical hosts.
2.2 Resource Waste
Fixed‑size memory allocations prevent automatic scaling. As services evolve, duplicate instances accumulate in both dynamic and stable environments, and reclaimed resources cannot be returned to the pool.
2.3 Stability Issues
Machine stability – aging, out‑of‑warranty hardware frequently fails (≈5 machines per month during peak periods), directly impacting business.
System stability – the 7‑8 step initialization can fail at any stage, and legacy configuration replacements (DB, Redis, MQ, ZK) are error‑prone.
Operational overhead – service addition/removal requires manual Nginx and host updates, leading to mistakes.
Scaling – insufficient memory prevents automatic scaling; manual re‑provisioning is time‑consuming.
Ecosystem – KVM‑based resources have poor ecosystem support and high maintenance cost.
These issues generate roughly 25 environment‑related tickets per week, consuming about 8 hours of ops time. To mitigate, a suite of admin tools (error analysis, VM restart, resource alerts, health monitoring, migration utilities) was built, but overall maintenance cost remains high and user satisfaction low.
3 Solution: Dynamic + Stable Environment with Tag Routing
3.1 Solution Overview
Underlying Architecture Change – Adopt Docker + stable environment; replace IP routing with tag routing. An environment now consists of multiple Docker containers and IPs (e.g., environment yyy contains services B and D with IPs 192.168.5.1 and 192.168.6.1).
Docker and agent initialization steps are eliminated. Environment size is no longer bound by a single host; a single environment can host all Zhuanzhuan services. Leveraging Kubernetes, a new node is spun up for each deployment, and the old node is retired once the new one is healthy, ensuring zero‑downtime deployments.
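Tag routing can be sketched as a registry lookup keyed by (service, tag) instead of by IP: a request carries its environment tag, and any service not deployed under that tag falls back to the stable lane. The registry structure and names below are assumptions for illustration, not the real platform API.

```python
# Hypothetical tag-routing sketch. The registry contents mirror the
# example from the article: environment "yyy" contains services B and D.
registry = {
    ("serviceB", "yyy"): "192.168.5.1",
    ("serviceD", "yyy"): "192.168.6.1",
    ("serviceA", "stable"): "10.0.0.10",
    ("serviceB", "stable"): "10.0.0.11",
}

def route(service, tag):
    """Resolve (service, tag); unchanged services fall back to stable."""
    addr = registry.get((service, tag))
    return addr if addr is not None else registry[(service, "stable")]

print(route("serviceD", "yyy"))  # deployed under the tag
print(route("serviceA", "yyy"))  # not under the tag: stable fallback
```

The key difference from IP routing is that the tag survives redeployments: containers come and go with new IPs, but the (service, tag) key stays stable, so nothing downstream needs to be updated when a container is replaced.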
Engineering Standardization – Upgrade RD services to use production‑grade configurations, removing the need for platform‑level config replacement.
Nginx Centralization – Remove per‑environment Nginx instances and use a centralized Nginx managed by operations.
Host Configuration – Eliminate unnecessary shared hosts; migrate RPC host calls to a service‑management platform and resolve remaining hosts via internal DNS.
New Issues and Mitigations
Tag routing introduces new challenges: fixed IPs are replaced by tags, and container IPs change on every deployment, affecting host configuration, login, log access, and unit testing. The following measures were applied:
Wildcard domain support – e.g., app-${tag}.zhuanzhuan.com resolves to the appropriate tag.
Whistle configuration addition – route requests through a central Nginx with tag‑based filters.
WebShell integration – abstracts away IP changes for developers.
Historical log query feature – retains access to logs after IPs are recycled.
Local tag‑routing UI – replaces manual IP entry in unit tests; an IDE plugin links the environment platform for remote debugging.
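The wildcard‑domain measure above amounts to extracting the tag from the hostname so the central Nginx can apply the right tag filter. A minimal sketch, assuming the app-${tag}.zhuanzhuan.com convention mentioned earlier (the regex and function name are illustrative):

```python
import re

# Assumed hostname convention: app-${tag}.zhuanzhuan.com
WILDCARD = re.compile(r"^app-(?P<tag>[a-z0-9-]+)\.zhuanzhuan\.com$")

def tag_of(host):
    """Extract the environment tag from a wildcard test hostname."""
    m = WILDCARD.match(host)
    return m.group("tag") if m else None

print(tag_of("app-yyy.zhuanzhuan.com"))  # the tag
print(tag_of("app.zhuanzhuan.com"))      # no tag: route to stable
```

In practice this logic would live in the centralized Nginx (e.g. via a `map` on `$host`), with the extracted tag injected as a request header for downstream tag‑based routing.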
New Operational Mode – The minimal management node shifts from a KVM host to a single service within a tag. After a test service is promoted, the platform syncs the latest code to the stable environment and automatically deletes the test tag, reclaiming resources.
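The promote‑then‑reclaim lifecycle described above can be sketched as follows. Class and method names (`EnvPlatform`, `promote`) are hypothetical, chosen only to illustrate the flow: sync the tag's services into the stable environment, then delete the tag so its containers are reclaimed.

```python
# Hypothetical sketch of the tag lifecycle; not the real platform API.
class EnvPlatform:
    def __init__(self):
        self.stable = {}   # service -> code version in the stable environment
        self.tags = {}     # tag -> {service: version} in dynamic environments

    def deploy(self, tag, service, version):
        """Deploy a service version into a tagged dynamic environment."""
        self.tags.setdefault(tag, {})[service] = version

    def promote(self, tag):
        """Sync the tag's services to stable, then reclaim the tag."""
        for service, version in self.tags[tag].items():
            self.stable[service] = version
        del self.tags[tag]   # tag deleted: its containers are reclaimed

platform = EnvPlatform()
platform.deploy("yyy", "serviceB", "v2")
platform.promote("yyy")
print(platform.stable)  # serviceB now at v2 in stable
print(platform.tags)    # tag yyy is gone
```

Because reclamation is tied to promotion rather than to a manual ticket, idle dynamic environments stop accumulating, which is what closes the resource‑waste loop described in section 2.2.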
Results
User‑reported issues dropped by 95 %, and large‑scale testing no longer encounters environment problems.
Environment application (request‑to‑ready) time fell from 28 minutes to under 5 minutes.
Resource consumption fell from 3200 GB to 1200 GB.
Conclusion
Within one month of finalizing the design, development was completed; three months later the service upgrade was finished, and a full year after rollout the solution was fully adopted. Docker‑based environments transformed the testing ecosystem: environments are provisioned instantly, no manual effort or downtime is required, and the minimal management unit is now a single service. The underlying architecture is considered industry‑leading, delivering balanced improvements in resource usage, performance, and efficiency, with no foreseeable need for major structural changes.
About the author
Chen Qiu, Zhuanzhuan Engineering Efficiency Lead, responsible for configuration management and DevOps ecosystem. Feel free to leave comments and share knowledge.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.