How Our Team Built a Stable SIT Environment: Lessons in Test Environment Governance
This article documents the step‑by‑step practices of a six‑person test‑environment availability team that unified middleware, streamlined deployment pipelines, piloted business usage, introduced monitoring and recovery mechanisms, and created a comprehensive SIT environment handbook to improve integration testing stability and operational efficiency.
Testing environments are essential for every software development team. Although underlying capabilities differ from company to company, the approaches to building environment availability on top of them are broadly similar. This article records the practices of our internal offline environment availability group as a stage-wise summary for future reference.
1. Preface
Like most companies adopting DevOps, our test-environment governance goes hand in hand with DevOps practice: governance generally precedes DevOps, while DevOps in turn accelerates governance. The process spans both the hardware and software capability-building phases.
The concrete construction content is illustrated in Figure 1.
Our test‑environment construction started long ago and grew alongside daily product development. Although multiple environments existed, many required capabilities were not fully built or used. For example, the integration testing environment (SIT) was underutilized, leading to:
Broken upstream/downstream test chains and unverified scenarios, causing missed tests and production incidents.
Unstable test environments, resulting in five‑minute tests followed by two‑hour investigations, hurting development efficiency.
In response, a six‑person offline environment availability team was formed in the second half of last year, initially called the SIT environment availability group, with the goal of stabilizing the integration testing environment. With hardware capability largely completed, the focus shifted to software capability construction, while hardware would continue iterative updates.
2. Driving SIT Environment Construction
After the team was formed, we began with a SIT environment assessment and then carried out several key actions.
2.1 Unifying and Adapting Environment Components
Field visits and legacy documentation revealed that most hardware work was near completion, leaving only historical issues and non‑core processes to polish.
Two main problems were identified:
Middleware components were isolated across offline environments, leading to high configuration and maintenance costs.
The call chain between front-end projects and their corresponding back-end services had not yet been connected.
We addressed these by:
Re‑organizing underlying components so that offline environments share a single set of middleware and storage (Redis, Memcached, MySQL, Mongo, Kafka, etc.). The relationship between internal network and infrastructure is shown in Figure 2.
Modifying the front‑end version‑selection feature and the front‑end code‑release system to allow front‑end deployments to switch and access SIT back‑end services.
Explanation:
The "environment" column represents the daily business release flow; the perf environment is optional, and after prod release the stable environment is automatically deployed.
Stable shares the same middleware and storage as feature, SIT, and perf.
Offline environments isolate middleware logic but share data storage; online environments also isolate middleware but share storage; offline and online environments are physically isolated in both middleware and storage.
The business layer diagram shows all back‑end services and their interactions.
Technical details of the support layer, inter‑service calls, and overall data flow are omitted.
These actions made the connection and flow between offline environments smooth and laid the foundation for chained releases (functional test, integration test, pre‑release, production, stable).
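The isolation model described above (shared clusters, logically separated per environment) can be pictured with a minimal sketch. All names here are illustrative, not our actual configuration: the idea is that every offline environment (feature, SIT, perf, stable) points at the same middleware instances, while resources such as Kafka topics or Redis keys are namespaced per environment so they never collide.

```python
# Hypothetical sketch of logical middleware isolation on shared clusters.
# One Redis/Kafka/MySQL deployment serves all offline environments;
# each environment gets its own namespace prefix for logical resources.

OFFLINE_ENVS = ["feature", "sit", "perf", "stable"]

def namespaced(env: str, resource: str) -> str:
    """Prefix a logical resource name with its environment namespace."""
    if env not in OFFLINE_ENVS:
        raise ValueError(f"unknown offline environment: {env}")
    return f"{env}:{resource}"

# The same logical topic, isolated per environment on one shared cluster:
assert namespaced("sit", "order-events") == "sit:order-events"
assert namespaced("feature", "order-events") == "feature:order-events"
```

A prefix convention like this keeps configuration cost low: adding a new offline environment only extends the namespace list rather than provisioning another middleware stack.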
2.2 Business Pilot Promotion
When promoting SIT, each business unit operated independently. For example, Business Group A faced three issues:
SIT was not mandated as the integration test environment, so full integration verification often did not occur there.
Applications deployed to SIT could have their branches switched at any time.
Some services and dependent services from other groups were not present in SIT.
These gaps meant SIT was underused, akin to a newly built highway without traffic rules.
After identifying the problems, we selected a few business groups for a SIT pilot, requiring them to list core and non‑core applications to be deployed on the master branch, as shown in Table 1.
Comprehensive business regression testing on SIT then uncovered issues such as missing dependent services, mismatched SOA versions, or incorrect database configurations. After resolving these, the experience was applied to the remaining groups, culminating in full‑chain regression for all business units.
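The kinds of issues the pilot regression surfaced (missing dependent services, mismatched SOA versions) lend themselves to an automated pre-check. A minimal sketch, with all service names and versions invented for illustration:

```python
# Illustrative pre-regression check: before running a full-chain regression
# on SIT, verify that every dependency an application declares is actually
# deployed there at a compatible version.

DEPLOYED = {"order-service": "1.4.0", "user-service": "2.1.0"}  # what SIT runs

def missing_or_mismatched(required: dict) -> list:
    """Report dependencies absent from SIT or deployed at the wrong version."""
    problems = []
    for svc, version in required.items():
        if svc not in DEPLOYED:
            problems.append(f"{svc}: not deployed on SIT")
        elif DEPLOYED[svc] != version:
            problems.append(f"{svc}: expected {version}, found {DEPLOYED[svc]}")
    return problems

assert missing_or_mismatched({"order-service": "1.4.0"}) == []
assert missing_or_mismatched({"pay-service": "3.0.0"}) == ["pay-service: not deployed on SIT"]
```

Running such a check before each regression cycle turns "five-minute tests followed by two-hour investigations" into a report that points directly at the missing or mismatched service.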
2.3 Application Deployment Process Bottlenecks
Relying solely on manual checks or subjective process constraints would eventually lead to disorder in SIT deployments.
We therefore collaborated with the DevOps team to remodel the back‑end application release management system: SIT deployment became the first step of the release pipeline, while stable environment updates were moved to the final step, automatically triggered only after a successful prod deployment.
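The reworked ordering can be modeled as a simple stage machine, a sketch under the assumption that the pipeline runs SIT first and stable last (stage names are illustrative):

```python
# Hypothetical model of the remodeled release pipeline: SIT deployment is
# the mandatory first step; the stable environment is updated automatically
# only after a successful prod deployment.

PIPELINE = ["sit", "pre-release", "prod", "stable"]

def next_stage(current: str, succeeded: bool):
    """Advance to the next stage only on success; a failure blocks the rest."""
    if not succeeded:
        return None
    i = PIPELINE.index(current)
    return PIPELINE[i + 1] if i + 1 < len(PIPELINE) else None

assert next_stage("prod", succeeded=True) == "stable"  # stable auto-triggers
assert next_stage("prod", succeeded=False) is None     # stable never updates on failure
```

Encoding the ordering in the release system, rather than in process documents, is what removes the dependence on manual checks mentioned above.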
We also introduced a “gentlemen’s agreement”: while developers may deploy any branch to SIT for temporary verification, they must ensure service stability.
Deployment statistics are tracked in Table 2, and abnormal cases are corrected and communicated.
2.4 Front‑End Code Deployment Standards
The three actions above primarily targeted back‑end services; front‑end deployment follows the same pattern, though historically the front‑end had many deployment methods. The current state is shown in Table 3, and the front‑end platform team is working on unifying the Pub release process.
2.5 Environment Monitoring and Recovery Mechanism
Since SIT began supporting full‑chain business flow, various environment issues emerged. We set up a feedback group and Confluence documentation to record problems, forming a fault knowledge base for future troubleshooting and fine‑grained monitoring.
The top three issue categories in 2019 Q4 (38 records) were service unavailability, middleware instability, and API 500 errors; in 2020 Q1 (15 records) they were service unavailability, API 500 errors, and stalled service deployments. The full classification is shown in Figure 7.
Initially, problem discovery relied on manual reporting in a group chat, followed by round‑robin support to locate owners and resolve issues—a low‑efficiency process.
We now use a refined workflow (Figure 8) that adds fine‑grained monitoring, pushes alerts to responsible owners, and leverages CMDB data for accurate owner information.
Monitoring metrics include system health and non‑200 API responses, with customizable sensitivity per service, addressing delayed detection and slow owner identification.
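The non-200 alerting rule can be sketched as follows. Everything here is illustrative (service names, thresholds, and the CMDB stand-in are invented): each service has its own configurable sensitivity, and alerts are routed to the owner recorded in CMDB rather than broadcast to a group chat.

```python
# Sketch of threshold-based non-200 monitoring with per-service sensitivity
# and owner lookup. The dicts below stand in for real config and CMDB data.

from collections import Counter

SENSITIVITY = {"order-service": 3, "user-service": 10}  # per-service thresholds
DEFAULT_THRESHOLD = 5
CMDB_OWNERS = {"order-service": "alice", "user-service": "bob"}

def alerts(responses):
    """Return (service, owner) pairs whose non-200 count meets the threshold.

    `responses` is a list of (service, http_status) observations.
    """
    errors = Counter(svc for svc, status in responses if status != 200)
    out = []
    for svc, count in errors.items():
        if count >= SENSITIVITY.get(svc, DEFAULT_THRESHOLD):
            out.append((svc, CMDB_OWNERS.get(svc, "unknown")))
    return out

sample = [("order-service", 500)] * 3 + [("user-service", 200)]
assert alerts(sample) == [("order-service", "alice")]
```

Per-service sensitivity matters because a flaky non-core service and a core transaction service should not share one alerting threshold; the CMDB lookup is what closes the "slow owner identification" gap.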
2.6 Environment Usage Manual
The SIT promotion and maintenance process is both procedural and cultural. To help new team members, we created a "SIT Environment Usage Manual" that captures successful patterns and failure lessons, covering environment introduction, configuration, fault knowledge base, and application owner maintenance.
The manual’s structure is illustrated in the mind‑map below.
3. Conclusion
Through the actions described, the SIT environment now fulfills its integration testing role. The manual is accessible via the team’s problem‑handling group and the internal test‑management portal. Maintaining high availability for SIT remains a long‑term effort, and the experience will be extended to other offline environments in future work.
Qunhe Technology Quality Tech