How We Built a Stable Offline Testing Environment with Cloud‑Native Practices
This article details the challenges of managing a complex, multi‑layered offline testing environment at KuJiaLe, outlines the standardization of baseline, functional, and integration environments, and explains the comprehensive stability measures—including infrastructure upgrades, automated checks, emergency response, and daily operations—that dramatically improved reliability.
Environment Construction Background
KuJiaLe's front‑ and back‑end architecture follows a typical micro‑service model with front‑end, middle‑end, and infrastructure layers. Service dependencies are intricate and increase as business grows, making offline environments more complex than online ones and harder to maintain.
The tool front‑end carries extensive business logic and algorithms, consisting of:
kaf framework, which includes common components and operations
business micro‑applications with tangled dependencies
The overall structure is highly layered, cyclically dependent, and tightly coupled, which adds difficulty to environment governance.
Offline environments are a crucial foundation in the product development cycle, tightly linked to every stage from requirement analysis to release, directly affecting iteration smoothness.
Challenges
As business expands, the number of services and offline test environments has surged, raising daily maintenance difficulty and stability requirements.
Offline Environment Standardization
In the early days of environment governance, containerization was not yet widely adopted, so building an environment meant many manual configurations and resource requests.
Long, unstable dependency chains and the lack of unified usage standards caused parallel test streams to block one another.
The root causes were limited middleware and infrastructure capabilities and the absence of clear standards.
With the introduction of SOA routing, we defined a baseline (stable) environment whose version tag is "default"; any request that carries no matching version tag is routed to this baseline.
Feature (fe) environments are layered on top of the baseline and reuse its services, databases, and other components; integration test (sit) environments share the full set of services, databases, and middleware with the baseline.
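A minimal sketch of that fallback rule, assuming each request carries an environment-version tag; the registry shape and service names below are illustrative, not KuJiaLe's actual routing layer:

```python
# Illustrative baseline-fallback routing; tag names and registry shape
# are assumptions for this sketch, not the real SOA routing API.

# Service registry: service name -> {environment version -> instance address}
REGISTRY = {
    "order-service": {
        "default": "order-service.stable:8080",   # baseline environment
        "fe-1024": "order-service.fe-1024:8080",  # one feature environment
    },
}

def route(service: str, env_version: str) -> str:
    """Route to the requested feature version, else fall back to baseline."""
    versions = REGISTRY[service]
    return versions.get(env_version, versions["default"])

assert route("order-service", "fe-1024") == "order-service.fe-1024:8080"
assert route("order-service", "fe-9999") == "order-service.stable:8080"  # no match -> baseline
```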
We standardized the environment flow: after functional testing, code moves to sit; after integration testing, it proceeds to beta and then prod. Every prod deployment automatically flows back to stable, so the baseline always runs released code.
Databases and middleware are split into separate online and offline sets, simplifying maintenance and eliminating data sync issues.
3.1 Foundations
Self‑healing and high availability are achieved by migrating databases to Kubernetes, enabling minute‑level recovery.
We enabled all three Kubernetes probe types (startup, readiness, and liveness), which greatly improved service availability and self-healing.
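A sketch of the three probes on a single container, using the official Kubernetes Python client; the endpoints, port, and thresholds are illustrative:

```python
# The three probe types on one container; paths, port, and thresholds
# are example values, not the production settings.
from kubernetes import client

container = client.V1Container(
    name="demo-service",
    image="registry.example.com/demo-service:latest",
    # Startup probe: holds off the other probes until the app finishes booting.
    startup_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        failure_threshold=30, period_seconds=10,  # allow up to ~5 min to start
    ),
    # Readiness probe: removes the pod from Service endpoints while it fails.
    readiness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/ready", port=8080),
        period_seconds=5,
    ),
    # Liveness probe: restarts the container when it fails repeatedly.
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        period_seconds=10, failure_threshold=3,
    ),
)
```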
Horizontal Pod Autoscaler (HPA) dynamically scales pods based on CPU usage, addressing performance‑related environment issues.
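A matching HPA sketch (autoscaling/v1, CPU-based), again with illustrative names and thresholds:

```python
# CPU-based HPA via the Kubernetes Python client; deployment name and
# thresholds are example values.
from kubernetes import client

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="demo-service"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="demo-service",
        ),
        min_replicas=2,                        # also enforces the two-pod floor below
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out above 70% average CPU
    ),
)
```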
Critical services run at least two pods to avoid single points of failure.
After a prod deployment, the same configuration is automatically deployed to stable; only release‑branch code is allowed in stable, and configuration sync ensures baseline stability.
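One way to picture the sync is a post-release hook; everything below (the hook, the deploy call, the release fields) is a hypothetical stand-in for the internal pipeline:

```python
# Hypothetical post-release hook: replay a successful prod release
# (build + configuration) into stable, guarded by the release-branch rule.

def apply_release(env: str, service: str, image: str, config: dict) -> None:
    # Stand-in for the internal deploy API.
    print(f"deploy {service} ({image}) to {env} with {len(config)} config keys")

def on_prod_deploy_success(service: str, branch: str, image: str, config: dict) -> None:
    # Guard: only release-branch builds are allowed into the baseline.
    if not branch.startswith("release/"):
        raise ValueError(f"{branch} is not a release branch; refusing to sync to stable")
    apply_release("stable", service, image, config)

on_prod_deploy_success("order-service", "release/2024-06", "order-service:1.4.2",
                       {"DB_HOST": "db.offline.example.com"})
```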
Backup relies on Ceph, an open‑source distributed storage system that replicates data across nodes and provides fault detection and automatic recovery.
3.2 Pre‑emptive Prevention
Since offline environment issues are inevitable, we identified the core business flows and built automated inspections that exercise them in the sit and stable environments, detecting problems early.
We also implemented middleware and service health checks.
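A minimal inspection loop under these assumptions; the flow endpoints, gateway hosts, and alerting hand-off are illustrative:

```python
# Inspection-loop sketch: probe every core business flow in every
# environment and collect failures. Endpoints and hosts are examples.
import urllib.request

CORE_FLOWS = {
    "login":     "/api/login/health",
    "floorplan": "/api/floorplan/health",
    "render":    "/api/render/health",
}
ENVIRONMENTS = {
    "sit":    "http://gateway.sit.example.com",
    "stable": "http://gateway.stable.example.com",
}

def inspect() -> list[str]:
    """Probe every core flow in every environment; return failure summaries."""
    failures = []
    for env, base in ENVIRONMENTS.items():
        for flow, path in CORE_FLOWS.items():
            try:
                with urllib.request.urlopen(base + path, timeout=5) as resp:
                    if resp.status != 200:
                        failures.append(f"{env}/{flow}: HTTP {resp.status}")
            except OSError as exc:
                failures.append(f"{env}/{flow}: {exc}")
    return failures  # a real runner would page the on-call on any failure
```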
Changes related to offline environments are integrated into a change‑control system, aiding rapid issue localization.
For front‑end deployment, we introduced a "prepare" version before the default sit version; only after passing core‑function checks in prepare can code flow to default, preventing runtime errors from propagating.
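A gating sketch, with an in-memory routing table standing in for the real front-end version routing:

```python
# "prepare" gating sketch: a new build serves the prepare tag first and
# becomes the sit default only after core-function checks pass.

ROUTING = {"sit-default": "build-1041"}  # version currently served as sit default

def core_checks_pass(build: str) -> bool:
    # Stand-in for the core-function checks run against the prepare version.
    return True

def promote(build: str) -> None:
    """Deploy to prepare first; flip the default only if core checks pass."""
    ROUTING["sit-prepare"] = build          # 1. new build serves the prepare tag
    if core_checks_pass(build):             # 2. run core-function checks there
        ROUTING["sit-default"] = build      # 3. only then does it become default
    else:
        del ROUTING["sit-prepare"]          # failed builds never reach default

promote("build-1042")
assert ROUTING["sit-default"] == "build-1042"
```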
3.3 Emergency Response
When problems arise, time is critical. We built a batch rollback capability for front‑end micro‑applications, enabling minute‑level recovery.
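In outline, the rollback is a batch repoint of each micro-application to its last known-good version; the version store below is a stand-in for the real release registry:

```python
# Batch-rollback sketch for front-end micro-applications; app names and
# versions are illustrative.

CURRENT  = {"kaf-shell": "2.4.0", "panel-app": "1.9.2", "render-app": "3.1.0"}
PREVIOUS = {"kaf-shell": "2.3.7", "panel-app": "1.9.1", "render-app": "3.0.4"}

def batch_rollback(apps: list[str]) -> None:
    """Repoint every listed micro-app to its last known-good version."""
    for app in apps:
        CURRENT[app] = PREVIOUS[app]  # a real rollback would update CDN/route config

batch_rollback(["kaf-shell", "panel-app"])
assert CURRENT["kaf-shell"] == "2.3.7"
```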
Using Kubernetes, we perform in-place restarts that reuse already-pulled images, allowing selective batch restarts of thousands of pods within minutes; this addresses ZooKeeper glitches and the mass service deregistration they cause.
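A rough approximation with the Kubernetes Python client: restart pods in controlled batches, relying on the node's image cache (imagePullPolicy: IfNotPresent) so no registry pull occurs. The true in-place restart mechanism is internal; the namespace and labels here are illustrative:

```python
# Batch restart sketch: delete pods in controlled batches and let their
# controllers recreate them from node-cached images.
from kubernetes import client, config

def batch_restart(namespace: str, label_selector: str, batch_size: int = 50) -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    for i in range(0, len(pods), batch_size):
        for pod in pods[i:i + batch_size]:  # restart one controlled batch at a time
            v1.delete_namespaced_pod(pod.metadata.name, namespace)
        # a real tool would wait for the batch to become Ready before continuing

# Example: batch_restart("offline-sit", "tier=backend")
```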
An emergency dashboard aggregates alerts across API, application, host, and middleware layers, providing rapid insight into the current state for quick root‑cause analysis.
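The core of such a dashboard is grouping live alerts by layer; a toy sketch with an assumed alert shape:

```python
# Group live alerts by layer so responders see where a fault cluster sits.
# The alert fields and layer names are assumptions for this sketch.
from collections import defaultdict

LAYERS = ("api", "application", "host", "middleware")

def group_alerts(alerts: list[dict]) -> dict[str, list[str]]:
    by_layer = defaultdict(list)
    for alert in alerts:
        if alert["layer"] in LAYERS:
            by_layer[alert["layer"]].append(alert["summary"])
    return dict(by_layer)

print(group_alerts([
    {"layer": "middleware", "summary": "ZooKeeper session churn"},
    {"layer": "application", "summary": "order-service 5xx spike"},
]))
```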
3.4 Daily Operations
A dedicated virtual team operates daily using the built tools and capabilities to manage the offline environment.
We monitor specific metrics to assess stability; since these governance mechanisms took effect, both the number and the duration of environment-blocking incidents have trended downward.
Conclusion and Outlook
Standardization defined a baseline environment with accompanying processes, now effectively supporting daily development and testing activities.
New stability initiatives focus on foundational construction, pre‑emptive prevention, and emergency response, forming a long‑term mechanism through daily operations.
Future work will further explore resource cost optimization, environment self‑healing, and data stability.