How We Stabilized International Services with a Multi‑Phase Cloud‑Native Migration
This article details a four‑stage migration project that rebuilt international services on a cloud‑native stack, introducing temporary Istio monitoring, standardized change processes, Helm‑based deployments, and full microservice integration while sharing practical quality‑assurance lessons and pitfalls.
1. Introduction
As international business expanded, the existing global architecture faced severe stability challenges and frequent failures. A bottom‑up reconstruction, dubbed the "Moon Landing" project, was launched with cross‑team collaboration to improve reliability while keeping the live services stable.
2. Phase One – Support Monitoring & Alerts (Temporary Solution)
2.1 Monitoring Support for Istio Architecture
The Istio‑based global services lacked maintenance, making conventional monitoring, service governance, and change‑process standards unusable. The immediate need was to detect problems in the live environment and respond quickly. A temporary monitoring solution was built for Istio, to be replaced once services migrated to the standard microservice stack.
Basic Monitoring & Alert Construction (metrics)
Istio provides a limited set of built‑in metrics out of the box.
Metrics are scraped and analyzed by Prometheus.
Thresholds trigger alerts that are routed to the responsible business owners.
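The scrape‑and‑alert pipeline above can be sketched as a Prometheus alerting rule over Istio's standard request metric. The threshold, window, and severity label here are illustrative, not the project's actual configuration:

```yaml
groups:
  - name: istio-service-alerts
    rules:
      - alert: HighErrorRate
        # istio_requests_total is the standard Istio request metric;
        # alert when 5xx responses exceed 5% of traffic for 5 minutes.
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
            /
          sum(rate(istio_requests_total[5m])) by (destination_service) > 0.05
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "Error rate above 5% for {{ $labels.destination_service }}"
```

Alertmanager then routes firing alerts to the responsible business owner based on labels such as `severity` and the service name.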
Distributed Tracing with Jaeger (call chain)
Istio sidecar enables Jaeger client collection.
Jaeger server aggregates trace data.
Jaeger UI visualizes the call chain.
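One detail worth noting about the tracing setup: the Istio sidecar generates spans automatically, but the Jaeger UI can only stitch them into a single call chain if the application forwards the trace headers from each inbound request to its outbound calls. A minimal sketch (the function name is illustrative):

```python
# Headers used by Istio/Jaeger (B3 propagation) to link spans into one
# trace; the application must copy them from inbound to outbound requests.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
]

def propagate_trace_headers(inbound_headers: dict) -> dict:
    """Return the subset of inbound headers needed for trace propagation."""
    return {h: inbound_headers[h] for h in TRACE_HEADERS if h in inbound_headers}
```

Without this propagation, each hop appears in Jaeger as a disconnected one‑span trace.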
2.2 Basic Metrics Supported
Built from zero to one, the metric set is far smaller than the domestic standard, but it finally gave the global services sustainable monitoring and alerting capabilities. Two different monitoring templates were maintained to accommodate version differences between the Korean cluster and the others.
2.3 Quality‑Assurance Practices
Alarm War – Continuous Remediation
Before monitoring was in place, the team received hundreds of P0/P1 alerts daily, many of them duplicates, causing constant anxiety. By collaborating with business owners and enforcing strict alarm handling, the number of critical alerts dropped to single digits within three months.
Standardized Alarm Handling Process
Alarm handling evolved from ad‑hoc group messages to a formal on‑call rotation with clear owners, procedures, and escalation paths, improving response speed and reducing noise.
Typical alarm resolution steps:
Identify alarm type – response‑time anomaly, error‑rate anomaly, or network/middleware anomaly.
Assign priority, owner, and fix deadline based on type.
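The two steps above can be sketched as a triage routine. The type‑to‑policy mapping below is purely illustrative; the real rotation assigned owners and deadlines specific to each service:

```python
# Illustrative mapping from alarm type to triage policy.
TRIAGE_POLICY = {
    "response_time": {"priority": "P1", "fix_within_hours": 24},
    "error_rate": {"priority": "P0", "fix_within_hours": 4},
    "network_middleware": {"priority": "P1", "fix_within_hours": 12},
}

def triage(alarm_type: str, on_call: str) -> dict:
    """Assign priority, owner, and fix deadline based on the alarm type."""
    policy = TRIAGE_POLICY.get(alarm_type)
    if policy is None:
        # Unknown types escalate instead of being silently dropped.
        raise ValueError(f"unknown alarm type: {alarm_type}")
    return {"owner": on_call, **policy}
```

Encoding the policy as data rather than ad‑hoc judgment is what turns group‑chat firefighting into a repeatable on‑call process.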
3. Phase Two – Standardize Change Process
All changes to database, Redis, system, and business configurations now require approval, a process shaped by hard‑learned lessons from past incidents.
3.1 International Services Adopt ConfigMap
ConfigMap serves as a native Kubernetes configuration store, exposing settings to pods as environment variables. However, with no change control and overly open permissions, misconfigurations caused high‑severity failures.
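As a minimal sketch of the pattern (service and key names are hypothetical), a ConfigMap consumed as environment variables looks like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config   # illustrative name
data:
  DB_POOL_SIZE: "20"
  RETRY_LIMIT: "3"
---
# Deployment excerpt: envFrom imports every key as an environment variable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: order-service
          image: order-service:1.0.0
          envFrom:
            - configMapRef:
                name: order-service-config
```

Because any edit to `data` changes the environment of every pod that references it, putting ConfigMap edits behind an approval gate is what the standardized change process enforces.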
3.2 QA Highlights – Common Issues
Application cannot connect to the Config Center server – caused by the Istio sidecar not being ready at application startup; mitigated by increasing the client timeout and enabling a fast‑fail fallback.
Missing log output from the Config Center client – due to a missing log4j2 dependency; temporarily worked around by disabling the client's logging.
Istio watch request 504 timeout – requires further investigation.
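The fast‑fail fallback mentioned for the first issue can be sketched as follows; the URL, cache shape, and function name are illustrative, and the real client was a Config Center SDK rather than raw HTTP:

```python
import json
import urllib.error
import urllib.request

def fetch_config(url: str, local_cache: dict, timeout_s: float = 2.0) -> dict:
    """Fetch configuration with a short timeout; on any failure, fall back
    fast to the locally cached copy instead of blocking application
    startup (e.g. while the Istio sidecar is still coming up)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.loads(resp.read())
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return local_cache
```

The key design point is that a config‑center outage degrades to stale configuration rather than to a service that never starts.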
4. Phase Three – Unified Deployment Process
4.1 Helm‑Based Release
Because the global services differ from domestic ones, Helm was chosen as a deployment tool to minimize changes. The work includes adding Helm support to the release system, providing generic configuration templates, and migrating services to the release pipeline.
Develop Helm deployment capability in the release system.
Support generic configuration templates.
Migrate international services to the release system.
Standardize configuration templates for all services.
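To illustrate the generic‑template idea, a per‑service values file paired with an excerpt of the shared chart template might look like this; every name and value here is hypothetical:

```yaml
# values.yaml — per-service overrides consumed by the shared chart
replicaCount: 3
image:
  repository: registry.example.com/intl/order-service
  tag: "1.4.2"
---
# templates/deployment.yaml — excerpt of the generic template
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

With this split, a release reduces to `helm upgrade --install` with the service's values file, which the release system can drive uniformly for every international service.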
4.2 QA Practices
Configuration template validation – ensure consistency between system and business parameters via automated scripts.
Smooth application migration – adopt minimal‑change, gray‑release, and rollback strategies to handle the coexistence of old and new pods.
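The template‑validation scripts mentioned above amount to a structural diff between the standard template and each service's rendered configuration. A minimal sketch (function and field names are illustrative):

```python
def diff_config(expected: dict, actual: dict) -> dict:
    """Compare a rendered configuration against the standard template.

    Returns keys that are missing and keys whose values diverge, so the
    release pipeline can block an inconsistent deployment before rollout.
    """
    missing = {k: v for k, v in expected.items() if k not in actual}
    changed = {
        k: (v, actual[k])
        for k, v in expected.items()
        if k in actual and actual[k] != v
    }
    return {"missing": missing, "changed": changed}
```

Running this in the pipeline turns "system and business parameters are consistent" from a manual review item into an automated gate.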
5. Phase Four – Full Integration into Microservice Ecosystem
5.1 Migration Goals
Integrate services into the microservice governance platform.
Enable deployment via the existing release system.
Expose overseas clusters to the central monitoring system.
5.2 QA Practices
White‑box testing – code review, configuration review, monitoring observation.
Gray‑release & rollback – route switching, client call switching, log toggling.
Automation – interface diff checks, core API automated tests.
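The interface diff checks listed above can be sketched as a field‑by‑field comparison of old‑ and new‑cluster responses, skipping volatile fields such as timestamps or trace ids. This sketch assumes JSON responses; the names are illustrative:

```python
import json

def response_diff(old: str, new: str, ignore: set = frozenset()) -> list:
    """Compare two JSON API responses field by field, ignoring fields
    expected to differ between environments (timestamps, trace ids)."""
    a, b = json.loads(old), json.loads(new)
    return sorted(
        k for k in set(a) | set(b)
        if k not in ignore and a.get(k) != b.get(k)
    )
```

Replaying production traffic through both stacks and asserting an empty diff is what gives confidence that the migrated service is behaviorally equivalent before the routes are switched.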
6. Cross‑Region & Cross‑Team Collaboration
Define clear objectives.
Ensure concrete implementation.
Track progress regularly.
Assist each other in troubleshooting.
Expose and mitigate risks.
Qunhe Technology Quality Tech