How We Stabilized International Services with a Multi‑Phase Cloud‑Native Migration
This article details a four‑stage migration project that rebuilt international services on a cloud‑native stack, introducing temporary Istio monitoring, standardized change processes, Helm‑based deployments, and full microservice integration while sharing practical quality‑assurance lessons and pitfalls.
1. Introduction
As international business expanded, the existing global architecture faced severe stability challenges and frequent failures. A bottom‑up reconstruction, dubbed the "Moon Landing" project, was launched with cross‑team collaboration to improve reliability while keeping the live services stable.
2. Phase One – Support Monitoring & Alerts (Temporary Solution)
2.1 Monitoring Support for Istio Architecture
The Istio‑based global services lacked maintenance, making conventional monitoring, service governance, and change‑process standards unusable. The immediate need was to detect problems in the live environment and respond quickly. A temporary monitoring solution was built for Istio, to be replaced once services migrated to the standard microservice stack.
Basic Monitoring & Alert Construction (metrics)
Istio provides a limited set of built‑in metrics out of the box.
Metrics are scraped and analyzed by Prometheus.
Thresholds trigger alerts that are routed to the responsible business owners.
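The scrape‑and‑alert pipeline above can be sketched as a Prometheus alerting rule over Istio's standard request metric. The threshold, window, and severity label here are illustrative, not the project's actual configuration:

```yaml
groups:
  - name: istio-service-alerts
    rules:
      - alert: HighErrorRate
        # istio_requests_total is the standard Istio request metric;
        # alert when 5xx responses exceed 5% of traffic for 5 minutes.
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
            /
          sum(rate(istio_requests_total[5m])) by (destination_service) > 0.05
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "Error rate above 5% for {{ $labels.destination_service }}"
```

Alertmanager then routes firing alerts to the responsible business owner based on labels such as `severity` and the service name.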
Distributed Tracing with Jaeger (call chain)
Istio sidecar enables Jaeger client collection.
Jaeger server aggregates trace data.
Jaeger UI visualizes the call chain.
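One detail worth noting about the tracing setup: the Istio sidecar generates spans automatically, but the Jaeger UI can only stitch them into a single call chain if the application forwards the trace headers from each inbound request to its outbound calls. A minimal sketch (the function name is illustrative):

```python
# Headers used by Istio/Jaeger (B3 propagation) to link spans into one
# trace; the application must copy them from inbound to outbound requests.
TRACE_HEADERS = [
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
]

def propagate_trace_headers(inbound_headers: dict) -> dict:
    """Return the subset of inbound headers needed for trace propagation."""
    return {h: inbound_headers[h] for h in TRACE_HEADERS if h in inbound_headers}
```

Without this propagation, each hop appears in Jaeger as a disconnected one‑span trace.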
2.2 Basic Metrics Supported
Built from zero to one, the metric set is far smaller than the domestic standard, but it finally gave the global services sustainable monitoring and alerting capabilities. Two different monitoring templates were maintained to accommodate version differences between the Korean cluster and the others.
2.3 Quality‑Assurance Practices
Alarm War – Continuous Remediation
Before monitoring was in place, the team received hundreds of P0/P1 alerts daily, many of them duplicates, causing constant anxiety. By collaborating with business owners and enforcing strict alarm handling, the number of critical alerts dropped to single digits within three months.
Standardized Alarm Handling Process
Alarm handling evolved from ad‑hoc group messages to a formal on‑call rotation with clear owners, procedures, and escalation paths, improving response speed and reducing noise.
Typical alarm resolution steps:
Identify alarm type – response‑time anomaly, error‑rate anomaly, or network/middleware anomaly.
Assign priority, owner, and fix deadline based on type.
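The two steps above can be sketched as a triage routine. The type‑to‑policy mapping below is purely illustrative; the real rotation assigned owners and deadlines specific to each service:

```python
# Illustrative mapping from alarm type to triage policy.
TRIAGE_POLICY = {
    "response_time": {"priority": "P1", "fix_within_hours": 24},
    "error_rate": {"priority": "P0", "fix_within_hours": 4},
    "network_middleware": {"priority": "P1", "fix_within_hours": 12},
}

def triage(alarm_type: str, on_call: str) -> dict:
    """Assign priority, owner, and fix deadline based on the alarm type."""
    policy = TRIAGE_POLICY.get(alarm_type)
    if policy is None:
        # Unknown types escalate instead of being silently dropped.
        raise ValueError(f"unknown alarm type: {alarm_type}")
    return {"owner": on_call, **policy}
```

Encoding the policy as data rather than ad‑hoc judgment is what turns group‑chat firefighting into a repeatable on‑call process.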
3. Phase Two – Standardize Change Process
All changes to database, Redis, system, and business configurations now require approval, a process shaped by hard‑learned lessons from past incidents.
3.1 International Services Adopt ConfigMap
ConfigMap serves as a native Kubernetes configuration store, exposing settings to pods as environment variables. However, with no change control and overly open permissions, misconfigurations caused high‑severity failures.
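As a minimal sketch of the pattern (service and key names are hypothetical), a ConfigMap consumed as environment variables looks like this:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config   # illustrative name
data:
  DB_POOL_SIZE: "20"
  RETRY_LIMIT: "3"
---
# Deployment excerpt: envFrom imports every key as an environment variable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: order-service
          image: order-service:1.0.0
          envFrom:
            - configMapRef:
                name: order-service-config
```

Because any edit to `data` changes the environment of every pod that references it, putting ConfigMap edits behind an approval gate is what the standardized change process enforces.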
3.2 QA Highlights – Common Issues
Application cannot connect to the Config Center server – caused by the Istio sidecar not being ready at application startup; mitigated by increasing the client timeout and enabling a fast‑fail fallback.
Missing log output from the Config Center client – due to a missing log4j2 dependency; temporarily worked around by disabling the client's logging.
Istio watch request 504 timeout – requires further investigation.
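The fast‑fail fallback mentioned for the first issue can be sketched as follows; the URL, cache shape, and function name are illustrative, and the real client was a Config Center SDK rather than raw HTTP:

```python
import json
import urllib.error
import urllib.request

def fetch_config(url: str, local_cache: dict, timeout_s: float = 2.0) -> dict:
    """Fetch configuration with a short timeout; on any failure, fall back
    fast to the locally cached copy instead of blocking application
    startup (e.g. while the Istio sidecar is still coming up)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.loads(resp.read())
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return local_cache
```

The key design point is that a config‑center outage degrades to stale configuration rather than to a service that never starts.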
4. Phase Three – Unified Deployment Process
4.1 Helm‑Based Release
Because the global services differ from domestic ones, Helm was chosen as a deployment tool to minimize changes. The work includes adding Helm support to the release system, providing generic configuration templates, and migrating services to the release pipeline.
Develop Helm deployment capability in the release system.
Support generic configuration templates.
Migrate international services to the release system.
Standardize configuration templates for all services.
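To illustrate the generic‑template idea, a per‑service values file paired with an excerpt of the shared chart template might look like this; every name and value here is hypothetical:

```yaml
# values.yaml — per-service overrides consumed by the shared chart
replicaCount: 3
image:
  repository: registry.example.com/intl/order-service
  tag: "1.4.2"
---
# templates/deployment.yaml — excerpt of the generic template
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

With this split, a release reduces to `helm upgrade --install` with the service's values file, which the release system can drive uniformly for every international service.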
4.2 QA Practices
Configuration template validation – ensure consistency between system and business parameters via automated scripts.
Smooth application migration – adopt minimal‑change, gray‑release, and rollback strategies to handle the coexistence of old and new pods.
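The template‑validation scripts mentioned above amount to a structural diff between the standard template and each service's rendered configuration. A minimal sketch (function and field names are illustrative):

```python
def diff_config(expected: dict, actual: dict) -> dict:
    """Compare a rendered configuration against the standard template.

    Returns keys that are missing and keys whose values diverge, so the
    release pipeline can block an inconsistent deployment before rollout.
    """
    missing = {k: v for k, v in expected.items() if k not in actual}
    changed = {
        k: (v, actual[k])
        for k, v in expected.items()
        if k in actual and actual[k] != v
    }
    return {"missing": missing, "changed": changed}
```

Running this in the pipeline turns "system and business parameters are consistent" from a manual review item into an automated gate.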
5. Phase Four – Full Integration into Microservice Ecosystem
5.1 Migration Goals
Integrate services into the microservice governance platform.
Enable deployment via the existing release system.
Expose overseas clusters to the central monitoring system.
5.2 QA Practices
White‑box testing – code review, configuration review, monitoring observation.
Gray‑release & rollback – route switching, client call switching, log toggling.
Automation – interface diff checks, core API automated tests.
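The interface diff checks listed above can be sketched as a field‑by‑field comparison of old‑ and new‑cluster responses, skipping volatile fields such as timestamps or trace ids. This sketch assumes JSON responses; the names are illustrative:

```python
import json

def response_diff(old: str, new: str, ignore: set = frozenset()) -> list:
    """Compare two JSON API responses field by field, ignoring fields
    expected to differ between environments (timestamps, trace ids)."""
    a, b = json.loads(old), json.loads(new)
    return sorted(
        k for k in set(a) | set(b)
        if k not in ignore and a.get(k) != b.get(k)
    )
```

Replaying production traffic through both stacks and asserting an empty diff is what gives confidence that the migrated service is behaviorally equivalent before the routes are switched.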
6. Cross‑Region & Cross‑Team Collaboration
Define clear objectives.
Ensure concrete implementation.
Track progress regularly.
Assist each other in troubleshooting.
Expose and mitigate risks.
Qunhe Technology Quality Tech