Achieving Full Cloud‑Native Migration: Hangzhou MingShitang’s Journey to 100% SLA
This case study details how Hangzhou MingShitang migrated its entire online-education platform from self-hosted IDC infrastructure to Alibaba Cloud. The team redesigned the registration, configuration, microservice governance, safe-release and gateway layers with MSE, Sentinel and other cloud-native technologies, reaching 100% SLA while cutting gateway cost by 67% and lowering average request latency by about 5 ms.
Background
Before 2022 the system was deployed on a self‑managed IDC using a private Kubernetes cluster. Core components included:
Eureka for service registration
Apollo and Spring Cloud Config for configuration management
Redis, MySQL, MongoDB, Kafka, RabbitMQ, Hadoop for data storage and processing
Spring Cloud Java services with Zuul 1.0 as the gateway and Nginx as the entry point
ELK, Pinpoint and Zabbix for monitoring
Problems with the IDC Architecture
Stability: traffic spikes during peak periods caused frequent outages and SLA violations.
Elasticity: scaling required hours‑long procurement cycles.
Cost: idle IDC machines wasted resources.
Operational complexity: many self‑built services demanded specialized staff and made troubleshooting difficult.
Full Cloud Migration (2022)
A dedicated migration team partnered with Alibaba Cloud to move all workloads to the cloud, establishing the foundation for subsequent cloud‑native transformation.
Infrastructure Refactor – Registration & Configuration Center
The original stack (Eureka, Apollo, Spring Cloud Config) suffered from cluster unavailability and delayed configuration pushes. After evaluation, the team selected Alibaba Cloud MSE Nacos as a unified service registry and configuration center.
Migration steps:
Developed migration tools to export Apollo and Spring Cloud Config data.
Used MSE Sync to replicate Eureka instances to Nacos with zero downtime.
Mapped environments (dev, test, pre-prod, prod) to Nacos namespaces, business lines to groups, and configuration types to dataIds.
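The namespace, group and dataId layout can be pictured as a three-part address for each configuration item. The following is a minimal sketch in plain Java, not the Nacos client API: the `ConfigCoordinate` class, the `orders` business line and the `datasource.yaml` dataId are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value type (not part of Nacos): one configuration item
// addressed by namespace (environment), group (business line) and
// dataId (configuration type).
final class ConfigCoordinate {
    // Environments named in the article; each maps to its own namespace.
    static final List<String> ENVIRONMENTS =
            Arrays.asList("dev", "test", "pre-prod", "prod");

    final String namespace;
    final String group;
    final String dataId;

    ConfigCoordinate(String namespace, String group, String dataId) {
        if (!ENVIRONMENTS.contains(namespace)) {
            throw new IllegalArgumentException("unknown environment: " + namespace);
        }
        this.namespace = namespace;
        this.group = group;
        this.dataId = dataId;
    }

    // The same group and dataId, resolved in another environment.
    ConfigCoordinate in(String otherNamespace) {
        return new ConfigCoordinate(otherNamespace, group, dataId);
    }

    // Readable key, for example for logs.
    String key() {
        return namespace + "/" + group + "/" + dataId;
    }
}
```

Keeping the environment in the namespace means the same group and dataId can be promoted from dev to prod without renaming anything.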
Result: registration and configuration SLA reached 100% with no incidents.
Service Governance – High‑Availability Toolbox
Hystrix could not cope with the summer peak of more than 10k QPS. The team adopted Alibaba Sentinel (AHAS edition), integrated into MSE traffic governance. It provides:
QPS‑based rate limiting
Concurrency isolation (replaces heavyweight thread-pool isolation and cuts memory use by more than 10×)
Exception‑based circuit breaking and degradation
Real‑time rule updates without service restart
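To illustrate why concurrency isolation is lighter than thread-pool isolation, here is a minimal sketch using only `java.util.concurrent.Semaphore`. It is not Sentinel's implementation or API; `ConcurrencyLimiter` and its method names are hypothetical. The point is that the caller's own thread does the work and only a permit counter is kept per resource, so no dedicated pool of threads has to be allocated.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: caps the number of in-flight calls to one
// downstream resource. Work runs on the caller's thread; no extra
// thread pool is created.
final class ConcurrencyLimiter {
    private final Semaphore permits;

    ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the action if a permit is free; otherwise returns the
    // fallback immediately instead of queueing.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A caller would wrap each downstream call, for example `limiter.call(() -> client.fetch(id), () -> cachedValue)`, where `client` and `cachedValue` stand in for application code.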
Implementation SOP:
Coarse‑grained rate limiting at the gateway layer, logging hits to SLS and triggering alerts.
Fine‑grained limit‑circuit‑degrade controls at the application layer, leveraging MSE metrics for dynamic adjustments.
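The coarse-grained gateway limit can be sketched as a fixed-window QPS counter. This is an illustrative plain-Java version, not the Sentinel rule engine the team actually used; `QpsLimiter` is a hypothetical class, and the logging to SLS and alerting mentioned above are left out. The clock is injected so the window behaviour can be checked without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical fixed-window limiter: at most maxPerSecond requests
// pass in each one-second window; the rest are rejected.
final class QpsLimiter {
    private final int maxPerSecond;
    private final LongSupplier clockMillis; // injectable time source
    private long windowStart;
    private int count;

    QpsLimiter(int maxPerSecond, LongSupplier clockMillis) {
        this.maxPerSecond = maxPerSecond;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    // Returns true if the request may pass, false if it is limited.
    synchronized boolean tryPass() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= 1000) {
            // A new one-second window starts; reset the counter.
            windowStart = now;
            count = 0;
        }
        if (count < maxPerSecond) {
            count++;
            return true;
        }
        return false;
    }
}
```

In production the time source would be `System::currentTimeMillis`. A fixed window allows short bursts at window boundaries, which is usually acceptable for a coarse outer limit that is backed by finer controls at the application layer.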
Safe Release – No‑Loss Down/Up
5xx errors were observed during releases due to non‑graceful shutdowns and slow start‑up health checks.
Phase 1 – Custom Solutions
Graceful down: Nacos retains instance metadata for 1 minute, so a pod that is shutting down can still receive traffic during that window. A Kubernetes preStop hook that sleeps for 60 seconds keeps the pod alive long enough for in-flight requests to finish.
Graceful up: Simplified /health endpoints, removed heavy logic, and introduced delayed registration for services with long initialization.
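The article does not describe what happens inside the process while the preStop hook keeps the pod alive. As one possible illustration, the sketch below shows a drain step in plain Java: stop admitting new requests, then wait for running ones to finish. `DrainGate` and its methods are hypothetical and are not part of Nacos, Kubernetes or MSE.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: tracks in-flight requests so a service can
// refuse new work and wait for running requests before exiting.
final class DrainGate {
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private final AtomicInteger inFlight = new AtomicInteger(0);

    // Call at request start; returns false once shutdown has begun.
    boolean tryEnter() {
        if (shuttingDown.get()) {
            return false;
        }
        inFlight.incrementAndGet();
        // Re-check: shutdown may have started between check and increment.
        if (shuttingDown.get()) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    // Call when a request that entered has finished.
    void exit() {
        inFlight.decrementAndGet();
    }

    void beginShutdown() {
        shuttingDown.set(true);
    }

    // Polls until nothing is in flight or the timeout elapses.
    // Returns true if the gate drained in time.
    boolean awaitDrained(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (inFlight.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

In this sketch a shutdown hook would call `beginShutdown()` and then `awaitDrained(...)` with a timeout shorter than the 60-second preStop window.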
Phase 2 – Cloud Product Capability
Enabled MSE's built-in no-loss release feature, which removed the need for the custom handling from Phase 1.
Result: more than 100 Java applications now go down and come up during releases with no lost requests.
Full‑Link Gray Release
The initial in-house solution used the open-source Nepxion Discovery framework, which later proved inflexible. The team switched to the MSE gray-release product, which provides Agent-based, zero-code integration for mainstream frameworks. The resulting full-link gray-release stack consists of:
MST publishing system – dynamic application model.
MST traffic‑governance platform – rule management.
MST unified gateway – Go-based WASM plugin for traffic coloring (tagging requests as gray or baseline).
MST static rendering service – front‑end gray capability.
MSE Agent – Java service‑to‑service tag propagation.
Release workflow: internal validation → staged rollout (1 % → 5 % → 10 % → full) with continuous monitoring.
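The staged rollout relies on the same user staying on the gray version as the percentage grows. The article does not say how the split is computed (in practice it is handled by the gray-release rules), so the following is only a sketch under the assumption that users are bucketed by a stable hash of their ID. `GrayRollout` and its methods are hypothetical.

```java
// Hypothetical helper: assigns each user a stable bucket in [0, 100)
// and routes the user to the gray version when the bucket is below
// the current rollout percentage.
final class GrayRollout {
    // Stages from the workflow: 1% -> 5% -> 10% -> full.
    static final int[] STAGES = {1, 5, 10, 100};

    // String.hashCode is specified by the JDK, so the bucket is the
    // same on every instance; floorMod keeps it non-negative.
    static int bucketOf(String userId) {
        return Math.floorMod(userId.hashCode(), 100);
    }

    static boolean isGray(String userId, int percent) {
        return bucketOf(userId) < percent;
    }
}
```

Because the bucket never changes, every user who is gray at 1% is still gray at 5% and 10%, so widening the rollout only adds users and never moves anyone back to the baseline.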
Cloud‑Native Gateway Consolidation
Three generations of the unified entry layer:
2018-2019: Nginx + Spring Cloud Zuul 1.0 – high configuration complexity, no hot reload.
2022: Nginx + APISIX + Zuul 1.0 – added flexibility but introduced etcd management overhead.
2023: MSE cloud-native gateway (the commercial edition of Higress) – merges the traffic gateway and the business gateway into a single layer.
Key outcomes after migration to the MSE gateway:
SLA improved to 100 %.
Financial cost reduced by 67 % and compute cost by 75 %.
Average request latency decreased by ~5 ms.
High availability achieved through HTTPS hardware acceleration, kernel tuning, and Envoy parameter optimization.
Extensibility improved: the WASM gray-release plugin was migrated to the cloud-native gateway, where upgrades and rollbacks complete within seconds.
Results Summary
Registration & configuration center SLA: 100 %.
Service governance: dozens of Sentinel rules deployed, eliminating incidents caused by traffic spikes or downstream slow calls.
Release safety: 100+ Java services with 100% no-loss down/up.
Gray release: full-link traffic coloring with staged rollout and instant rule updates.
Gateway: 100 % SLA, 67 % cost reduction, 75 % compute reduction, ~5 ms latency improvement.
Future Direction
With stability secured, the focus shifts to improving development‑test quality, accelerating iteration, and exploring AI‑driven innovations. The organization plans to deepen integration of cloud computing and AI to drive further educational innovation.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
