Achieving 50% Cost Cut with Cloud‑Native Architecture: A Flexible Workforce Platform Case
Facing poor observability, high resource waste, and unstable releases, QingTuan’s flexible‑workforce platform transformed its monolithic and SOA systems into a cloud‑native micro‑service architecture using Alibaba Cloud ACK, MSE, ARMS, and Prometheus, achieving higher availability, elastic scaling, and up to 50% infrastructure cost reduction.
Architecture Evolution
The system started as a monolithic application (2014), migrated to a Service‑Oriented Architecture (SOA), and finally adopted a Spring Cloud‑based micro‑service architecture. Early micro‑services used Eureka for service discovery and Spring Cloud Gateway as the API gateway.
Cloud‑Native Migration
In 2021 the platform was re‑architected on Alibaba Cloud Container Service for Kubernetes (ACK) Serverless. Key migration steps:
Replace Eureka with MSE‑based Nacos for service registration and configuration.
Deploy ACK clusters across multiple Availability Zones (AZs) and use node‑pool isolation to separate business lines.
Leverage CSI plugins (cloud disk, OSS, NAS) for stateful workloads such as databases and Redis.
Enable VPC‑direct networking via the Terway plugin, allowing pods to communicate with existing VPC resources.
Integrate Alibaba Cloud ARMS for Java application performance tracing and MSE for traffic governance.
Scheduling, Elasticity and Resource Isolation
Three complementary strategies are used:
Multi‑AZ deployment with dedicated node pools for each business line, providing physical‑like isolation.
Horizontal Pod Autoscaling (HPA) combined with custom elastic‑scale policies (e.g., KEDA) to balance cost and stability.
CSI‑based storage for stateful services, ensuring persistent volumes for databases, Redis, etc.
Case 1 – Zero‑downtime rolling update across zones: Services are deployed in Hangzhou H and K zones. Node‑pool labels guide the scheduler to spread replicas. Kubernetes rolling updates with readiness probes guarantee that at least one replica remains healthy while the other is upgraded.
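The rolling update in Case 1 can be illustrated with a small simulation. This is a minimal sketch, not the platform's actual rollout logic: it assumes a surge-first strategy (one extra pod started before an old one is removed, comparable to maxSurge=1 and maxUnavailable=0), and the pod names, zone labels, and always-passing readiness probe are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    version: str
    zone: str
    ready: bool = False


def rolling_update(pods, new_version, min_ready):
    """Replace pods one at a time, starting each replacement first.

    Returns the number of ready pods observed after each replacement.
    Raises RuntimeError if the ready count ever drops below min_ready.
    """
    ready_history = []
    for old in list(pods):  # iterate over a copy; pods is mutated below
        # Start the replacement in the same zone so the zone spread is kept.
        new = Pod(f"{old.name}-{new_version}", new_version, old.zone)
        pods.append(new)
        # Hypothetical readiness probe: in this sketch it always passes.
        new.ready = True
        # Only after the new pod is ready is the old pod removed.
        pods.remove(old)
        ready = sum(1 for p in pods if p.ready)
        if ready < min_ready:
            raise RuntimeError("availability floor violated")
        ready_history.append(ready)
    return ready_history


pods = [
    Pod("svc-h", "v1", "zone-h", ready=True),
    Pod("svc-k", "v1", "zone-k", ready=True),
]
history = rolling_update(pods, "v2", min_ready=2)
```

Because each replacement becomes ready before its predecessor is removed, the ready count in this sketch never falls below the original replica count, and each zone keeps one replica throughout.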
Case 2 – Metric‑driven scaling with KEDA: Business scenarios such as event tracking, ad delivery, and peak‑activity campaigns emit Prometheus metrics (e.g., request rate, queue depth). KEDA polls these metrics and drives a Horizontal Pod Autoscaler, which adds or removes pods automatically through the Kubernetes API.
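KEDA hands its metrics to a Horizontal Pod Autoscaler, so the replica count in Case 2 follows the HPA formula documented by Kubernetes: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default). The sketch below applies that formula; the metric values, targets, and replica bounds are hypothetical and not taken from the platform's configuration.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Apply the HPA scaling formula to a per-pod average metric.

    current_value and target_value are in the same unit, for example
    requests per second per pod or queued messages per pod.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods each handling 150 req/s against a 100 req/s target: scale out.
scale_out = desired_replicas(4, 150, 100)
# Load falls to 20 req/s per pod: scale in, but not below the minimum.
scale_in = desired_replicas(4, 20, 100, min_replicas=2)
```

The minimum and maximum bounds are where the cost and stability trade-off is set: a higher minimum keeps headroom for sudden peaks, while a lower maximum caps spend during campaigns.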
Traffic Management
Traffic is controlled at three layers:
Ingress and request routing via Alibaba Cloud APISIX gateway.
Service‑level traffic governance, gray releases, and region‑aware routing using MSE micro‑service engine.
Asynchronous processing and delayed messaging with Kafka and RocketMQ.
A Backend‑For‑Frontend (BFF) layer adapts data formats for C‑end, B‑end, Android and iOS clients before invoking backend services.
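As an illustration of what such a BFF adapter might look like, the sketch below reshapes one backend record for different clients. The field names, client identifiers, and deep-link scheme are all hypothetical; the source does not describe the platform's actual payloads.

```python
def adapt_job(job: dict, client: str) -> dict:
    """Return the view of a backend job record suited to one client type."""
    view = {"id": job["id"], "title": job["title"]}
    if client == "b-end":
        # Employers see recruitment progress rather than wage formatting.
        view["applicants"] = job["applicants"]
    elif client in ("c-end", "android", "ios"):
        # Job seekers see a display-ready wage string.
        view["wage"] = f"{job['hourly_wage_cents'] / 100:.2f}"
        if client in ("android", "ios"):
            # Mobile apps additionally get a deep link (hypothetical scheme).
            view["deep_link"] = f"app://jobs/{job['id']}"
    else:
        raise ValueError(f"unknown client type: {client}")
    return view


JOB = {
    "id": 42,
    "title": "Warehouse shift",
    "hourly_wage_cents": 2500,
    "applicants": 17,
}
ios_view = adapt_job(JOB, "ios")
```

Keeping this shaping logic in the BFF means backend services return one canonical record and do not need to know which client is asking.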
Observability and Monitoring
The observability stack combines:
ARMS – Java‑level tracing, latency breakdown, and alerting.
Prometheus – Scrapes custom business metrics exposed by Java clients.
Grafana – Dashboards for visualizing Prometheus data.
Cloud Monitor – Infrastructure metrics for ECS, PolarDB, and message queues.
Typical workflow: a slow third‑party API appears as increased latency in ARMS traces; the corresponding metric spikes in Prometheus trigger an alert; Grafana dashboards help pinpoint the affected service; Cloud Monitor shows whether CPU or memory saturation contributed.
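Custom business metrics reach Prometheus as plain text served over HTTP in its exposition format. The platform exposes them from Java clients; the sketch below only illustrates the format using the Python standard library, and the metric name and labels are hypothetical.

```python
def render_counter(name, help_text, samples):
    """Render counter samples in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict.
    This sketch does not escape special characters in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "third_party_calls_total",
    "Calls made to third-party APIs",
    [
        ({"provider": "sms", "outcome": "ok"}, 1200),
        ({"provider": "sms", "outcome": "timeout"}, 35),
    ],
)
```

A rising share of the timeout series relative to the ok series is the kind of signal an alert rule could watch for in the slow third-party API scenario described above.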
Release Practices
Gray deployment and graceful shutdown are implemented through MSE agents:
During a rolling update, new pods are started first. The MSE agent registers a preStop hook on the old pods; when the hook runs, MSE notifies callers to stop sending traffic to the retiring instance.
For a new version, MSE exposes a health‑check endpoint. Only after the health check passes does MSE gradually shift a small percentage of traffic (e.g., 0.1%) to the new pods, allowing warm‑up before full traffic is routed.
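The warm-up behavior can be pictured as weighted routing in which a new pod's weight grows over time. The sketch below is an assumption-laden illustration, not MSE's algorithm: the linear ramp, the 120-second warm-up window, and the function names are hypothetical, and the 0.1% starting share is treated here as a fraction of a fully warmed instance's weight.

```python
import random


def warmup_weight(seconds_since_healthy, warmup_seconds=120, start_share=0.001):
    """Routing weight of a pod, relative to a fully warmed pod (1.0)."""
    if seconds_since_healthy < 0:
        return 0.0  # health check has not passed yet: no traffic
    if seconds_since_healthy >= warmup_seconds:
        return 1.0  # warm-up finished: full weight
    progress = seconds_since_healthy / warmup_seconds
    # Linear ramp from start_share up to 1.0 (assumed curve).
    return start_share + (1.0 - start_share) * progress


def pick_instance(instances, rng):
    """Choose one instance name from (name, weight) pairs by weight.

    At least one weight must be positive, otherwise choices() raises.
    """
    names = [name for name, _ in instances]
    weights = [weight for _, weight in instances]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
instances = [("old-pod", 1.0), ("new-pod", warmup_weight(-1))]
chosen = pick_instance(instances, rng)
```

Graceful shutdown is the mirror image: once the preStop hook fires, the retiring pod's weight is effectively set to zero so callers stop selecting it before the process exits.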
Results
Infrastructure cost fell by about 50% after underutilized ECS instances were replaced with densely packed container workloads.
High availability and elastic scheduling support a user base of more than 73 million.
Comprehensive monitoring shortens MTTR and enables safe gray releases and graceful shutdowns.
Future Directions
Adopt a service‑mesh (MSE or open‑source) to provide language‑agnostic traffic governance for Java, Python, Go, etc.
Explore GraalVM native images to further lower memory footprint and improve cold‑start latency.
Introduce chaos engineering experiments to proactively improve system stability.
Alibaba Cloud Native