Avoiding Service Warm‑up Misjudgment in Alibaba Cloud MSE Graceful Startup
This article explains why MSE's graceful startup misinterprets Kubernetes liveness probes as service warm‑up, shows how to configure ignored paths, discusses alternative TCP‑based probes, and outlines the three core no‑loss deployment features—delayed registration, small‑traffic warm‑up, and readiness checks—plus gateway warm‑up steps.
Introduction
The case study originates from Alibaba Cloud's technical service team working with YiYi Interconnect, a subsidiary of Geely Group that operates large‑scale electric‑vehicle battery‑swap stations. YiYi adopted Alibaba Cloud MSE (Microservice Engine) early to meet the high stability demands of its microservice architecture spanning vehicle, station, and cloud platforms.
Problem: Service Warm‑up Misjudgment
During a routine upgrade, YiYi observed traffic loss: the warm‑up phase started before the service had registered, which violates the expected loss‑less rollout. The execution order of the no‑loss startup steps, illustrated in the original diagrams, shows two issues: (1) warm‑up begins before registration, and (2) the QPS curve after warm‑up is a uniform pulse rather than the gradual ramp that small‑traffic warm‑up is meant to produce.
Root Cause Analysis
The misjudgment stems from the Kubernetes liveness probe calling the Spring Boot Actuator /actuator/health endpoint. MSE treats any successful call to this health endpoint as the start of service warm‑up because the health check is intercepted by the DispatcherServlet, which is also where MSE inserts its warm‑up detection logic. Consequently, the liveness probe is mistakenly counted as a warm‑up request.
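The misjudgment can be reproduced with a minimal sketch. `WarmupTracker` is a hypothetical stand-in, not MSE's actual code, and the timestamps are invented for illustration. It assumes the simplest possible rule: the first request that reaches the dispatcher marks the start of warm‑up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarmupTracker:
    """Hypothetical stand-in for warm-up detection; not MSE's implementation."""
    warmup_started_at: Optional[float] = None

    def on_request(self, path: str, now: float) -> None:
        # Naive rule (assumed): the first request through the dispatcher
        # starts warm-up, no matter who sent it or which path it targets.
        if self.warmup_started_at is None:
            self.warmup_started_at = now


tracker = WarmupTracker()
registered_at = 30.0  # assumed: the instance registers 30 s after start

# The liveness probe arrives before the instance has registered...
tracker.on_request("/actuator/health", now=5.0)
# ...so the first genuine business call no longer sets the warm-up start.
tracker.on_request("/orders", now=31.0)

warmup_before_registration = tracker.warmup_started_at < registered_at
print(warmup_before_registration)
```

Under this rule the warm‑up clock starts at the probe, so by the time real traffic arrives part of the warm‑up window has already elapsed.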
Direct Fix: Ignoring Specific Paths
Adding a configuration entry that tells MSE to ignore the health‑check URL resolves the issue:
profile_micro_service_record_warmup_ignored_path=<skip_urls, e.g., /actuator/health, multiple URLs separated by commas>
This approach works but introduces extra configuration and is invasive for large organizations.
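The effect of the ignored‑path setting can be sketched as follows. Both function names are hypothetical, and exact string matching is an assumption: the source does not say whether MSE also accepts prefixes or wildcards.

```python
def parse_ignored_paths(value: str) -> frozenset:
    """Split a comma-separated setting into a set of paths, dropping blanks."""
    return frozenset(p.strip() for p in value.split(",") if p.strip())


def counts_as_warmup_request(path: str, ignored: frozenset) -> bool:
    """Assumed rule: a request counts toward warm-up unless its path is ignored."""
    return path not in ignored


# "/ping" is an invented second entry, included only to show the comma syntax.
ignored = parse_ignored_paths("/actuator/health, /ping")

print(counts_as_warmup_request("/actuator/health", ignored))
print(counts_as_warmup_request("/orders", ignored))
```

With the health endpoint listed, probe calls no longer start the warm‑up clock, while business paths still do.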
Alternative Approaches
TCP‑based liveness probe: Switching the probe to TCP avoids invoking the Actuator endpoint, eliminating the false warm‑up trigger. However, forcing TCP may not suit all scenarios where HTTP health checks provide richer diagnostics.
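To illustrate why a TCP check cannot trigger warm‑up detection, here is a minimal sketch of what such a probe does: it opens a connection and closes it without sending any HTTP request, so nothing reaches the application's request handling. `tcp_probe` is an illustrative helper, not the kubelet's implementation, and the listener below is a throwaway stand‑in for the service port.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection can be opened; no payload is sent."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Throwaway listener on an ephemeral local port, standing in for the service.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

result_while_listening = tcp_probe("127.0.0.1", port)
listener.close()
result_after_close = tcp_probe("127.0.0.1", port)

print(result_while_listening, result_after_close)
```

The trade‑off noted above is visible here too: the probe only learns that the port accepts connections, not whether the application behind it is healthy.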
Automatic probe detection or RPC tagging: Future product directions include automatically recognizing and ignoring user‑defined liveness/readiness probes or tagging inbound RPC traffic so that MSE can differentiate genuine business calls from health checks. The tagging method requires both consumer and provider services to embed and recognize a custom header, offering the most precise solution but demanding full MSE adoption.
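The tagging idea can be sketched as below. The header name and tag value are hypothetical; the source does not specify what MSE would use. The point is only that the consumer marks its calls and the provider counts nothing but marked calls as business traffic.

```python
TAG_HEADER = "x-demo-traffic-tag"  # hypothetical header name
BUSINESS_TAG = "business"          # hypothetical tag value


def tag_outbound(headers: dict) -> dict:
    """Consumer side: return a copy of the headers with the business tag added."""
    tagged = dict(headers)
    tagged[TAG_HEADER] = BUSINESS_TAG
    return tagged


def is_business_call(headers: dict) -> bool:
    """Provider side: only tagged requests count as genuine business traffic."""
    return headers.get(TAG_HEADER) == BUSINESS_TAG


# A probe request carries no tag; a call from a tagged consumer does.
probe_headers = {"user-agent": "kube-probe/1.28"}
rpc_headers = tag_outbound({"accept": "application/json"})

print(is_business_call(probe_headers), is_business_call(rpc_headers))
```

This also shows why the approach demands full adoption: a consumer that does not add the tag would be indistinguishable from a probe.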
MSE No‑Loss Startup Features
MSE provides three core capabilities:
Delayed registration: The service waits passively until internal components finish initialization before registering with the service registry.
Small‑traffic warm‑up: Traffic is gradually increased based on each instance’s start time, using weighted load‑balancing to avoid sudden load spikes.
Readiness check: An integrated readiness probe ensures that a pod is only marked ready after it has successfully registered, preventing traffic from being routed to an unregistered instance during a rolling update.
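The small‑traffic warm‑up described above can be sketched as a weight that grows with instance uptime. The linear ramp, the 120‑second window, and the name `warmup_weight` are assumptions for illustration; the curve MSE actually applies may differ.

```python
def warmup_weight(uptime_s: float, warmup_s: float, full_weight: int = 100) -> int:
    """Load-balancing weight for an instance, ramping linearly during warm-up."""
    if warmup_s <= 0 or uptime_s >= warmup_s:
        return full_weight  # warm-up disabled or already finished
    if uptime_s <= 0:
        return 1            # just started: keep a minimal, non-zero weight
    return max(1, int(full_weight * uptime_s / warmup_s))


# One new instance, 30 s into an assumed 120 s warm-up, next to two warm ones.
new_weight = warmup_weight(uptime_s=30, warmup_s=120)
weights = [new_weight, 100, 100]
new_share = new_weight / sum(weights)

print(new_weight, round(new_share, 3))
```

Because the weight depends on each instance's own start time, a weighted load balancer sends the new instance a small share of traffic at first and raises it as the instance ages.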
Cloud‑Native Gateway Warm‑up
The MSE Cloud‑Native Gateway also supports loss‑less deployment. By configuring a 60‑second warm‑up period in the load‑balancing policy and gradually shifting traffic weight across rolling‑update endpoints, the gateway can pre‑heat services without causing 5XX errors.
Note: Monitoring tools that sample at minute granularity may not display the smooth traffic increase; increasing the warm‑up duration or using second‑level sampling can make the effect observable.
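The sampling effect is easy to see with synthetic numbers. The sketch below assumes a linear 60‑second ramp to 100 QPS (both figures are invented) and compares minute‑level averages with 10‑second averages.

```python
def qps_at(t: int, warmup_s: int = 60, target_qps: float = 100.0) -> float:
    """Synthetic per-second QPS: linear ramp during warm-up, then flat."""
    return target_qps * min(1.0, t / warmup_s)


def bucket_averages(series: list, bucket_s: int) -> list:
    """Average consecutive, non-overlapping buckets of bucket_s samples."""
    return [
        sum(series[i:i + bucket_s]) / bucket_s
        for i in range(0, len(series), bucket_s)
    ]


series = [qps_at(t) for t in range(180)]  # three minutes of per-second data
per_minute = bucket_averages(series, 60)
per_10s = bucket_averages(series, 10)

print([round(v, 1) for v in per_minute])
print([round(v, 1) for v in per_10s[:6]])
```

At minute granularity the whole ramp collapses into a single data point followed by the steady‑state value, so the chart looks like a step. The 10‑second buckets show the gradual increase.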
Conclusion
While MSE’s graceful startup has matured, integration challenges remain due to diverse probe configurations and service meshes. The recommended practice is to combine path‑ignoring, appropriate probe protocols, and, when possible, RPC tagging to achieve truly loss‑less rollouts. Ongoing product improvements aim to automate these mitigations, reducing configuration overhead and lowering the adoption barrier for cloud‑native microservice deployments.
Alibaba Cloud Native
