Diagnosing Slow Deployments in Alibaba Cloud SAE: A Visualized, Step‑by‑Step Guide
This article analyzes the common pain points of Alibaba Cloud Serverless App Engine (SAE) deployments—slow release times, opaque status details, and black‑box instance startup—then presents a visualized, observable, and explainable solution that pinpoints bottlenecks, offers concrete optimizations, and demonstrates the resulting performance improvements.
User Pain Points in SAE
As application scale grows, the number of services running on SAE multiplies, and the frequent "publish" operation becomes a bottleneck for developers and operators. Interviews with enterprise SAE customers revealed three major issues:
Publish latency ranging from 5 minutes to half an hour, severely slowing development cycles.
Release details expose only high‑level status, preventing users from seeing which step is slow.
Instance startup is a black box; users cannot observe detailed state changes or real‑time progress.
Solution Overview
To address these problems we designed a three‑layer approach: Observability (show the exact problem), Explainability (help users understand where the delay occurs), and Optimizations (provide actionable fixes). The implementation visualizes the release pipeline and instance lifecycle, exposing only the most useful information while hiding unnecessary complexity.
Observability – Making Problems Visible
SAE’s release engine processes a ChangeOrder that contains multiple Pipeline batches, each batch composed of Task steps and further Stage sub‑steps, forming a tree‑like structure. By aggregating logs from the SLS tenant‑events and sae‑pod‑queue logstores, we extract the following key data:
For each release, the longest‑running Task is identified (typically the "Execute Application Deploy" sub‑task).
For each batch, the associated instances are linked, allowing us to trace instance‑level startup phases.
These metrics are displayed in the release list as a top‑N latency ranking, enabling users to instantly spot the slowest components.
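The ChangeOrder → Pipeline → Task → Stage tree and the top-N ranking described above can be sketched as follows. This is an illustrative model, not the SAE API; the class and field names are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch of the ChangeOrder release tree: a ChangeOrder
# holds Pipeline batches, each batch holds Tasks, each Task holds Stages.
# Names and fields are illustrative, not the actual SAE data model.

@dataclass
class Stage:
    name: str
    duration_s: float

@dataclass
class Task:
    name: str
    stages: list  # list[Stage]

    @property
    def duration_s(self) -> float:
        # A task's duration is the sum of its stage sub-steps.
        return sum(s.duration_s for s in self.stages)

@dataclass
class Pipeline:
    batch: int
    tasks: list  # list[Task]

def slowest_tasks(pipelines, n=3):
    """Return the top-N longest-running tasks across all batches."""
    tasks = [t for p in pipelines for t in p.tasks]
    return sorted(tasks, key=lambda t: t.duration_s, reverse=True)[:n]
```

With a tree like this, rendering the release-list ranking is just a matter of calling `slowest_tasks` over the pipelines of a release and displaying the result.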
Explainability – Understanding the Bottleneck
We map the raw Kubernetes events to a simplified timeline that groups similar events and removes noise (init‑containers, sidecars, duplicate entries). The resulting stages are:
Resource Scheduling: time spent by the scheduler placing the pod on a node.
Instance Preparation: pulling images for init‑containers and sidecars.
Image Pull Start: pulling the user’s business container image.
Image Pull End: successful image download.
Container Creation: creating the main container.
Process Start: starting the main process inside the container.
Instance Ready: the pod becomes ready to receive traffic.
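The event-to-stage mapping above can be sketched as a small lookup plus a filter. The event reasons (`Scheduled`, `Pulling`, `Pulled`, `Created`, `Started`) are standard Kubernetes events; the grouping into stages and the noise-filtering rule are assumptions made for this example.

```python
# Illustrative mapping from raw Kubernetes event reasons to the
# simplified timeline stages listed above.
STAGE_BY_REASON = {
    "Scheduled": "Resource Scheduling",
    "Pulling":   "Image Pull Start",
    "Pulled":    "Image Pull End",
    "Created":   "Container Creation",
    "Started":   "Process Start",
}

def to_timeline(events):
    """events: list of (timestamp, reason, container) tuples sorted by time.
    Drops events from non-business containers (init-container/sidecar noise)
    and deduplicates repeated reasons, keeping the first occurrence."""
    timeline, seen = [], set()
    for ts, reason, container in events:
        if container != "main":  # hypothetical name for the business container
            continue
        stage = STAGE_BY_REASON.get(reason)
        if stage and stage not in seen:
            seen.add(stage)
            timeline.append((ts, stage))
    return timeline
```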
By separating platform‑side time from user‑side time, users can quickly determine whether the delay originates from the SAE platform or their own workload.
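As a rough sketch, the platform/user split amounts to subtracting timestamps at a chosen boundary. Where exactly the boundary sits is an assumption here; this example treats everything up to process start (scheduling, image pull, container creation) as platform-side, and everything from process start to readiness (application init, health checks) as user-side.

```python
def split_times(marks):
    """marks: dict mapping stage name -> timestamp in seconds.
    Platform-side: scheduling through container creation.
    User-side: process start through readiness.
    The boundary at "Process Start" is an illustrative assumption."""
    platform = marks["Process Start"] - marks["Resource Scheduling"]
    user = marks["Instance Ready"] - marks["Process Start"]
    return platform, user
```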
Optimizations – Reducing Publish Time
Based on the identified bottlenecks we provide concrete tuning options:
Batch Deployment Tuning: adjust batch size and parallelism to reduce overall release duration.
Image Acceleration: enable DADI‑based block‑device image caching (via an ACR EE image repository) to dramatically cut image pull time.
CPU Burst: allow the application to temporarily use up to twice the allocated CPU during startup, which benefits Java workloads.
Java Startup Acceleration: activate the built‑in Java startup boost feature.
Documentation links for each feature are provided in the original article.
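To see why batch tuning matters, a toy model (not the SAE scheduler) helps: if instances within a batch start in parallel and batches run sequentially, total release time scales with the number of batches.

```python
import math

def release_duration(instances, batch_size, startup_s):
    """Toy estimate of total release time, assuming instances within a
    batch start in parallel and batches run one after another.
    All parameters are illustrative; real SAE behavior may differ."""
    batches = math.ceil(instances / batch_size)
    return batches * startup_s
```

For example, releasing 20 instances with a 60-second startup takes 240 s in batches of 5 (four sequential batches) but only 120 s in batches of 10, which is the trade-off the batch-tuning option exposes.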
Implementation Details
Two custom components were built:
sae‑kube‑eventer : Streams all cluster events to the SLS tenant‑events logstore, attaching the SAE application ID as metadata.
sae‑kube‑exporter : Sends full pod status YAML histories to the SLS sae‑pod‑queue logstore.
The event processing pipeline consists of consumption, filtering (removing init‑container and sidecar noise, aggregating duplicates), and generation of a unified event model used by the front‑end visualizations.
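The aggregation step of that pipeline can be sketched as collapsing repeated events into one record with a count and first/last timestamps. The record fields here are illustrative, not the actual unified event model.

```python
from collections import OrderedDict

def aggregate(events):
    """events: iterable of (ts, reason, message) tuples sorted by ts.
    Collapses duplicates (same reason + message) into a single record
    carrying first_seen, last_seen, and a count. Field names are
    assumptions made for this sketch."""
    merged = OrderedDict()
    for ts, reason, message in events:
        key = (reason, message)
        if key in merged:
            rec = merged[key]
            rec["last_seen"] = ts
            rec["count"] += 1
        else:
            merged[key] = {"reason": reason, "message": message,
                           "first_seen": ts, "last_seen": ts, "count": 1}
    return list(merged.values())
```

The front-end then renders one timeline entry per aggregated record instead of one per raw event, which is what keeps a flapping `BackOff` from flooding the view.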
Results – Visualized Release Experience
After the enhancements, users can see:
Top‑N latency distribution directly on the release list page.
Step‑by‑step trace of each release, with clear labels for platform‑side vs. user‑side time.
Detailed timing for normal releases, image‑pull failures, and health‑check failures.
These visual cues dramatically improve troubleshooting speed and overall developer productivity.
Future Plans
SAE will continue to optimize runtime performance, focusing on image acceleration, Java startup, Java runtime enhancements, and instance anomaly diagnosis, further strengthening its cloud‑native, serverless capabilities.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.