
Diagnosing Slow Deployments in Alibaba Cloud SAE: A Visualized, Step‑by‑Step Guide

This article analyzes the common pain points of Alibaba Cloud Serverless App Engine (SAE) deployments—slow release times, opaque status details, and black‑box instance startup—then presents a visualized, observable, and explainable solution that pinpoints bottlenecks, offers concrete optimizations, and demonstrates the resulting performance improvements.

Alibaba Cloud Native

User Pain Points in SAE

As application scale grows, the number of services running on SAE multiplies, making the frequent "publish" operation a bottleneck for developers and operators. Interviews with enterprise SAE customers revealed three major issues:

Publish latency ranging from 5 minutes to half an hour, severely slowing development cycles.

Release details expose only high‑level status, preventing users from seeing which step is slow.

Instance startup is a black box; users cannot observe detailed state changes or real‑time progress.

Solution Overview

To address these problems we designed a three‑layer approach: Observability (show the exact problem), Explainability (help users understand where the delay occurs), and Optimizations (provide actionable fixes). The implementation visualizes the release pipeline and instance lifecycle, exposing only the most useful information while hiding unnecessary complexity.

Observability – Making Problems Visible

SAE’s release engine processes a ChangeOrder that contains multiple Pipeline batches, each batch composed of Task steps and further Stage sub‑steps, forming a tree‑like structure. By aggregating logs from the SLS tenant‑events and sae‑pod‑queue logstores, we extract the following key data:

For each release, the longest‑running Task is identified (typically the "Execute Application Deploy" sub‑task).

For each batch, the associated instances are linked, allowing us to trace instance‑level startup phases.

These metrics are displayed in the release list as a top‑N latency ranking, enabling users to instantly spot the slowest components.
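The ChangeOrder → Pipeline → Task → Stage hierarchy described above can be sketched as a small tree model. The class names, fields, and task names below are illustrative assumptions, not SAE's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical model of SAE's release hierarchy: a ChangeOrder holds
# Pipeline batches, each batch holds Tasks, each Task holds Stages.
@dataclass
class Stage:
    name: str
    duration_s: float

@dataclass
class Task:
    name: str
    stages: list = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        # A task's duration is the sum of its stage durations.
        return sum(s.duration_s for s in self.stages)

@dataclass
class Batch:
    tasks: list = field(default_factory=list)

def slowest_task(batches):
    """Return the longest-running Task across all batches of a release,
    the kind of entry surfaced in a top-N latency ranking."""
    all_tasks = [t for b in batches for t in b.tasks]
    return max(all_tasks, key=lambda t: t.duration_s)
```

Ranking releases by the result of `slowest_task` is what lets the release list surface the slowest component at a glance.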

Explainability – Understanding the Bottleneck

We map the raw Kubernetes events to a simplified timeline that groups similar events and removes noise (init‑containers, sidecars, duplicate entries). The resulting stages are:

Resource Scheduling: time spent by the scheduler placing the pod on a node.

Instance Preparation: pulling images for init-containers and sidecars.

Image Pull Start: pulling the user's business container image.

Image Pull End: successful image download.

Container Creation: creating the main container.

Process Start: starting the main process inside the container.

Instance Ready: the pod becomes ready to receive traffic.

By separating platform‑side time from user‑side time, users can quickly determine whether the delay originates from the SAE platform or their own workload.
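As a rough illustration, the event-to-timeline mapping can be sketched as below. The event reason strings and the platform-side vs. user-side split are assumptions for illustration, not SAE's exact rules (init-container and sidecar events are omitted for brevity):

```python
# Hypothetical mapping from raw Kubernetes event reasons to the
# simplified timeline stages.
STAGE_BY_REASON = {
    "Scheduled": "Resource Scheduling",
    "Pulling": "Image Pull Start",
    "Pulled": "Image Pull End",
    "Created": "Container Creation",
    "Started": "Process Start",
    "Ready": "Instance Ready",
}
# Assumed split: scheduling and preparation are platform-side;
# image pull and process start are user-side.
PLATFORM_STAGES = {"Resource Scheduling", "Instance Preparation"}

def build_timeline(events):
    """events: list of (timestamp_s, reason) tuples for one pod.
    Returns (timeline, platform_s, user_s), where timeline is a list
    of (stage, duration) pairs between consecutive events."""
    events = sorted(e for e in events if e[1] in STAGE_BY_REASON)
    timeline, platform_s, user_s = [], 0.0, 0.0
    for (t0, reason), (t1, _) in zip(events, events[1:]):
        stage = STAGE_BY_REASON[reason]
        dur = t1 - t0
        timeline.append((stage, dur))
        if stage in PLATFORM_STAGES:
            platform_s += dur
        else:
            user_s += dur
    return timeline, platform_s, user_s
```

Feeding a pod's event stream through `build_timeline` yields both the per-stage breakdown shown in the UI and the platform/user attribution in one pass.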

Optimizations – Reducing Publish Time

Based on the identified bottlenecks we provide concrete tuning options:

Batch Deployment Tuning: Adjust batch size and parallelism to reduce overall release duration.

Image Acceleration: Enable DADI-based block-device image caching (via the ACR EE image repository) to dramatically cut image pull time.

CPU Burst: Allow the application to temporarily use up to twice the allocated CPU during startup, which especially benefits Java workloads.

Java Startup Acceleration: Activate the built-in Java startup boost feature.
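The effect of batch tuning can be estimated with a back-of-envelope model. This formula is an assumption for illustration, not SAE's actual scheduler behavior:

```python
import math

def estimated_release_s(instances, batch_size, per_instance_s, parallelism):
    """Rough model: batches run sequentially; within a batch, up to
    `parallelism` instances start concurrently, each taking
    `per_instance_s` seconds on average."""
    batches = math.ceil(instances / batch_size)
    per_batch = min(batch_size, instances)
    waves = math.ceil(per_batch / parallelism)
    return batches * waves * per_instance_s
```

Under this model, releasing 20 instances in batches of 10 with parallelism 5 and a 30-second startup takes about 120 seconds; doubling parallelism to 10 halves that, which is the lever batch tuning exposes.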

Documentation links for each feature are provided in the original article.

Implementation Details

Two custom components were built:

sae‑kube‑eventer : Streams all cluster events to the SLS tenant‑events logstore, attaching the SAE application ID as metadata.

sae‑kube‑exporter : Sends full pod status YAML histories to the SLS sae‑pod‑queue logstore.

The event processing pipeline consists of consumption, filtering (removing init‑container and sidecar noise, aggregating duplicates), and generation of a unified event model used by the front‑end visualizations.
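The filtering and aggregation steps can be sketched as follows. The container naming convention and event field layout here are illustrative assumptions:

```python
def process_events(raw_events, sidecar_names=("sae-sidecar",)):
    """Sketch of the consumption -> filtering -> unification pipeline.
    raw_events: list of dicts with 'container', 'reason', 'message'."""
    seen = set()
    unified = []
    for ev in raw_events:
        name = ev.get("container", "")
        # Drop init-container and sidecar noise.
        if name.startswith("init-") or name in sidecar_names:
            continue
        # Aggregate duplicates: keep only the first event per
        # (container, reason) pair.
        key = (name, ev["reason"])
        if key in seen:
            continue
        seen.add(key)
        unified.append({"container": name, "reason": ev["reason"],
                        "message": ev.get("message", "")})
    return unified
```

The resulting list is a minimal stand-in for the unified event model that the front-end visualizations consume.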

Results – Visualized Release Experience

After the enhancements, users can see:

Top‑N latency distribution directly on the release list page.

Step‑by‑step trace of each release, with clear labels for platform‑side vs. user‑side time.

Detailed timing for normal releases, image‑pull failures, and health‑check failures.

These visual cues dramatically improve troubleshooting speed and overall developer productivity.

Future Plans

SAE will continue to optimize runtime performance, focusing on image acceleration, Java startup, Java runtime enhancements, and instance anomaly diagnosis, further strengthening its cloud‑native, serverless capabilities.

Written by Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.