How Alibaba Cloud’s SAE Achieves High Stability with Diagnostic Engines and Probes
This article explains how Alibaba Cloud's Serverless Application Engine (SAE) builds end‑to‑end stability by dividing fault handling into prevention, detection, localization and recovery, using a Kubernetes‑based diagnostic engine, runtime availability probes, a unified alert center, and a plug‑in architecture for root‑cause analysis.
Stability Construction Overview
SAE (Serverless Application Engine) implements a four‑stage fault‑handling pipeline: prevention, detection, localization, and recovery. The pipeline is built on a diagnostic engine, a runtime availability probe, and a unified alert center.
Prevention Layer
Region‑based refactoring and architectural upgrades.
Unit‑test (UT) and end‑to‑end (E2E) coverage to catch regressions early.
Fault‑injection drills to validate recovery paths.
Governance of agent, image, and IaaS dependencies to limit cascade failures.
Detection Layer
Detection relies on two complementary mechanisms:
Diagnostic engine that watches Kubernetes resource state changes and evaluates declarative DSL rules.
Runtime availability probe that runs as a sidecar inside each user instance, continuously checking dependencies and reporting heartbeats.
Both feed events into a unified alert center for immediate notification.
Diagnostic Engine (Infra Diagnosis)
The engine translates common failure patterns into a declarative DSL and processes Kubernetes events with a modified informer to avoid memory pressure and event loss.
Compressed resource usage: objects are placed in a workqueue and only the latest resourceVersion is retained.
Bookmark mechanism: the latest resourceVersion is persisted so that a restart resumes watching without re‑processing historic sync events.
Dynamic finalizer management: finalizers are added to Pods/Jobs during processing and removed afterwards to guarantee continuity across restarts.
The engine can monitor 100 % of core Kubernetes resources without loss and builds a time‑series view of related objects.
Root‑Cause Localization
When a DSL rule matches, the engine creates a diagnostic event and inserts it into a delay queue for the configured duration. If the condition clears, the event is removed, preventing false alarms. Diagnostic rules are packaged as lightweight plugins executed via RPC on Function Compute, providing:
Decoupled development and rapid iteration.
Dynamic extensibility by updating the DSL.
Fault isolation – a misbehaving plugin does not affect the whole engine.
Orchestration through a directed‑acyclic graph (DAG) state machine.
Runtime Availability Probe
The probe runs as a sidecar and performs periodic checks of network, storage, configuration, and other dependencies. Its architecture consists of:
Configuration center: gray‑release control and per‑app settings.
Daemon process: version hot‑update, health‑check handling, signal forwarding.
Worker process: executes dependency checks and emits heartbeat metadata.
Persistent storage: aggregates heartbeats for downstream consumption.
Diagnostic engine consumer: enriches probe data for root‑cause analysis.
The probe is lightweight (memory < 100 MB, negligible CPU) and self‑monitors its resource usage, triggering a circuit‑breaker when thresholds are exceeded.
Unified Alert Center
Alerts from the diagnostic engine, probes, Kubernetes events, logs, and external monitoring systems are normalized into a unified event model, de‑duplicated, enriched, and routed to ARMS. A four‑level alert hierarchy, aligned with SLA/SLO targets, ensures that critical issues are handled first. An upgrade (escalation) strategy raises an alert's level based on the incident count and the number of affected applications within a time window, while MTTD, MTTA, and MTTR are tracked as response metrics.
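The normalize and de‑duplicate steps can be sketched as follows. This is a minimal illustration, not SAE's implementation: the event fields (source, kind, resource, timestamp), the raw payload keys, and the 300‑second suppression window are all assumptions made for the example.

```python
def normalize(source, raw):
    """Map a source-specific payload onto one shared event shape (assumed fields)."""
    return {
        "source": source,
        "kind": raw.get("reason") or raw.get("type") or "Unknown",
        "resource": raw.get("resource", ""),
        "timestamp": raw["timestamp"],
    }


def fingerprint(event):
    # The source is deliberately left out so that the same fault reported by
    # two detectors collapses into a single alert.
    return (event["kind"], event["resource"])


class Deduplicator:
    """Suppress repeats of the same fingerprint seen within window_seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self._last_seen = {}

    def accept(self, event):
        key = fingerprint(event)
        last = self._last_seen.get(key)
        self._last_seen[key] = event["timestamp"]
        return last is None or event["timestamp"] - last > self.window


dedup = Deduplicator(window_seconds=300)
raw_events = [
    ("diagnostic-engine", {"reason": "PodSidecarNotReady", "resource": "default/web-0", "timestamp": 0}),
    ("probe", {"type": "PodSidecarNotReady", "resource": "default/web-0", "timestamp": 60}),
    ("diagnostic-engine", {"reason": "PodSidecarNotReady", "resource": "default/web-0", "timestamp": 600}),
]
forwarded = [e for e in (normalize(s, r) for s, r in raw_events) if dedup.accept(e)]
```

Here the probe's report at 60 s is suppressed as a repeat, while the report at 600 s is forwarded again because the quiet window has elapsed.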
Event Center
Events are categorized as:
Runtime events (e.g., health‑check failures, OOM, task failures).
Change events (e.g., deployment rollouts, image pull failures).
System events (e.g., node anomalies, internal migrations).
Cloud‑product events (e.g., SLB packet loss, quota exhaustion, configuration conflicts, resource deletions).
Users can subscribe by application, namespace, or region and configure thresholds and notification channels.
Infra Diagnostic Engine Details
State Listening
SAE replaces the native informer cache with a workqueue‑only design to reduce memory consumption. It retains only the newest version of each object by comparing resourceVersion. A bookmark event persists the latest resourceVersion so that a restart resumes from that point, eliminating duplicate sync events and preventing loss of delete events.
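A rough sketch of the "keep only the newest version" idea is shown below. It is an illustration under stated assumptions, not the engine's code: the class and method names are hypothetical, and resourceVersion is modeled as an integer purely for simplicity (Kubernetes itself documents resourceVersion as an opaque string).

```python
from collections import OrderedDict


class CoalescingQueue:
    """Holds at most one pending entry per object key: the newest version seen."""

    def __init__(self):
        self._pending = OrderedDict()  # key -> (resource_version, obj)
        self.bookmark = 0  # newest version observed; a real system would persist it

    def add(self, key, resource_version, obj):
        # Remember the newest version so a restart could resume watching from it.
        self.bookmark = max(self.bookmark, resource_version)
        current = self._pending.get(key)
        if current is not None and current[0] >= resource_version:
            return False  # stale or duplicate update: drop it
        self._pending[key] = (resource_version, obj)
        return True

    def pop(self):
        # Oldest key first; its value is always that key's latest version.
        key, (resource_version, obj) = self._pending.popitem(last=False)
        return key, resource_version, obj

    def __len__(self):
        return len(self._pending)


queue = CoalescingQueue()
queue.add("default/web-0", 101, {"phase": "Pending"})
queue.add("default/web-0", 105, {"phase": "Running"})
queue.add("default/web-0", 103, {"phase": "Pending"})  # older than 105, ignored
queue.add("default/web-1", 104, {"phase": "Running"})
pending = len(queue)
first = queue.pop()
```

Three updates to the same Pod occupy a single slot, which is why memory stays bounded by the number of live objects rather than by the number of events.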
Dynamic finalizers are attached to Pods and Jobs; they are removed only after successful processing, ensuring event continuity during high‑concurrency or restart scenarios.
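The finalizer lifecycle can be illustrated with plain dictionaries standing in for Kubernetes objects. The finalizer name and helper functions below are hypothetical, and a real controller would patch the object through the API server rather than mutate it locally.

```python
from contextlib import contextmanager

FINALIZER = "diagnosis.example/in-progress"  # hypothetical finalizer name


@contextmanager
def guarded(obj):
    """Hold a finalizer on obj while it is processed; release only on success."""
    finalizers = obj["metadata"].setdefault("finalizers", [])
    if FINALIZER not in finalizers:
        finalizers.append(FINALIZER)
    yield obj
    # Reached only when the with-body did not raise, i.e. processing succeeded.
    finalizers.remove(FINALIZER)


def is_deletable(obj):
    """An object marked for deletion goes away only once no finalizers remain."""
    meta = obj["metadata"]
    return meta.get("deletionTimestamp") is not None and not meta.get("finalizers")


pod = {"metadata": {"name": "web-0", "deletionTimestamp": "2024-01-01T00:00:00Z"}}
with guarded(pod):
    held = is_deletable(pod)      # finalizer present: deletion is blocked
released = is_deletable(pod)      # finalizer removed after success

job = {"metadata": {"name": "batch-1"}}
try:
    with guarded(job):
        raise RuntimeError("processing failed")
except RuntimeError:
    pass
still_held = FINALIZER in job["metadata"]["finalizers"]
```

Because the finalizer survives a failed or interrupted run, the object remains visible to the next attempt instead of disappearing before it has been diagnosed.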
The mechanism covers all core Kubernetes resources, with steady‑state memory usage under 100 MB.
Pattern Diagnosis
Common failure scenarios are abstracted into DSL patterns such as:
● Resource A has/does not have field X
● Resource A is in state X
● Resource A is in state X for ≥ S minutes
● Resource A has field X while also having field Y
● Resource A references Resource B but B does not exist
● Resource A references Resource B, A is in state X while B is in state Y
Example DSL for a sidecar container not ready for 300 s:
{
  "PodSidecarNotReady": {
    "duration": 300,
    "resource": "Pod",
    "filterField": [
      {"field": ["status", "phase"], "equalValue": ["Running"]},
      {"field": ["metadata", "labels", "sae.component"], "equalValue": ["app"]},
      {"field": ["metadata", "deletionTimestamp"], "nilValue": true}
    ],
    "checkField": [{
      "field": ["status", "containerStatuses"],
      "equalValue": ["false"],
      "subIdentifierName": "name",
      "subIdentifierValue": ["sidecar"],
      "subField": ["ready"]
    }]
  }
}
Example DSL for a Service stuck in Deleting for 300 s:
{
  "ServiceDeleting": {
    "duration": 300,
    "resource": "Service",
    "filterField": [{
      "field": ["metadata", "labels", "sae-domain"],
      "equalValue": ["sae-domain"]
    }],
    "checkField": [{
      "field": ["metadata", "deletionTimestamp"],
      "notEqualValue": [""]
    }]
  }
}
The engine matches incoming Unstructured objects against these rules, inserts matching objects into a delay queue, and emits a diagnostic event after the configured duration. If the condition clears before the timeout, the queued entry is removed, avoiding unnecessary alerts.
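The match-then-delay flow can be sketched in a few dozen lines. The evaluator below is one plausible reading of the DSL keys shown above, not the engine's actual semantics: it assumes that field values are compared as strings (so a boolean false matches "false"), that a missing field compares as an empty string, and that subIdentifierName/subIdentifierValue select entries from a list before subField is tested. All function and class names are hypothetical.

```python
def get_path(obj, path):
    """Follow a list of keys into nested dicts; None if any key is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def as_text(value):
    """Assumed comparison form: rule values are strings, so stringify the field."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def compare(value, cond):
    if "nilValue" in cond:
        return (value is None) == cond["nilValue"]
    if "equalValue" in cond:
        return as_text(value) in cond["equalValue"]
    if "notEqualValue" in cond:
        return as_text(value) not in cond["notEqualValue"]
    return False


def condition_holds(obj, cond):
    value = get_path(obj, cond["field"])
    if "subField" not in cond:
        return compare(value, cond)
    # The field is a list of sub-objects: pick entries by identifier, then
    # test the sub-field of each selected entry.
    selected = [item for item in (value or [])
                if item.get(cond["subIdentifierName"]) in cond["subIdentifierValue"]]
    return any(compare(get_path(item, cond["subField"]), cond) for item in selected)


def rule_matches(obj, rule):
    return all(condition_holds(obj, c) for c in rule["filterField"] + rule["checkField"])


class DelayQueue:
    """Fires an entry only if it stays matched for the rule's full duration."""

    def __init__(self):
        self._deadlines = {}  # (rule_name, object_key) -> time at which to fire

    def observe(self, rule_name, key, matched, duration, now):
        entry = (rule_name, key)
        if matched:
            self._deadlines.setdefault(entry, now + duration)  # keep first deadline
        else:
            self._deadlines.pop(entry, None)  # condition cleared: cancel

    def due(self, now):
        fired = sorted(e for e, deadline in self._deadlines.items() if deadline <= now)
        for entry in fired:
            del self._deadlines[entry]
        return fired


RULE = {
    "duration": 300,
    "resource": "Pod",
    "filterField": [
        {"field": ["status", "phase"], "equalValue": ["Running"]},
        {"field": ["metadata", "labels", "sae.component"], "equalValue": ["app"]},
        {"field": ["metadata", "deletionTimestamp"], "nilValue": True},
    ],
    "checkField": [{
        "field": ["status", "containerStatuses"],
        "equalValue": ["false"],
        "subIdentifierName": "name",
        "subIdentifierValue": ["sidecar"],
        "subField": ["ready"],
    }],
}
pod = {
    "metadata": {"name": "web-0", "labels": {"sae.component": "app"}},
    "status": {"phase": "Running", "containerStatuses": [
        {"name": "app", "ready": True},
        {"name": "sidecar", "ready": False},
    ]},
}

queue = DelayQueue()
name, key = "PodSidecarNotReady", "default/web-0"
queue.observe(name, key, rule_matches(pod, RULE), RULE["duration"], now=0)
early = queue.due(now=100)   # still inside the 300 s window
fired = queue.due(now=300)   # condition persisted for the full duration

queue.observe(name, key, rule_matches(pod, RULE), RULE["duration"], now=400)
pod["status"]["containerStatuses"][1]["ready"] = True  # sidecar recovers
queue.observe(name, key, rule_matches(pod, RULE), RULE["duration"], now=450)
cleared = queue.due(now=1000)
```

In the second half of the example the sidecar recovers 50 seconds after being queued, so the entry is cancelled and no diagnostic event is emitted.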
Root‑Cause Determination
Feature plugins implement specialized analysis (e.g., image pull failures, network latency, resource leaks). Plugins run on Function Compute, are invoked via RPC, and are orchestrated as a DAG based on their declared dependencies. This micro‑kernel architecture provides:
Agile development and decoupled debugging.
Dynamic plug‑in insertion/removal by editing the DSL.
Fault isolation – a failing plug‑in does not crash the engine.
Composable workflows through a state‑machine executor.
Common plug‑ins include monitoring spikes, release status, instance exception distribution, stdout capture, and environment snapshots.
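Dependency-ordered execution with fault isolation can be sketched using Python's standard graphlib module. This is only an illustration: local callables stand in for the RPC calls to Function Compute, and the plugin names, their return values, and the dependency edges are invented for the example.

```python
from graphlib import TopologicalSorter


def run_plugins(plugins, dependencies, context):
    """Run plugins in dependency order; one failure never stops the others."""
    results, failures = {}, {}
    for name in TopologicalSorter(dependencies).static_order():
        blocked = [dep for dep in dependencies.get(name, ()) if dep not in results]
        if blocked:
            failures[name] = "skipped: needs " + ", ".join(sorted(blocked))
            continue
        try:
            results[name] = plugins[name](context, results)
        except Exception as exc:  # isolate the failing plugin
            failures[name] = f"error: {exc}"
    return results, failures


def stdout_capture(context, results):
    raise TimeoutError("log backend unreachable")  # simulated plugin failure


plugins = {
    "release_status": lambda ctx, res: {"rolling_out": True},
    "env_snapshot": lambda ctx, res: {"image": ctx["image"]},
    "stdout_capture": stdout_capture,
    "log_summary": lambda ctx, res: res["stdout_capture"][:100],
    "verdict": lambda ctx, res: (
        "release in progress" if res["release_status"]["rolling_out"] else "unknown"
    ),
}
# Each plugin maps to the set of plugins whose results it needs first.
dependencies = {
    "release_status": set(),
    "env_snapshot": set(),
    "stdout_capture": set(),
    "log_summary": {"stdout_capture"},
    "verdict": {"release_status", "env_snapshot"},
}
results, failures = run_plugins(plugins, dependencies, {"image": "registry.example/app:1.2"})
```

The failing stdout_capture plugin only takes its dependent log_summary with it; the verdict plugin still runs because its own upstream results are available.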
Runtime Probe Dependency Checks
The probe validates each dependency category (network, storage, configuration, and so on) and reports a status code for each check. Heartbeats are pushed to a persistent store, enabling the diagnostic engine to consume them in real time. The push model avoids port conflicts, scales horizontally, and works with the single‑ENI network topology of SAE instances.
Configuration center controls gray‑release version and per‑app enablement.
Daemon handles hot‑updates, health checks, and signal forwarding.
Worker executes the actual checks and assembles heartbeat metadata.
Persistent storage aggregates heartbeats for visualization and downstream analysis.
Diagnostic engine consumes heartbeats to enrich root‑cause analysis.
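The worker's part of this flow, running checks and pushing one heartbeat, might look roughly like the sketch below. The status codes (0, 1, 2), the heartbeat fields, and the in-memory store are assumptions for illustration; the source does not specify the probe's wire format.

```python
import json


def run_checks(checks):
    """Run each dependency check and map its outcome to a status code."""
    statuses = {}
    for name, check in checks.items():
        try:
            statuses[name] = 0 if check() else 1  # 0 = ok, 1 = dependency failed
        except Exception:
            statuses[name] = 2                    # 2 = the check itself crashed
    return statuses


def build_heartbeat(instance_id, probe_version, statuses, timestamp):
    """Assemble heartbeat metadata; the version tag travels with every beat."""
    return {
        "instance": instance_id,
        "version": probe_version,
        "timestamp": timestamp,
        "checks": statuses,
        "healthy": all(code == 0 for code in statuses.values()),
    }


class HeartbeatStore:
    """In-memory stand-in for the persistent store that heartbeats are pushed to."""

    def __init__(self):
        self.records = []

    def push(self, heartbeat):
        self.records.append(json.dumps(heartbeat, sort_keys=True))


def broken_check():
    raise OSError("mount point missing")


checks = {"network": lambda: True, "config": lambda: False, "storage": broken_check}
store = HeartbeatStore()
heartbeat = build_heartbeat("instance-1", "1.0.0", run_checks(checks), timestamp=1700000000)
store.push(heartbeat)
```

Because the worker pushes outward, the instance does not need to expose a listening port for the engine to poll.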
Probe Resource Management
The probe uses a two‑process model (daemon + worker). The daemon pulls configuration and manages lifecycle events; the worker performs the checks. Hot‑updates are triggered by version tags in the heartbeat, allowing zero‑downtime upgrades without traversing the Kubernetes control plane. Resource consumption is kept minimal (a few megabytes of RAM, near‑zero CPU). The probe also monitors its own usage and trips a circuit breaker if configured thresholds are exceeded.
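A self-monitoring breaker of this kind can be sketched as below. The thresholds and the rule of tripping only after several consecutive breaches are assumptions added to avoid reacting to a single spike; the source states only that the breaker trips when configured thresholds are exceeded.

```python
class ResourceBreaker:
    """Opens after trip_after consecutive samples exceed either limit."""

    def __init__(self, max_rss_mb, max_cpu_percent, trip_after=3):
        self.max_rss_mb = max_rss_mb
        self.max_cpu_percent = max_cpu_percent
        self.trip_after = trip_after
        self.breaches = 0
        self.open = False

    def record(self, rss_mb, cpu_percent):
        """Feed one usage sample; returns True once the breaker is open."""
        if self.open:
            return True
        over = rss_mb > self.max_rss_mb or cpu_percent > self.max_cpu_percent
        self.breaches = self.breaches + 1 if over else 0  # reset on a clean sample
        if self.breaches >= self.trip_after:
            self.open = True
        return self.open


breaker = ResourceBreaker(max_rss_mb=50, max_cpu_percent=5.0, trip_after=3)
samples = [(12, 0.5), (80, 0.4), (90, 0.3), (10, 0.2), (70, 6.0), (75, 7.0), (60, 8.0)]
states = [breaker.record(rss, cpu) for rss, cpu in samples]
```

Once the breaker is open, the worker would stop running checks so that the probe never competes with the user's workload for resources.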
Alert Hierarchy and Upgrade Strategy
Four alert levels are defined based on severity and impact. An upgrade strategy evaluates the number of similar alerts, affected applications, and user count within the alert’s active window. When thresholds are crossed, the alert is escalated to a higher level and routed to additional on‑call personnel. Metrics such as Mean Time To Detect (MTTD), Mean Time To Acknowledge (MTTA), and Mean Time To Resolve (MTTR) are collected via ARMS for continuous improvement.
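The upgrade decision can be expressed as a small pure function. The numbering (level 1 as the most severe), the one-level-per-evaluation step, and the threshold values are all assumptions for illustration; the source does not give the concrete rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationThresholds:
    similar_alerts: int
    affected_apps: int
    affected_users: int


def escalate(level, similar_alerts, affected_apps, affected_users, thresholds):
    """Return the new level; level 1 is assumed to be the most severe of four."""
    crossed = (
        similar_alerts >= thresholds.similar_alerts
        or affected_apps >= thresholds.affected_apps
        or affected_users >= thresholds.affected_users
    )
    return max(1, level - 1) if crossed else level


thresholds = EscalationThresholds(similar_alerts=10, affected_apps=3, affected_users=50)
unchanged = escalate(3, similar_alerts=2, affected_apps=1, affected_users=5, thresholds=thresholds)
raised = escalate(3, similar_alerts=12, affected_apps=1, affected_users=5, thresholds=thresholds)
capped = escalate(1, similar_alerts=12, affected_apps=9, affected_users=500, thresholds=thresholds)
```

The counts passed in would be those observed within the alert's active window, so re-evaluating the function as the window fills is what drives an alert upward over time.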
Alibaba Cloud Native
