How Alibaba Cloud‑Native Architecture Achieves Scalable Observability and Alerting
This article details the design, data‑collection pipeline, monitoring stack, visualization practices, and alert‑response workflow of a globally deployed Alibaba Cloud‑native system that uses ACK, Prometheus, Grafana, and ARMS to achieve end‑to‑end observability across metrics, tracing, and logs.
System Architecture
The platform runs on Alibaba Cloud Container Service for Kubernetes (ACK). A custom Gateway service (exposed via a LoadBalancer) receives real‑time data streams and forwards them to a Kafka topic for buffering. Consumer workers then pick up the buffered data, process it, and write the results to storage.
Two storage tiers are used:
High‑performance ESSD block storage (expensive, low latency).
Cost‑effective NAS file storage (for less‑critical data).
Metadata is managed by ACM; the Center component queries the storage and returns results to users. The service is deployed in about 20 regions worldwide. The read/write path is long, and the monitoring scope is wide: it covers infrastructure, middleware, and business components.
Observability Data Collection
Metrics
Metrics are collected at three layers:
Application layer: core business interfaces are monitored with RED (Rate, Errors, Duration) indicators. SLOs are defined on these metrics, and the error budget is tracked against them. Business‑specific counters (revenue, UV, PV) are also exported.
Middleware & storage layer: Kafka client offsets, producer buffer usage, consumption latency, message size, broker watermarks, ESSD mount success rate, IOPS, and disk space.
Infrastructure layer: node CPU/memory watermarks, restart counts, K8s core components (API server, etcd, scheduler), pods stuck in Pending, OOMKilled events, VPC/SLB bandwidth and dropped connections.
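The RED indicators and the error budget at the application layer come down to simple arithmetic. The following is a minimal sketch, assuming a request‑based availability SLO; the class name, method names, and the numbers used in the usage note are illustrative and not taken from the system described here.

```java
import java.util.Arrays;

/** Illustrative RED and error-budget arithmetic. All names here are hypothetical. */
final class ErrorBudget {
    private ErrorBudget() {}

    /** Rate: requests per second over the observation window. */
    static double rate(long totalRequests, long windowSeconds) {
        return (double) totalRequests / windowSeconds;
    }

    /** Errors: fraction of requests that failed (0.0 when there was no traffic). */
    static double errorRatio(long failedRequests, long totalRequests) {
        return totalRequests == 0 ? 0.0 : (double) failedRequests / totalRequests;
    }

    /** Duration: nearest-rank percentile of request durations, with p in (0, 1]. */
    static long percentile(long[] durationsMillis, double p) {
        if (durationsMillis.length == 0 || p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("need at least one sample and 0 < p <= 1");
        }
        long[] sorted = durationsMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    /** Number of failed requests the SLO tolerates for this amount of traffic. */
    static double allowedFailures(double sloTarget, long totalRequests) {
        return (1.0 - sloTarget) * totalRequests;
    }

    /** Fraction of the error budget still unspent; a negative value means the SLO is violated. */
    static double budgetRemaining(double sloTarget, long failedRequests, long totalRequests) {
        double allowed = allowedFailures(sloTarget, totalRequests);
        if (allowed == 0.0) {
            return failedRequests == 0 ? 1.0 : Double.NEGATIVE_INFINITY;
        }
        return 1.0 - failedRequests / allowed;
    }
}
```

With an assumed 99.9% target and 1,000,000 requests, about 1,000 failures are tolerated, so 250 observed failures leave roughly 75% of the budget. In a real deployment these values would normally be computed in PromQL from counters and histograms rather than in application code.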
All metrics are scraped by Prometheus. In ACK, the node‑exporter DaemonSet exposes node metrics, and cloud‑native components (Kafka, CSI storage plugins) expose built‑in metrics. Java services export metrics through Micrometer, typically via Spring Boot Actuator, while Go services use the official Prometheus Go client library.
// Example Micrometer bean (Java Spring Boot): adds a common "application" tag to every meter.
// Declare it inside a @Configuration class.
@Bean
public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
    return registry -> registry.config().commonTags("application", "gateway");
}

Prometheus can be self‑hosted or consumed as the managed service ARMS Prometheus. Managed Prometheus provides automatic service discovery via ServiceMonitor resources and stores data centrally.
Tracing
Tracing is implemented with ARMS Cmonitor, which injects eBPF probes to capture request flows without modifying application code. The probe records request QPS, latency distribution, and inter‑service call graphs.
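The eBPF capture itself happens in the kernel and is not reproduced here. What can be illustrated is the aggregation step that turns observed calls into a call graph with per‑edge counts and latency. The sketch below is a simplified assumption of how such an aggregation could look; the class names and the service names in the usage note are hypothetical and do not reflect ARMS internals.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative aggregation of observed calls into call-graph edges. Names are hypothetical. */
final class CallGraph {
    private CallGraph() {}

    /** One observed request from a caller service to a callee service. */
    static final class Call {
        final String caller;
        final String callee;
        final long latencyMillis;

        Call(String caller, String callee, long latencyMillis) {
            this.caller = caller;
            this.callee = callee;
            this.latencyMillis = latencyMillis;
        }
    }

    /** Aggregated statistics for one "caller -> callee" edge. */
    static final class EdgeStats {
        long count;
        long totalLatencyMillis;

        double meanLatencyMillis() {
            return count == 0 ? 0.0 : (double) totalLatencyMillis / count;
        }
    }

    /** Groups calls by edge, keeping the order in which edges were first seen. */
    static Map<String, EdgeStats> aggregate(List<Call> calls) {
        Map<String, EdgeStats> edges = new LinkedHashMap<>();
        for (Call call : calls) {
            String edge = call.caller + " -> " + call.callee;
            EdgeStats stats = edges.computeIfAbsent(edge, key -> new EdgeStats());
            stats.count++;
            stats.totalLatencyMillis += call.latencyMillis;
        }
        return edges;
    }
}
```

Feeding it two calls from "gateway" to "kafka" (4 ms and 6 ms) and one from "consumer" to "storage" yields two edges, the first with a count of 2 and a mean latency of 5 ms. Dividing each edge count by the window length gives the per‑edge QPS.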
Logging
System, K8s control‑plane, and JVM logs are collected by arms‑Promtail and stored in Loki. Grafana can query Loki to filter logs by keywords or pod labels.
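Filtering by pod labels and keywords maps onto a LogQL stream selector followed by a line filter. As a rough sketch, the helper below builds such a query string. The label names (namespace, pod) are assumptions, since the labels actually available depend on how Promtail is configured, and the class is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative builder for a simple LogQL query. The class and label names are hypothetical. */
final class LogQl {
    private LogQl() {}

    /** Escapes backslashes and double quotes for use inside a double-quoted LogQL string. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Builds {label="value", ...} and, when a keyword is given, appends a
     * "line contains" filter (|=). Labels are sorted by name for stable output.
     */
    static String query(Map<String, String> labels, String keyword) {
        if (labels.isEmpty()) {
            throw new IllegalArgumentException("a stream selector needs at least one label matcher");
        }
        String selector = new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
                .collect(Collectors.joining(", ", "{", "}"));
        if (keyword == null || keyword.isEmpty()) {
            return selector;
        }
        return selector + " |= \"" + escape(keyword) + "\"";
    }
}
```

Calling LogQl.query with the labels namespace=prod and pod=gateway-0 and the keyword OutOfMemoryError produces {namespace="prod", pod="gateway-0"} |= "OutOfMemoryError", which can be pasted into a Grafana Explore panel backed by Loki.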
Data Ingestion and Storage
Prometheus scrapes metrics from:
Node‑exporter DaemonSet (node metrics).
K8s core components via the /metrics endpoint.
Kafka and ESSD CSI plugins (cloud‑native metrics).
Application endpoints exposed by Micrometer or the Prometheus Go SDK.
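All of these targets answer a scrape with the same plain‑text exposition format. To show what Prometheus actually reads from a /metrics endpoint, the sketch below renders a single counter by hand. In practice Micrometer or a Prometheus client library produces this output; the metric name, labels, and class are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative renderer for one counter in the Prometheus text format. Names are hypothetical. */
final class Exposition {
    private Exposition() {}

    /** Label values escape backslash, double quote, and newline. */
    static String escapeLabelValue(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Renders the HELP line, the TYPE line, and one sample line, each terminated by a newline. */
    static String counter(String name, String help, Map<String, String> labels, long value) {
        String labelPart = labels.isEmpty() ? "" : new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=\"" + escapeLabelValue(e.getValue()) + "\"")
                .collect(Collectors.joining(",", "{", "}"));
        return "# HELP " + name + " " + help + "\n"
                + "# TYPE " + name + " counter\n"
                + name + labelPart + " " + value + "\n";
    }
}
```

For a counter named gateway_requests_total with the labels region and status, the sample line has the form gateway_requests_total{region="cn-hangzhou",status="200"} 42. Each distinct label combination becomes its own time series, which is why label choice drives storage cost.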
In ARMS, a lightweight probe is installed in each cluster; the collected time‑series are stored in a fully managed backend, eliminating the need to operate a separate Prometheus stack.
Visualization and Issue Diagnosis
Grafana is used as the visualization front‑end. Recommended dashboard practices:
Limit the number of time‑series per panel to reduce browser rendering load.
Use Variables to switch data sources, regions, or versions dynamically.
Apply Transform to render tables for ad‑hoc analysis.
Distinguish between Range and Instant queries to avoid unnecessary data volume.
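The difference between Range and Instant queries is easy to quantify. A range query is evaluated at every step between the start and end time, while an instant query is evaluated once. The sketch below estimates the upper bound on returned data points; the class name and the example numbers are illustrative assumptions.

```java
/** Illustrative upper-bound estimate of data points returned by a query. Names are hypothetical. */
final class QueryVolume {
    private QueryVolume() {}

    /** A range query is evaluated at start, start + step, ... for every step not after end. */
    static long rangePointsPerSeries(long startEpochSec, long endEpochSec, long stepSec) {
        if (stepSec <= 0 || endEpochSec < startEpochSec) {
            throw new IllegalArgumentException("step must be positive and end must not precede start");
        }
        return (endEpochSec - startEpochSec) / stepSec + 1;
    }

    /** Upper bound for a range query: every series has a value at every step. */
    static long rangePoints(long seriesCount, long startEpochSec, long endEpochSec, long stepSec) {
        return seriesCount * rangePointsPerSeries(startEpochSec, endEpochSec, stepSec);
    }

    /** An instant query returns at most one point per series. */
    static long instantPoints(long seriesCount) {
        return seriesCount;
    }
}
```

For 200 series over a 6‑hour window at a 15‑second step, a range query can return up to 200 × 1,441 = 288,200 points, whereas an instant query returns at most 200. Table and stat panels that only show the latest value are therefore good candidates for Instant queries.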
Typical dashboards include node watermarks (CPU, memory, disk IOPS), global SLO view (latency, QPS, error rate), Kafka client/server metrics, and JVM health (memory, GC, thread count).
Alerting and Hierarchical Incident Response
ARMS replaced an unstable in‑house alert system with a unified solution that offers:
Global alert templates: a single rule can be applied to all clusters/regions.
Dynamic on‑call rotation and notification routing.
Alert enrichment (labels, priority tags) via data‑source callbacks.
Workflow actions: claim, close, mute, and statistical analysis of handling efficiency.
Alert rules are written in PromQL and attached to a template; the template is then applied across clusters. Event handling flows can invoke an HTTP data‑source (e.g., an IFC‑based service that reads ACM configuration) to compute dynamic labels before the alert is fired.
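The enrichment step can be pictured as a pure function from an alert's labels plus some configuration to a new label set. The sketch below assumes the configuration has already been loaded into a map (in the setup described above it would come from the service that reads ACM); the label names region and priority, the priority values, and the class itself are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative alert-label enrichment. The labels and priority values are hypothetical. */
final class AlertEnricher {
    private AlertEnricher() {}

    /**
     * Returns a copy of the alert labels with a "priority" label added.
     * The priority is looked up by the alert's "region" label and falls back
     * to the default when the region is missing or not configured.
     */
    static Map<String, String> enrich(Map<String, String> alertLabels,
                                      Map<String, String> priorityByRegion,
                                      String defaultPriority) {
        Map<String, String> enriched = new LinkedHashMap<>(alertLabels);
        String region = alertLabels.get("region");
        String priority = region == null
                ? defaultPriority
                : priorityByRegion.getOrDefault(region, defaultPriority);
        enriched.put("priority", priority);
        return enriched;
    }
}
```

An alert labeled region=cn-hangzhou with a configuration that maps cn-hangzhou to P1 comes back with priority=P1, while an alert from an unconfigured region receives the default. Keeping the function free of side effects makes it straightforward to expose behind an HTTP endpoint and to test without a live alert pipeline.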
Future Work
Improve alert precision and handling rates by leveraging post‑mortem data to adjust thresholds and introduce multi‑level alerting.
Increase cross‑type data correlation (metrics, traces, logs, profilers, flame graphs) to accelerate root‑cause analysis.
Control instrumentation cost by pruning unused metric dimensions and regularly cleaning redundant data.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
