
How Alibaba Scales Monitoring: From CMDB to AI‑Driven Full‑Link Observability

Alibaba’s monitoring evolution—from fragmented early tools to the standardized Sunfire platform and now AI‑powered full‑link observability—addresses scaling challenges, introduces business‑centric metrics, automated traceability, and intelligent anomaly detection, illustrating how massive, multi‑tenant infrastructures achieve unified, proactive operations at scale.

Efficient Ops

Expert Wu Feng (alias “Jinjie”) joined Alibaba in 2010, worked on CMDB and the Alimonitor monitoring platform, and later became product manager for Sunfire, a comprehensive monitoring system covering over 80% of Alibaba’s product lines.

1. Alibaba Monitoring Development and Challenges

The evolution is divided into four stages:

1. Pre‑2011: a fragmented era with disparate, custom‑built and open‑source monitoring tools that became hard to maintain at scale.

2. Platform‑centric stage: the focus was on the technical problems of data collection, storage, and alerting. Alimonitor exemplified this stage: it automated many tasks but suffered from low standardization and high complexity for users. Custom metrics with inconsistent naming, units, and formats made analysis difficult, and the system required expert operators, which limited broader adoption.

3. Standardization stage: introduction of the Sunfire platform, which standardized basic monitoring data for new applications and provided automated diagnostic tools.

4. Intelligent stage: aiming for unmanned, fully integrated monitoring operations.

The platform now serves over 4,000 monitoring servers, handles more than 10,000 applications, and processes roughly 2 TB of logs per second, covering more than 80% of Alibaba’s business units.

Key challenges include lack of a global view, inconsistent monitoring standards, missing business‑level perspectives, and high configuration costs.

2. Business Full‑Link Monitoring: Approach and Implementation

Traditional per‑application monitoring cannot provide a complete picture, leading to high coordination costs and uneven monitoring quality.

Alibaba introduced a business‑centric full‑link monitoring model consisting of three layers:

Business domain: a complete business or product, e.g., “transaction domain”.

Business activity: core use cases within a domain, each with “golden metrics” (traffic, latency, error rate) and defined upstream‑downstream relationships.

System service: critical methods supporting an activity, also measured by golden metrics.

The model standardizes monitoring across the business, enabling automatic generation of business links, golden metrics, and drill‑down dashboards.
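The three layers can be pictured as a small data model. The sketch below is illustrative only: the class names, fields, and the sample activities (`create_order`, `pay`, `ship`) are assumptions made for this example, not Sunfire's actual schema. It shows how a business link could be derived automatically once each activity declares its downstream activities.

```python
from dataclasses import dataclass, field


@dataclass
class SystemService:
    """A critical method that supports a business activity (hypothetical type)."""
    name: str


@dataclass
class BusinessActivity:
    """A core use case with its supporting services and downstream activities."""
    name: str
    services: list[SystemService] = field(default_factory=list)
    downstream: list[str] = field(default_factory=list)  # names of the next activities


@dataclass
class BusinessDomain:
    """A complete business or product, such as a transaction domain."""
    name: str
    activities: dict[str, BusinessActivity] = field(default_factory=dict)

    def add(self, activity: BusinessActivity) -> None:
        self.activities[activity.name] = activity

    def links(self) -> list[tuple[str, str]]:
        """Derive the business link as (upstream, downstream) edges."""
        return [
            (activity.name, nxt)
            for activity in self.activities.values()
            for nxt in activity.downstream
            if nxt in self.activities  # ignore references to unknown activities
        ]


# Sample data, invented for illustration.
domain = BusinessDomain("transaction")
domain.add(BusinessActivity("create_order", [SystemService("OrderService.create")], ["pay"]))
domain.add(BusinessActivity("pay", [SystemService("PayService.charge")], ["ship"]))
domain.add(BusinessActivity("ship", [SystemService("LogisticsService.dispatch")]))

print(domain.links())  # [('create_order', 'pay'), ('pay', 'ship')]
```

Because the relationships live in the model rather than in hand-built dashboards, the same structure can drive link diagrams and drill-down views for every domain.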

Key practices:

Identify core business activities and their dependent services.

Use a non‑intrusive, configurable SDK to instrument data points.

Automatically generate business links with traffic, latency, and success‑rate metrics.

Apply intelligent baseline and expert‑rule alerts to detect anomalies without manual rule configuration.

Golden metrics include traffic (QPS, order count), latency (tracked separately for successful and failed requests), error count and error rate, and saturation (resource usage). Additional auxiliary metrics can be defined per business dimension (e.g., store, merchant).
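As a rough illustration of how these metrics might be computed for one activity over a time window, here is a minimal sketch. The `Request` record and the metric names are assumptions made for this example; saturation is left out because it comes from resource usage data rather than from request records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed call to a business activity (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def golden_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute traffic, latency (success vs. failure), and error metrics."""
    ok_latencies = [r.latency_ms for r in requests if r.ok]
    failed_latencies = [r.latency_ms for r in requests if not r.ok]
    total = len(requests)

    def mean(values: list[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    return {
        "qps": total / window_s,
        "error_count": len(failed_latencies),
        "error_rate": len(failed_latencies) / total if total else 0.0,
        # Failed requests often time out or fail fast, so averaging them
        # together with successes would distort the latency picture.
        "latency_ok_ms": mean(ok_latencies),
        "latency_failed_ms": mean(failed_latencies),
    }


sample = [Request(20, True), Request(30, True), Request(40, True), Request(500, False)]
metrics = golden_metrics(sample, window_s=2)
print(metrics)
```

In this sample, a single failing request at 500 ms would pull a combined average up to 147.5 ms, while the split view shows successful requests holding steady at 30 ms.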

Horizontal business dimensions allow fine‑grained monitoring per business identity, merchant, or store, solving the “total volume” blind spot in multi‑tenant scenarios.
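The “total volume” blind spot is easiest to see with numbers. In the sketch below, the event shape and the merchant names are invented for illustration: the overall success rate looks healthy while one small merchant is failing most of its requests.

```python
from collections import defaultdict


def success_rate_by(events: list[dict], dimension: str) -> dict[str, float]:
    """Group events by a business dimension and compute the success rate per value."""
    counts = defaultdict(lambda: [0, 0])  # dimension value -> [successes, total]
    for event in events:
        bucket = counts[event[dimension]]
        bucket[0] += 1 if event["ok"] else 0
        bucket[1] += 1
    return {key: successes / total for key, (successes, total) in counts.items()}


# Merchant A dominates traffic and is healthy; merchant B is small and failing.
events = (
    [{"merchant": "A", "ok": True}] * 95
    + [{"merchant": "B", "ok": True}] * 1
    + [{"merchant": "B", "ok": False}] * 4
)

overall = sum(e["ok"] for e in events) / len(events)
per_merchant = success_rate_by(events, "merchant")

print(overall)       # 0.96
print(per_merchant)  # {'A': 1.0, 'B': 0.2}
```

An alert on the aggregate alone would stay silent at 96%, whereas the per-merchant view exposes that merchant B succeeds only 20% of the time.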

3. Exploration of Intelligent Anomaly Detection

Anomaly detection is organized in three layers:

Layer 1: Golden metrics of business activities, each modeled with predictive algorithms for high‑accuracy alerts.

Layer 2: Metrics of dependent system services, monitored with lightweight fluctuation detection triggered only when Layer 1 signals an anomaly.

Layer 3: Application‑level metrics (system, middleware, cache, database), analyzed through the alarm and change events they generate rather than by applying AI models to them directly.

This top‑down approach quickly narrows the fault scope, pinpointing problematic services, data centers, or network segments.
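The top-down flow can be sketched in a few lines. Everything here is an assumption made for illustration: the relative-deviation check stands in for whatever lightweight fluctuation detection is actually used, and the service names, application names, and events are invented.

```python
from statistics import mean


def fluctuates(series: list[float], threshold: float = 0.5) -> bool:
    """Lightweight check: does the latest point deviate from the mean of the
    earlier points by more than `threshold` (as a relative change)?"""
    *history, latest = series
    base = mean(history)
    return base != 0 and abs(latest - base) / abs(base) > threshold


def drill_down(
    activity_anomalous: bool,
    service_series: dict[str, list[float]],
    service_to_app: dict[str, str],
    app_events: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Return suspect services with the alarm/change events of their applications."""
    if not activity_anomalous:
        return {}  # Layer 2 and Layer 3 run only after a Layer 1 signal
    suspects = [name for name, series in service_series.items() if fluctuates(series)]
    return {name: app_events.get(service_to_app[name], []) for name in suspects}


# Invented sample data: latency per service, latest value last.
service_series = {
    "OrderService.create": [100, 102, 98, 300],
    "PayService.charge": [50, 51, 49, 50],
}
service_to_app = {"OrderService.create": "order-app", "PayService.charge": "pay-app"}
app_events = {"order-app": ["deploy at 14:02", "database slow-query alarm at 14:03"]}

result = drill_down(True, service_series, service_to_app, app_events)
print(result)
```

Running the cheaper checks only after a Layer 1 alert keeps the expensive predictive models limited to the small set of business golden metrics, while still narrowing the search to one service and its recent events.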

Smart baseline alerts now cover over 1,200 core metrics, automatically computing prediction baselines and thresholds, with configurable sensitivity.
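The article does not describe the prediction algorithm, so the following is only a sketch of the general idea under stated assumptions: the baseline is the mean of recent values, the threshold band is a multiple of their standard deviation, and the sensitivity setting selects that multiple. The `BAND_WIDTH` values are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical mapping: higher sensitivity means a narrower band and more alerts.
BAND_WIDTH = {"high": 2.0, "medium": 3.0, "low": 4.0}


def baseline_band(history: list[float], sensitivity: str = "medium") -> tuple[float, float]:
    """Return (lower, upper) thresholds around a baseline computed from history."""
    base = mean(history)
    spread = stdev(history)  # needs at least two data points
    k = BAND_WIDTH[sensitivity]
    return base - k * spread, base + k * spread


def is_anomalous(value: float, history: list[float], sensitivity: str = "medium") -> bool:
    """Flag a value that falls outside the band for the chosen sensitivity."""
    lower, upper = baseline_band(history, sensitivity)
    return not (lower <= value <= upper)


history = [100, 104, 96, 102, 98]  # invented sample values for one metric

print(is_anomalous(108, history, "high"))    # True: outside the narrow band
print(is_anomalous(108, history, "medium"))  # False: inside the wider band
```

A production system would more likely predict the baseline from seasonal patterns (time of day, day of week) than from a flat mean, but the relationship between sensitivity and band width stays the same.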

Overall, Alibaba’s full‑link monitoring transforms massive, heterogeneous services into a unified, proactive operations platform.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
