
How Alibaba’s SLS Powers a Unified Observability Platform for Massive Data

Alibaba Cloud’s Log Service (SLS) has evolved into a unified observability middle‑platform that handles tens of petabytes daily, offering integrated storage, processing, and AI‑driven analysis for logs, metrics, and traces, while addressing challenges of data ingestion, performance, and scalability across diverse Ops scenarios.

Alibaba Cloud Developer

Background of Building an Observability Platform

Alibaba Cloud Log Service (SLS) is a core piece of infrastructure in the Alibaba ecosystem, serving tens of thousands of users and processing about 20 PB of log, metric, and trace data each day. It supports AIOps, big‑data analytics, operational services, and security scenarios. Its goal is to solve engineers' observability problems, and it is evolving toward a unified observability middle‑platform.

The author, one of the earliest developers of the internal "Feitian" system, saw first‑hand the need for comprehensive monitoring and performance analysis when operating a distributed system on tens of thousands of machines. This led to the creation of the "ShenNong" monitoring system, which was later abstracted into SLS to serve broader Ops scenarios.

Technical Challenges of the Platform

Data‑source integration: diverse visualization, collection, and analysis tools result in many vertical systems with different storage formats and APIs.

Performance and speed: Ops scenarios are mission‑critical and require fast, real‑time analysis.

Analysis capability: massive, fragmented data demands dimensionality reduction, correlation, and reasoning, which AIOps algorithms address.

Alibaba Cloud SLS – A Self‑Built Observability Middle‑Platform

SLS is organized around four pillars:

1️⃣ A unified platform.

2️⃣ Two basic storage models – Logstore (for logs and traces) and MetricStore (for time‑series metrics) – built on a common storage abstraction and convertible into one another.

3️⃣ Three analysis engines – a DSL for data‑processing, an SQL engine for querying, and an AI‑driven AIOps engine.

4️⃣ Typical scenarios such as ITOps, DevOps, SecOps, and BusinessOps, covering more than 80 % of Alibaba’s internal use cases.

Storage Design

Four traditional storage systems are used in Ops:

Hadoop/Hive for cheap, high‑latency historical logs.

Elasticsearch for real‑time trace/log retrieval.

NoSQL/TSDB for aggregated metric data.

Kafka for routing temporary data.

These systems suffer from data mobility issues and inconsistent interfaces, limiting the efficiency of AIOps and DataOps.

Abstracting Storage

SLS introduces a FIFO binlog queue that orders data by arrival time. On top of the binlog, a Logstore with a schema (including an EventTime field) is created, allowing SQL‑style queries. MetricStores can be derived from selected Logstore columns to provide time‑series views.
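The layering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SLS's implementation: the class names `Binlog` and `Logstore`, their methods, and the field names `event_time` and `latency_ms` are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Binlog:
    """Append-only FIFO queue: records keep their arrival order."""
    _queue: deque = field(default_factory=deque)

    def append(self, record: dict) -> None:
        self._queue.append(record)

    def read_all(self) -> list:
        return list(self._queue)


class Logstore:
    """Schema-aware view over a binlog, queryable by event time."""

    def __init__(self, binlog: Binlog, schema: tuple):
        self.binlog = binlog
        self.schema = schema

    def query(self, start: datetime, end: datetime) -> list:
        # Keep only schema columns, filter on event_time, then sort by it,
        # because arrival order may differ from event order.
        rows = [
            {col: rec.get(col) for col in self.schema}
            for rec in self.binlog.read_all()
            if start <= rec["event_time"] < end
        ]
        return sorted(rows, key=lambda r: r["event_time"])


binlog = Binlog()
# The second record arrives late: its event time is earlier than the first's.
binlog.append({"event_time": datetime(2020, 8, 10, 17, 0, 1), "host": "Site1", "latency_ms": 25})
binlog.append({"event_time": datetime(2020, 8, 10, 17, 0, 0), "host": "Site1", "latency_ms": 45})

store = Logstore(binlog, schema=("event_time", "host", "latency_ms"))
rows = store.query(datetime(2020, 8, 10, 17, 0, 0), datetime(2020, 8, 10, 17, 0, 2))
print([r["latency_ms"] for r in rows])
```

The binlog preserves arrival order, while the Logstore view answers queries in event‑time order. This separation is what lets a single copy of the data serve both streaming consumers and time‑range queries.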

Example log records:

time, host, method, latency, uid, status
[2020-08-10 17:00:00, Site1, UserLogin, 45ms, 1001, OK]
[2020-08-10 17:00:01, Site1, UserBuy, 25ms, 1001, OK]
[2020-08-10 17:00:01, Site1, UserBuy, 1ms, 1001, OK]
[2020-08-10 17:00:01, Site1, UserLogout, 45ms, 1001, OK]
[2020-08-10 17:00:01, Site2, UserLogin, 45ms, 1002, Fail]

After ingestion, SQL can compute QPS, latency, etc., and a derived MetricStore can automatically aggregate results such as:

[host, method, time], [qps, latency]
[Site1, UserLogin, 2020-08-10 17:00:00], [1, 45]
[Site1, UserBuy, 2020-08-10 17:00:01], [2, 13]
[Site1, UserLogout, 2020-08-10 17:00:01], [1, 45]
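As a rough illustration of this derivation, the Python sketch below groups the five sample records by host, method, and second, then computes a request count and a mean latency per group. It rests on two assumptions that the article does not state: that "latency" is aggregated as an arithmetic mean in milliseconds, and that one‑second timestamps make the per‑group count equal to QPS. The function name `derive_metrics` is hypothetical.

```python
from collections import defaultdict

# The five sample records, as (time, host, method, latency_ms, uid, status).
LOGS = [
    ("2020-08-10 17:00:00", "Site1", "UserLogin", 45, 1001, "OK"),
    ("2020-08-10 17:00:01", "Site1", "UserBuy", 25, 1001, "OK"),
    ("2020-08-10 17:00:01", "Site1", "UserBuy", 1, 1001, "OK"),
    ("2020-08-10 17:00:01", "Site1", "UserLogout", 45, 1001, "OK"),
    ("2020-08-10 17:00:01", "Site2", "UserLogin", 45, 1002, "Fail"),
]


def derive_metrics(logs):
    """Group by (host, method, second); return {key: (qps, mean latency in ms)}."""
    buckets = defaultdict(list)
    for time, host, method, latency_ms, _uid, _status in logs:
        buckets[(host, method, time)].append(latency_ms)
    return {
        key: (len(latencies), sum(latencies) / len(latencies))
        for key, latencies in buckets.items()
    }


metrics = derive_metrics(LOGS)
for key, (qps, latency) in sorted(metrics.items()):
    print(key, qps, latency)
```

A different aggregation (maximum, percentile) or a status filter would give different values, so the choice of aggregate should be made explicit wherever such derived metrics are defined.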

Computation Design

Three core problems are addressed:

Transforming unstructured logs into structured data using a low‑code DSL with hundreds of operators.

Providing a unified query language that blends SQL, PromQL, and machine‑learning functions for complex analysis.

Embedding AI algorithms (inspection, prediction, clustering, root‑cause analysis) as built‑in functions accessible via SQL/DSL.

For example, a DSL snippet can extract parameters from URLs and enrich logs with additional fields, turning raw access logs into actionable data.
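The kind of transformation involved can be approximated in plain Python. This sketch does not use the SLS DSL or its operator names. It only shows the shape of the work: parsing a URL, lifting query parameters into fields, and enriching the record from a lookup table. The `transform` function, the `param_` prefix, and the `UID_TO_REGION` table are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical lookup table used to enrich records; not from the article.
UID_TO_REGION = {"1001": "cn-hangzhou", "1002": "cn-beijing"}


def transform(raw: dict) -> dict:
    """Extract query parameters from raw['url'] and add derived fields."""
    parts = urlsplit(raw["url"])
    # parse_qs returns a list per key; keep the first value of each.
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    enriched = dict(raw)
    enriched["path"] = parts.path
    enriched.update({f"param_{k}": v for k, v in params.items()})
    enriched["region"] = UID_TO_REGION.get(params.get("uid", ""), "unknown")
    return enriched


record = transform({"url": "https://example.com/buy?uid=1001&item=book", "status": "200"})
print(record)
```

Once the parameters are separate fields, they can be filtered and grouped in SQL like any other column.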

Platform Use Cases

Case 1 – Traffic Solution: Collect raw logs in a Logstore with 7‑day retention, back them up to OSS, use native SQL for aggregation, store the results in a MetricStore, and apply AIOps inspection to generate alerts. The entire workflow can be set up in about five minutes.
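To give a feel for the inspection step, here is a deliberately simple stand‑in: a trailing‑window z‑score check over an aggregated QPS series. The article does not describe the algorithms SLS actually uses, so this is not one of them. The function name `inspect`, the window size, and the threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def inspect(series, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if series[i] != mu:
                alerts.append(i)
        elif abs(series[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts


qps = [100, 102, 98, 101, 99, 100, 400, 101, 100]  # made-up per-interval QPS
print(inspect(qps))
```

On this made‑up series only the spike at index 6 is flagged. The points after it are not, because the spike inflates the window's standard deviation. Production inspection would need to handle this masking effect as well as seasonality.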

Case 2 – Cloud Cost Monitoring: SLS ingests daily billing data, analyzes and visualizes cost trends, predicts future expenses, and identifies anomalies, enabling a cost‑management application for Alibaba Cloud users.
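The prediction step can be illustrated with the simplest possible model, an ordinary least‑squares line fitted to daily totals and extended forward. This is a sketch under stated assumptions, not the forecasting method SLS uses, which the article does not specify. The function names and the billing figures are invented for the example.

```python
def fit_line(values):
    """Ordinary least-squares fit of y = slope * day + intercept, day = 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def forecast(values, days_ahead):
    """Extend the fitted line `days_ahead` days past the last observation."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [slope * (n + d) + intercept for d in range(days_ahead)]


daily_cost = [100.0, 110.0, 120.0, 130.0, 140.0]  # made-up billing totals
print(forecast(daily_cost, 2))
```

Real billing data usually has weekly and monthly cycles, so a linear trend serves only as a baseline against which a seasonal model can be compared.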

Final Thoughts

AIOps combines AI with DevOps/ITOps/SecOps/BusinessOps. Data is the foundation, compute power the substrate, and algorithms the core; all three are indispensable. Successful AIOps deployment relies on domain knowledge from experienced operators, which can be captured through templates, knowledge graphs, or transfer learning.
