Big Data · 17 min read

Design and Practices of Alibaba Cloud's Billion‑Scale Real‑Time Log Analysis System

This article presents the architecture, core challenges, key design decisions, and future directions of Alibaba Cloud's SLS platform, which handles billions of daily log queries with sub‑300 ms latency by leveraging LSM‑based storage, indexing, columnar layout, distributed caching, and multi‑tenant isolation.

DataFunTalk

The article introduces Alibaba Cloud SLS, a one‑stop observability log service that provides data collection, unified storage (hot and intelligent cold tiers), and real‑time analysis, focusing on the storage and analysis foundations for massive log analytics.

Four core problems are identified: (1) ultra‑low query latency for online real‑time analysis, (2) handling petabyte‑scale data with elastic row‑scan ranges, (3) supporting extreme concurrent query loads (up to 72 k concurrent queries during peak events), and (4) ensuring high availability and tenant isolation in a multi‑tenant cloud environment.

Key design solutions include a three‑layer query paradigm (time range, query, analysis), an LSM‑Tree‑based storage model with projects, logstores, and shards, time‑based sharding, inverted indexing, and columnar storage for selective field loading. Index and columnar storage trade space for time, reducing I/O and accelerating computation.
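To make the space-for-time trade concrete, here is a minimal sketch (not SLS's actual implementation; all names and the sample data are illustrative) of how an inverted index narrows a query to matching rows, and how a columnar layout then loads only the requested field:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: token -> list of row IDs that contain it."""
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, row_id, tokens):
        for tok in tokens:
            self.postings[tok].append(row_id)

    def query_and(self, *tokens):
        """Rows containing all tokens: intersect the posting lists."""
        lists = [set(self.postings[t]) for t in tokens]
        return sorted(set.intersection(*lists)) if lists else []

# Columnar layout: each field is its own array, so a query that only
# needs `latency_ms` never reads the (typically large) `message` bytes.
columns = {
    "status":     [200, 500, 200, 404],
    "latency_ms": [12, 340, 8, 95],
    "message":    ["ok", "timeout", "ok", "missing"],
}

idx = InvertedIndex()
for row_id, status in enumerate(columns["status"]):
    idx.add(row_id, [f"status:{status}"])

rows = idx.query_and("status:200")            # index prunes rows first
latencies = [columns["latency_ms"][r] for r in rows]  # then load one column
```

The index costs extra space at write time but replaces a full scan with a posting-list intersection, and the columnar layout keeps the subsequent I/O proportional to the fields actually referenced by the query.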

To achieve low latency, the system colocates compute and storage on the same machines, uses domain sockets and shared memory for data exchange, and exploits memory caches, page caches, and shard locality. A multi‑level caching hierarchy (data, index, analysis engine, scheduler, compute, result) further speeds up query planning, execution, and result retrieval.
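The layered-cache idea can be sketched as a simple read-through lookup that checks each level in order and promotes hits upward; this is a minimal illustration under assumed layer names (result, compute, index, data), not the platform's real cache code:

```python
class LayeredCache:
    """Multi-level cache sketch: probe layers fastest-first, promote
    hits into the layers above, and fall through to storage on a miss."""
    def __init__(self, layer_names):
        # each layer is (name, dict); earlier = faster
        self.layers = [(name, {}) for name in layer_names]

    def get(self, key, load_fn):
        for i, (name, store) in enumerate(self.layers):
            if key in store:
                for _, upper in self.layers[:i]:
                    upper[key] = store[key]   # promote toward fast layers
                return store[key], name
        value = load_fn(key)                  # miss everywhere: hit storage
        for _, store in self.layers:
            store[key] = value
        return value, "storage"

cache = LayeredCache(["result", "compute", "index", "data"])
v1, src1 = cache.get("q1", lambda k: "rows...")   # first run: from storage
v2, src2 = cache.get("q1", lambda k: "rows...")   # repeat: result-layer hit
```

A repeated query short-circuits at the result layer and never touches planning, execution, or disk, which is how caching at every stage compounds into sub-second latencies for hot queries.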

Scalability is addressed by separating storage and compute, distributing LSM blocks across many nodes, and horizontally scaling I/O, memory, and CPU resources. Affinity‑based scheduling and load‑balancing mitigate hotspot and latency issues introduced by the separation.
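Affinity-based scheduling can be sketched as "prefer the node that last served this shard (its caches are warm), unless it is overloaded." The scheduler below is a hypothetical simplification; node names, the load counter, and the overload threshold are all illustrative:

```python
class AffinityScheduler:
    """Route a shard's queries to its previous node for cache locality;
    fall back to the least-loaded node when that node is saturated."""
    def __init__(self, nodes, max_load=10):
        self.load = {n: 0 for n in nodes}   # in-flight queries per node
        self.affinity = {}                  # shard -> preferred node
        self.max_load = max_load

    def pick(self, shard):
        node = self.affinity.get(shard)
        if node is None or self.load[node] >= self.max_load:
            node = min(self.load, key=self.load.get)  # least-loaded fallback
            self.affinity[shard] = node               # record new preference
        self.load[node] += 1
        return node

    def done(self, node):
        self.load[node] -= 1

sched = AffinityScheduler(["node-a", "node-b"])
n1 = sched.pick("shard-7")    # first assignment: least-loaded node
sched.done(n1)
n2 = sched.pick("shard-7")    # repeat query: same node, warm caches
```

The overload check is what mitigates hotspots: a shard that becomes hot spills onto other nodes instead of pinning all its traffic to one machine.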

High‑concurrency challenges are mitigated by caching SQL parsing and planning results, reusing long‑lived RPC connections instead of short‑lived HTTP connections, and limiting per‑tenant concurrency through distributed query queues and rate‑limiting mechanisms.
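Two of these mechanisms are easy to sketch: a plan cache keyed by normalized SQL text, and a per-tenant in-flight cap. Both are minimal illustrations under assumed names, not SLS internals:

```python
import threading

class PlanCache:
    """Cache parsed/planned queries by normalized SQL text, so a hot
    query pays the parse+plan cost only once."""
    def __init__(self):
        self.plans = {}
        self.hits = 0

    def get_plan(self, sql, plan_fn):
        key = " ".join(sql.lower().split())   # crude normalization
        if key in self.plans:
            self.hits += 1
        else:
            self.plans[key] = plan_fn(sql)
        return self.plans[key]

class TenantLimiter:
    """Per-tenant concurrency cap: at most N queries in flight per tenant;
    excess requests are rejected (a real system might queue them)."""
    def __init__(self, per_tenant_limit):
        self.limit = per_tenant_limit
        self.sems = {}
        self.lock = threading.Lock()

    def try_acquire(self, tenant):
        with self.lock:
            sem = self.sems.setdefault(tenant, threading.Semaphore(self.limit))
        return sem.acquire(blocking=False)

    def release(self, tenant):
        self.sems[tenant].release()

cache = PlanCache()
cache.get_plan("SELECT * FROM logs", lambda s: ("plan", s))
cache.get_plan("select *  FROM logs", lambda s: ("plan", s))  # normalized hit
```

Under tens of thousands of concurrent queries, skipping redundant planning and bounding each tenant's in-flight work are what keep one noisy tenant from starving the shared parser and executor pools.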

For high availability and tenant isolation, the platform employs time‑slice scheduling, consistent‑hash‑based query routing, per‑tenant resource monitoring, and layered throttling at task, I/O, and data‑scan levels, providing graceful degradation and progressive result refinement.
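Consistent-hash routing keeps each tenant's queries pinned to a stable node while limiting remapping when nodes join or leave. A minimal ring with virtual nodes (node names and vnode count are illustrative, not the platform's configuration) looks like:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: a key maps to the first node
    clockwise from its hash; adding/removing a node only remaps the
    keys in that node's arcs."""
    def __init__(self, nodes, vnodes=64):
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, tenant):
        h = self._hash(tenant)
        i = bisect.bisect(self.keys, h) % len(self.ring)  # wrap around
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
node = ring.route("tenant-42")   # deterministic: same tenant, same node
```

Stable routing makes per-tenant monitoring and throttling local to one node, which is what allows the layered limits (task, I/O, data-scan) to be enforced without global coordination on every query.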

The article concludes with a summary of how indexing, columnar storage, data locality, and layered caching solved latency; horizontal scaling and storage‑compute separation solved scale; caching and RPC optimizations solved concurrency; and layered throttling ensured high availability and multi‑tenant isolation, supporting thousands of business lines across Alibaba Group.

Future work includes further multi‑tenant resource scheduling optimization, vectorized computation acceleration, and innovative features to improve user experience.

Tags: real-time · Indexing · cloud · Distributed Storage · Log Analysis · LSM
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
