Big Data · 17 min read

Design and Practices of Alibaba Cloud's Billion‑Scale Real‑Time Log Analysis System

This article presents the architecture, core challenges, key design decisions, and future directions of Alibaba Cloud's SLS platform, which handles billions of daily log queries with sub‑300 ms latency by leveraging LSM‑based storage, indexing, columnar layout, distributed caching, and multi‑tenant isolation.

DataFunTalk

The article introduces Alibaba Cloud SLS, a one‑stop observability log service that provides data collection, unified storage (hot and intelligent cold tiers), and real‑time analysis, focusing on the storage and analysis foundations for massive log analytics.

Four core problems are identified: (1) ultra‑low query latency for online real‑time analysis, (2) handling petabyte‑scale data with elastic row‑scan ranges, (3) supporting extreme concurrent query loads (up to 72 k concurrent queries during peak events), and (4) ensuring high availability and tenant isolation in a multi‑tenant cloud environment.

Key design solutions include a three‑layer query paradigm (time range, query, analysis), an LSM‑Tree‑based storage model with projects, logstores, and shards, time‑based sharding, inverted indexing, and columnar storage for selective field loading. Index and columnar storage trade space for time, reducing I/O and accelerating computation.
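To make the space-for-time trade concrete, here is a minimal sketch (not SLS's actual implementation; all names and the sample data are illustrative) of how an inverted index narrows a query to matching rows, and how a columnar layout then loads only the requested field:

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: token -> list of row IDs that contain it."""
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, row_id, tokens):
        for tok in tokens:
            self.postings[tok].append(row_id)

    def query_and(self, *tokens):
        """Rows containing all tokens: intersect the posting lists."""
        lists = [set(self.postings[t]) for t in tokens]
        return sorted(set.intersection(*lists)) if lists else []

# Columnar layout: each field is its own array, so a query that only
# needs `latency_ms` never reads the (typically large) `message` bytes.
columns = {
    "status":     [200, 500, 200, 404],
    "latency_ms": [12, 340, 8, 95],
    "message":    ["ok", "timeout", "ok", "missing"],
}

idx = InvertedIndex()
for row_id, status in enumerate(columns["status"]):
    idx.add(row_id, [f"status:{status}"])

rows = idx.query_and("status:200")            # index prunes rows first
latencies = [columns["latency_ms"][r] for r in rows]  # then load one column
```

The index costs extra space at write time but replaces a full scan with a posting-list intersection, and the columnar layout keeps the subsequent I/O proportional to the fields actually referenced by the query.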

To achieve low latency, the system colocates compute and storage on the same machines, uses domain sockets and shared memory for data exchange, and exploits memory caches, page caches, and shard locality. A multi‑level caching hierarchy (data, index, analysis engine, scheduler, compute, result) further speeds up query planning, execution, and result retrieval.
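The layered-cache idea can be sketched as a simple read-through lookup that checks each level in order and promotes hits upward; this is a minimal illustration under assumed layer names (result, compute, index, data), not the platform's real cache code:

```python
class LayeredCache:
    """Multi-level cache sketch: probe layers fastest-first, promote
    hits into the layers above, and fall through to storage on a miss."""
    def __init__(self, layer_names):
        # each layer is (name, dict); earlier = faster
        self.layers = [(name, {}) for name in layer_names]

    def get(self, key, load_fn):
        for i, (name, store) in enumerate(self.layers):
            if key in store:
                for _, upper in self.layers[:i]:
                    upper[key] = store[key]   # promote toward fast layers
                return store[key], name
        value = load_fn(key)                  # miss everywhere: hit storage
        for _, store in self.layers:
            store[key] = value
        return value, "storage"

cache = LayeredCache(["result", "compute", "index", "data"])
v1, src1 = cache.get("q1", lambda k: "rows...")   # first run: from storage
v2, src2 = cache.get("q1", lambda k: "rows...")   # repeat: result-layer hit
```

A repeated query short-circuits at the result layer and never touches planning, execution, or disk, which is how caching at every stage compounds into sub-second latencies for hot queries.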

Scalability is addressed by separating storage and compute, distributing LSM blocks across many nodes, and horizontally scaling I/O, memory, and CPU resources. Affinity‑based scheduling and load‑balancing mitigate hotspot and latency issues introduced by the separation.
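Affinity-based scheduling can be sketched as "prefer the node that last served this shard (its caches are warm), unless it is overloaded." The scheduler below is a hypothetical simplification; node names, the load counter, and the overload threshold are all illustrative:

```python
class AffinityScheduler:
    """Route a shard's queries to its previous node for cache locality;
    fall back to the least-loaded node when that node is saturated."""
    def __init__(self, nodes, max_load=10):
        self.load = {n: 0 for n in nodes}   # in-flight queries per node
        self.affinity = {}                  # shard -> preferred node
        self.max_load = max_load

    def pick(self, shard):
        node = self.affinity.get(shard)
        if node is None or self.load[node] >= self.max_load:
            node = min(self.load, key=self.load.get)  # least-loaded fallback
            self.affinity[shard] = node               # record new preference
        self.load[node] += 1
        return node

    def done(self, node):
        self.load[node] -= 1

sched = AffinityScheduler(["node-a", "node-b"])
n1 = sched.pick("shard-7")    # first assignment: least-loaded node
sched.done(n1)
n2 = sched.pick("shard-7")    # repeat query: same node, warm caches
```

The overload check is what mitigates hotspots: a shard that becomes hot spills onto other nodes instead of pinning all its traffic to one machine.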

High‑concurrency challenges are mitigated by caching SQL parsing and planning results, reusing long‑lived RPC connections instead of short‑lived HTTP connections, and limiting per‑tenant concurrency through distributed query queues and rate‑limiting mechanisms.
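Two of these mechanisms are easy to sketch: a plan cache keyed by normalized SQL text, and a per-tenant in-flight cap. Both are minimal illustrations under assumed names, not SLS internals:

```python
import threading

class PlanCache:
    """Cache parsed/planned queries by normalized SQL text, so a hot
    query pays the parse+plan cost only once."""
    def __init__(self):
        self.plans = {}
        self.hits = 0

    def get_plan(self, sql, plan_fn):
        key = " ".join(sql.lower().split())   # crude normalization
        if key in self.plans:
            self.hits += 1
        else:
            self.plans[key] = plan_fn(sql)
        return self.plans[key]

class TenantLimiter:
    """Per-tenant concurrency cap: at most N queries in flight per tenant;
    excess requests are rejected (a real system might queue them)."""
    def __init__(self, per_tenant_limit):
        self.limit = per_tenant_limit
        self.sems = {}
        self.lock = threading.Lock()

    def try_acquire(self, tenant):
        with self.lock:
            sem = self.sems.setdefault(tenant, threading.Semaphore(self.limit))
        return sem.acquire(blocking=False)

    def release(self, tenant):
        self.sems[tenant].release()

cache = PlanCache()
cache.get_plan("SELECT * FROM logs", lambda s: ("plan", s))
cache.get_plan("select *  FROM logs", lambda s: ("plan", s))  # normalized hit
```

Under tens of thousands of concurrent queries, skipping redundant planning and bounding each tenant's in-flight work are what keep one noisy tenant from starving the shared parser and executor pools.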

For high availability and tenant isolation, the platform employs time‑slice scheduling, consistent‑hash‑based query routing, per‑tenant resource monitoring, and layered throttling at task, I/O, and data‑scan levels, providing graceful degradation and progressive result refinement.
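Consistent-hash routing keeps each tenant's queries pinned to a stable node while limiting remapping when nodes join or leave. A minimal ring with virtual nodes (node names and vnode count are illustrative, not the platform's configuration) looks like:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes: a key maps to the first node
    clockwise from its hash; adding/removing a node only remaps the
    keys in that node's arcs."""
    def __init__(self, nodes, vnodes=64):
        self.ring = []
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, tenant):
        h = self._hash(tenant)
        i = bisect.bisect(self.keys, h) % len(self.ring)  # wrap around
        return self.ring[i][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
node = ring.route("tenant-42")   # deterministic: same tenant, same node
```

Stable routing makes per-tenant monitoring and throttling local to one node, which is what allows the layered limits (task, I/O, data-scan) to be enforced without global coordination on every query.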

The article concludes with a summary of how indexing, columnar storage, data locality, and layered caching solved latency; horizontal scaling and storage‑compute separation solved scale; caching and RPC optimizations solved concurrency; and layered throttling ensured high availability and multi‑tenant isolation, supporting thousands of business lines across Alibaba Group.

Future work includes further multi‑tenant resource scheduling optimization, vectorized computation acceleration, and innovative features to improve user experience.

Tags: real-time · Indexing · cloud · Distributed Storage · Log Analysis · LSM
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
