
Design and Practice of Alibaba Cloud's Billion‑Scale Real‑Time Log Analysis

This article presents the architecture behind billion-scale real-time log analysis in Alibaba Cloud's SLS. It covers the business background; the core challenges of low-latency queries, massive data scale, high concurrency, and multi-tenant isolation; the key design solutions, including LSM-based storage, inverted indexes with columnar storage, data locality, and layered caching; and future directions.

DataFunSummit

Alibaba Cloud SLS (Log Service) is a one-stop observability platform that provides data collection, processing, and delivery; its data-ingestion tool, iLogtail, is fully open source. The service offers unified hot and cold storage, intelligent tiering, and query and analysis functions such as ad-hoc and correlation analysis.

The system faces four core challenges: (1) low query latency, with online real-time analysis expected to return within seconds; (2) massive data scale, ranging from millions to hundreds of billions of rows per day; (3) extremely high query concurrency, with up to 72,000 concurrent queries and tens of thousands of QPS during peak events such as Double 11; and (4) high availability and tenant isolation in a multi-tenant cloud environment.

The key design solutions start with a three-part query paradigm: a query statement, an analysis SQL statement, and a time range. Storage follows an LSM-Tree-based model organized in a three-level hierarchy (project → logstore → shard), with shards partitioned by time. Inverted indexes locate matching data quickly, and columnar storage lets a query load only the columns it needs. For data locality, compute nodes run on the same machine as storage nodes and communicate over domain sockets and shared memory. A layered cache operates at the index, column, metadata, plan, and result levels, and long-lived RPC connections replace short-lived HTTP connections to reduce connection overhead. Storage-compute separation and horizontal scaling address the data-scale challenge, while affinity scheduling and load balancing mitigate hotspots and latency.
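The interplay of the three-part query, the inverted index, and columnar storage can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration, not SLS's actual implementation or API: the `Shard` class, the `run_query` function, the `field:value` term format, and the `__time__` column name are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Toy shard (hypothetical layout): an inverted index plus per-column storage."""
    inverted: dict = field(default_factory=lambda: defaultdict(set))  # term -> row ids
    columns: dict = field(default_factory=lambda: defaultdict(dict))  # column -> {row id: value}
    rows: int = 0

    def append(self, ts, record):
        """Store one log record: index every field:value term and write each column."""
        row_id = self.rows
        self.columns["__time__"][row_id] = ts
        for name, value in record.items():
            self.columns[name][row_id] = value
            self.inverted[f"{name}:{value}"].add(row_id)
        self.rows += 1


def run_query(shard, terms, group_by, time_range):
    """Three-part query: search terms, an analysis step (count per group), a time range."""
    start, end = time_range

    # 1. Query statement: intersect posting lists from the inverted index.
    postings = [shard.inverted.get(term, set()) for term in terms]
    matched = set.intersection(*postings) if postings else set(range(shard.rows))

    # 2. Time range: keep rows whose timestamp falls in [start, end).
    times = shard.columns["__time__"]
    matched = {row for row in matched if start <= times[row] < end}

    # 3. Analysis: load only the column being aggregated, not whole records.
    column = shard.columns[group_by]
    counts = defaultdict(int)
    for row in matched:
        if row in column:
            counts[column[row]] += 1
    return dict(counts)


shard = Shard()
shard.append(100, {"status": "500", "host": "a"})
shard.append(110, {"status": "200", "host": "a"})
shard.append(120, {"status": "500", "host": "b"})
shard.append(300, {"status": "500", "host": "a"})

# Count error rows per host within the window [100, 200).
errors_by_host = run_query(shard, ["status:500"], "host", (100, 200))
```

In this sketch the index narrows the candidate rows before any column is read, and the aggregation touches a single column, which is the property that makes the index-plus-column layout attractive for selective loading.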

Future work focuses on further optimizing multi‑tenant resource scheduling, introducing vectorized execution to accelerate computation, and exploring innovative features to enhance user experience.

In the Q&A, the speaker explains how inverted indexes are combined with memory for efficient retrieval, how latency is handled for hot versus cold data, and how blocks are split based on size or entry count.
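As a rough illustration of splitting by size or entry count, the sketch below closes a block as soon as either limit would be exceeded. The function name `split_into_blocks` and both thresholds are hypothetical; the talk summary does not state the actual limits or whether the two criteria are combined this way.

```python
def split_into_blocks(entries, max_bytes=64, max_entries=3):
    """Group log entries into blocks, closing a block when either limit is hit.

    max_bytes and max_entries are illustrative values, not SLS settings.
    """
    blocks, current, current_bytes = [], [], 0
    for entry in entries:
        size = len(entry.encode("utf-8"))
        over_size = current_bytes + size > max_bytes
        over_count = len(current) >= max_entries
        # Close the current block before adding an entry that would exceed a limit.
        if current and (over_size or over_count):
            blocks.append(current)
            current, current_bytes = [], 0
        current.append(entry)
        current_bytes += size
    if current:
        blocks.append(current)
    return blocks


# Five 10-byte entries: the byte limit closes blocks in one call, the count limit in the other.
by_size = split_into_blocks(["a" * 10] * 5, max_bytes=25, max_entries=3)
by_count = split_into_blocks(["a" * 10] * 5, max_bytes=1000, max_entries=3)
```

An entry larger than `max_bytes` still gets a block of its own in this sketch, so no data is dropped.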

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Big Data, indexing, high concurrency, distributed storage, multi-tenant isolation, real-time log analysis
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
