Design and Implementation of Alibaba Cloud's 10PB+ Daily Log Service

This article presents an in‑depth interview with Alibaba Cloud senior expert Sun Tingtao, detailing the architecture, core features, design challenges, and operational strategies of the Alibaba Cloud Log Service that handles over 10 PB of daily log data for massive, diverse production workloads.

Big Data Technology & Architecture

Jun 3, 2019

In an ArchSummit interview, Alibaba Cloud senior technical expert Sun Tingtao (aka Longwu) discusses the design and implementation of Alibaba Cloud Log Service, a foundational log platform that provides data collection, real‑time indexing, and analysis for petabyte‑scale workloads across the Alibaba Group.

Core Functions

The service offers comprehensive capabilities:

Collection: the Logtail agent; SDK and Producer libraries for Java, C, Go, iOS, Android, and web; global acceleration via DCDN.

Storage & Consumption: ConsumerGroup with auto load‑balancing, failover, checkpoint persistence; integration with Blink/Flink/Storm/Spark Streaming; Shipper for OSS and MaxCompute.

Query & Analysis: supports Text, JSON, Double, and Long field types; Chinese‑language text; full‑text and key‑value search; queries at the scale of hundreds of billions of log entries; DevOps features such as Live Tail, LogReduce, anomaly detection, and root‑cause analysis; SQL92 syntax, interactive queries, machine learning, and security functions.

Visualization: Native dashboards, dozens of chart types, JDBC interface for DataV, Grafana, QuickBI.
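
The ConsumerGroup behavior listed under Storage & Consumption (load balancing, failover, checkpoint persistence) can be illustrated with a toy coordinator. This is a minimal sketch under stated assumptions, not the service's actual client library: the class name, method names, and round-robin assignment policy are all hypothetical.

```python
from collections import defaultdict


class ConsumerGroup:
    """Toy coordinator (hypothetical): assigns shards and stores checkpoints."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.consumers = []      # live consumer names, in join order
        self.checkpoints = {}    # shard -> next offset to read

    def join(self, consumer):
        if consumer not in self.consumers:
            self.consumers.append(consumer)

    def leave(self, consumer):
        # Called on graceful shutdown or when a heartbeat times out.
        self.consumers.remove(consumer)

    def assignment(self):
        """Spread shards round-robin over live consumers (assumed policy)."""
        if not self.consumers:
            return {}
        owned = defaultdict(list)
        for i, shard in enumerate(self.shards):
            owned[self.consumers[i % len(self.consumers)]].append(shard)
        return dict(owned)

    def commit(self, shard, offset):
        # Checkpoints live with the group, not the consumer, so they
        # survive the loss of whichever consumer wrote them.
        self.checkpoints[shard] = offset

    def resume_offset(self, shard):
        return self.checkpoints.get(shard, 0)


group = ConsumerGroup(shards=[0, 1, 2, 3])
group.join("worker-a")
group.join("worker-b")
before = group.assignment()      # shards split evenly between the two workers
group.commit(1, 500)             # worker-b has processed shard 1 up to offset 500
group.leave("worker-b")          # simulate a crash
after = group.assignment()       # worker-a now owns every shard
resume = group.resume_offset(1)  # worker-a continues shard 1 from offset 500
```

Because the checkpoint is keyed by shard rather than by consumer, the surviving worker resumes from the committed offset instead of re-reading the shard from the start.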

Characteristics of Production‑grade Logs

Massive scale: tens of PB generated daily.

Diverse types: access logs, system logs, application logs, etc.

Multiple sources: servers, network devices, embedded systems, web, Docker, mobile apps.

Broad usage: monitoring, diagnostics, analytics, reporting.

High reliability requirements.

Peak loads during events like Double‑11.

The service unifies collection, storage, query, and analysis behind one platform, which simplifies user interaction and lets users focus on extracting value from log data.
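
To put the daily volume quoted above in perspective, the following back‑of‑the‑envelope calculation converts 10 PB per day into an average ingest rate. It assumes decimal units (1 PB = 10^15 bytes) and a perfectly even load, so it is only a lower bound on what the service must sustain at peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
PETABYTE = 10 ** 15              # decimal petabyte, an assumption
GIGABYTE = 10 ** 9


def average_ingest_rate(petabytes_per_day):
    """Average bytes per second for a given daily volume."""
    return petabytes_per_day * PETABYTE / SECONDS_PER_DAY


rate_gb_per_s = average_ingest_rate(10) / GIGABYTE
print(f"{rate_gb_per_s:.1f} GB/s")
```

At 10 PB per day the average works out to roughly 116 GB/s, before accounting for replication, indexing overhead, or peak events.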

Core Design Requirements

High reliability to avoid data loss.

Resource isolation for QoS.

Cost‑effective performance for PB‑scale ingestion and indexing.

Manageability of millions of client agents and version upgrades.

Fast query and analysis over massive datasets.

System Architecture

Client agents for data collection.

Front‑end module exposing RESTful APIs.

Storage and indexing modules.

Query and analysis modules.

Meta‑information management.

Monitoring and operations management.

Key Design Challenges

Balancing performance and cost: a custom distributed engine that uses succinct trees for dictionary encoding, hybrid bitmaps for inverted indexes, an improved BKD‑Tree for numeric indexing, and SIMD acceleration.

Ensuring always‑available service in large‑scale clusters: automatic fault isolation, multi‑tenant resource sharing, dynamic QoS adjustments.

Designing storage and index structures for trillion‑scale logs: multi‑layer cache, reduced inter‑node data transfer, coroutine‑based compute, adaptive compaction.

Managing millions of agents and priority‑based flow control during peak events.
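
The idea of combining dictionary encoding with bitmap‑based inverted indexes can be sketched in a few lines. This is an illustration only: it uses a plain Python dict in place of a succinct tree and an uncompressed integer bitmap in place of the hybrid bitmap the interview mentions, and all names are hypothetical.

```python
class BitmapIndex:
    """Toy inverted index: term -> integer id, id -> bitmap of row numbers."""

    def __init__(self):
        self.term_ids = {}   # dictionary encoding: term -> small integer id
        self.postings = []   # postings[id] is a Python int used as a bitmap
        self.rows = []       # raw log lines, addressed by row number

    def add(self, line):
        row = len(self.rows)
        self.rows.append(line)
        for term in set(line.lower().split()):
            term_id = self.term_ids.setdefault(term, len(self.postings))
            if term_id == len(self.postings):   # first time this term is seen
                self.postings.append(0)
            self.postings[term_id] |= 1 << row  # set the bit for this row

    def _bitmap(self, term):
        term_id = self.term_ids.get(term.lower())
        return 0 if term_id is None else self.postings[term_id]

    def search_all(self, *terms):
        """Rows containing every term: AND the bitmaps, then decode set bits."""
        if not terms:
            return []
        bits = self._bitmap(terms[0])
        for term in terms[1:]:
            bits &= self._bitmap(term)
        return [row for row in range(len(self.rows)) if (bits >> row) & 1]


index = BitmapIndex()
index.add("GET /api/orders 200")
index.add("GET /api/users 500")
index.add("POST /api/orders 500")
hits = index.search_all("500", "/api/orders")   # only the third line matches both
```

A conjunctive query reduces to a bitwise AND over posting bitmaps, which is the kind of operation that compressed bitmap formats and SIMD instructions are designed to accelerate.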

Stability Guarantees

All modules scale horizontally. The front end intercepts abnormal traffic, and shard‑level flow control confines the impact of an overload to 1‑3% of total traffic. Internally, automatic scheduling balances load, resources are isolated per query, and a comprehensive monitoring system tracks key metrics.
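
One common way to implement shard‑level flow control is a token bucket per shard, so that a single hot shard is throttled without affecting its neighbors. The sketch below assumes that technique; the interview does not say which algorithm the service uses, and the class name and limits are hypothetical.

```python
class ShardRateLimiter:
    """One token bucket per shard (assumed technique, not the service's code)."""

    def __init__(self, bytes_per_second, burst_bytes):
        self.rate = bytes_per_second
        self.burst = burst_bytes
        self.buckets = {}   # shard id -> (tokens available, time of last refill)

    def allow(self, shard, size, now):
        """Return True if a write of `size` bytes to `shard` is admitted at `now`."""
        tokens, last = self.buckets.get(shard, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if size <= tokens:
            self.buckets[shard] = (tokens - size, now)
            return True
        self.buckets[shard] = (tokens, now)
        return False


limiter = ShardRateLimiter(bytes_per_second=100, burst_bytes=200)
first = limiter.allow("shard-1", 150, now=0.0)     # admitted, 50 tokens remain
second = limiter.allow("shard-1", 100, now=0.0)    # rejected, shard-1 is over budget
neighbor = limiter.allow("shard-2", 100, now=0.0)  # admitted, shard-2 is unaffected
later = limiter.allow("shard-1", 100, now=1.0)     # admitted after one second of refill
```

Passing the clock in as `now` keeps the example deterministic; a real limiter would read a monotonic clock instead.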

Log Data Processing Use Cases

Real‑time query and analysis with SQL, context, Live Tail, LogReduce, anomaly detection, root‑cause analysis.

Visual analytics via custom canvases and interactive dashboards.

Alerting with SMS, email, webhook, DingTalk integration.

Secondary development for tracing, customer service, and other applications.

Integration with Other Systems

Streaming systems: Flink/Blink, Storm, and Spark Streaming, for which the service provides consumption libraries.

Offline systems: archiving to OSS or MaxCompute for deeper analysis.

DataWorks: real‑time consumption and export to databases like MySQL.

Serverless: Function Compute for custom log processing without managing infrastructure.

The architecture and operational practices described illustrate how Alibaba Cloud Log Service achieves high reliability, scalability, and cost efficiency for massive, heterogeneous log workloads.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
