Design and Implementation of Alibaba Cloud's 10PB+ Daily Log Service
This article presents an in‑depth interview with Alibaba Cloud senior expert Sun Tingtao, detailing the architecture, core features, design challenges, and operational strategies of the Alibaba Cloud Log Service that handles over 10 PB of daily log data for massive, diverse production workloads.
In an ArchSummit interview, Alibaba Cloud senior technical expert Sun Tingtao (aka Longwu) discusses the design and implementation of Alibaba Cloud Log Service, a foundational log platform that provides data collection, real‑time indexing, and analysis for petabyte‑scale workloads across the Alibaba Group.
Core Functions
The service offers comprehensive capabilities:
Collection: the Logtail agent; SDK and Producer libraries for Java, C, Go, iOS, Android, and web; global acceleration via DCDN.
Storage & Consumption: ConsumerGroup with auto load‑balancing, failover, checkpoint persistence; integration with Blink/Flink/Storm/Spark Streaming; Shipper for OSS and MaxCompute.
Query & Analysis: Text, JSON, Double, and Long field types; Chinese-language support; full-text and key-value search; queries at hundred-billion-entry scale; DevOps features such as Live Tail, LogReduce, anomaly detection, and root-cause analysis; SQL-92, interactive queries, machine learning, and security functions.
Visualization: Native dashboards, dozens of chart types, JDBC interface for DataV, Grafana, QuickBI.
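The ConsumerGroup behavior listed above (load balancing, failover, checkpoint persistence) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class and method names are hypothetical and are not the Log Service SDK, the round-robin assignment is a stand-in for whatever balancing policy the service really uses, and checkpoints are kept in a dictionary rather than persisted on the server side.

```python
from dataclasses import dataclass, field


@dataclass
class ConsumerGroup:
    """Toy model (hypothetical, not the real SDK): shards are split across
    members and per-shard progress is remembered as a checkpoint."""
    shards: list[int]
    members: list[str] = field(default_factory=list)
    checkpoints: dict[int, int] = field(default_factory=dict)  # shard -> next offset

    def assignment(self) -> dict[str, list[int]]:
        """Load balancing: round-robin the sorted shards over the sorted members."""
        owners = sorted(self.members)
        plan: dict[str, list[int]] = {m: [] for m in owners}
        for i, shard in enumerate(sorted(self.shards)):
            if owners:
                plan[owners[i % len(owners)]].append(shard)
        return plan

    def join(self, member: str) -> None:
        if member not in self.members:
            self.members.append(member)

    def leave(self, member: str) -> None:
        """Failover: once a member is gone, its shards go to the survivors."""
        self.members.remove(member)

    def commit(self, shard: int, next_offset: int) -> None:
        """Record progress so a new owner does not start from the beginning."""
        self.checkpoints[shard] = next_offset

    def resume_offset(self, shard: int) -> int:
        """Where the current owner of a shard should continue reading."""
        return self.checkpoints.get(shard, 0)


group = ConsumerGroup(shards=[0, 1, 2, 3])
group.join("worker-a")
group.join("worker-b")
plan_before = group.assignment()   # shards split between both workers
group.commit(1, 120)               # worker-b has consumed shard 1 up to offset 120
group.leave("worker-b")            # simulate a crash
plan_after = group.assignment()    # worker-a now owns every shard
```

After the simulated failure, worker-a takes over shard 1 and resumes at offset 120 instead of re-reading the shard, which is the point of storing checkpoints outside the consumer process.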
Characteristics of Production‑grade Logs
Massive scale: tens of PB generated daily.
Diverse types: access logs, system logs, application logs, etc.
Multiple sources: servers, network devices, embedded systems, web, Docker, mobile apps.
Broad usage: monitoring, diagnostics, analytics, reporting.
High reliability requirements.
Peak loads during events like Double‑11.
The service unifies collection, storage, query, and analysis behind one interface, so that users can spend less effort on plumbing and more on extracting value from log data.
Core Design Requirements
High reliability to avoid data loss.
Resource isolation for QoS.
Cost‑effective performance for PB‑scale ingestion and indexing.
Manageability of millions of client agents and version upgrades.
Fast query and analysis over massive datasets.
System Architecture
Client agents for data collection.
Front‑end module exposing RESTful APIs.
Storage and indexing modules.
Query and analysis modules.
Meta‑information management.
Monitoring and operations management.
Key Design Challenges
Balancing performance and cost: custom distributed engine using succinct trees for dictionary encoding, hybrid bitmap for inverted indexes, improved BKD‑Tree for numeric indexing, SIMD acceleration.
Ensuring always‑available service in large‑scale clusters: automatic fault isolation, multi‑tenant resource sharing, dynamic QoS adjustments.
Designing storage and index structures for trillion‑scale logs: multi‑layer cache, reduced inter‑node data transfer, coroutine‑based compute, adaptive compaction.
Managing millions of agents and priority‑based flow control during peak events.
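To make the inverted-index idea above concrete, here is a minimal sketch of an index whose posting lists are bitmaps. It is a simplification under stated assumptions: plain, uncompressed bitmaps stored in Python integers replace the hybrid bitmap encoding, a hash map replaces the succinct-tree dictionary, and tokenization is a whitespace split. The source does not describe the real encodings, so none of this should be read as the service's implementation.

```python
from collections import defaultdict


class BitmapIndex:
    """Inverted index: each token maps to a bitmap where bit i means
    'log line i contains this token'."""

    def __init__(self) -> None:
        self.postings: dict[str, int] = defaultdict(int)
        self.docs: list[str] = []

    def add(self, line: str) -> int:
        """Index one log line and return its id."""
        doc_id = len(self.docs)
        self.docs.append(line)
        for token in set(line.lower().split()):
            self.postings[token] |= 1 << doc_id
        return doc_id

    def search_all(self, *terms: str) -> list[int]:
        """AND query: intersect the bitmaps of all terms with bitwise AND."""
        if not terms:
            return []
        bits = -1  # every bit set, so the first AND keeps the first bitmap
        for term in terms:
            bits &= self.postings.get(term.lower(), 0)
        return [i for i in range(len(self.docs)) if (bits >> i) & 1]


index = BitmapIndex()
index.add("GET /api/orders 200")
index.add("GET /api/users 500")
index.add("POST /api/orders 500")

hits_get = index.search_all("GET")                   # lines 0 and 1
hits_failed_orders = index.search_all("500", "/api/orders")  # line 2 only
hits_none = index.search_all("missing")              # no match
```

The appeal of bitmaps is that a multi-term query becomes a handful of bitwise operations, which is also the kind of work SIMD instructions speed up.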
Stability Guarantees
All modules are horizontally scalable. The front end intercepts abnormal traffic, and shard-level flow control limits the impact of such traffic to 1-3% of total traffic. Internally, automatic scheduling balances load, resources are isolated per query, and a comprehensive monitoring system tracks key metrics.
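Shard-level flow control can be sketched as one token bucket per shard, so that a burst on one shard is rejected without affecting the others. The token-bucket algorithm, the class name, and the rate and burst values are assumptions chosen for illustration; the source does not say which algorithm or limits the service uses.

```python
class ShardRateLimiter:
    """One token bucket per shard (illustrative; not the service's algorithm)."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # bucket capacity
        self.buckets: dict[int, tuple[float, float]] = {}  # shard -> (tokens, last_ts)

    def allow(self, shard: int, cost: float, now: float) -> bool:
        """Return True and spend `cost` tokens if the shard has enough budget."""
        tokens, last = self.buckets.get(shard, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if cost <= tokens:
            self.buckets[shard] = (tokens - cost, now)
            return True
        self.buckets[shard] = (tokens, now)
        return False


limiter = ShardRateLimiter(rate_per_sec=10, burst=20)
results = [
    limiter.allow(0, 15, now=0.0),  # shard 0 spends 15 of its 20 tokens
    limiter.allow(0, 10, now=0.0),  # shard 0 has 5 left: rejected
    limiter.allow(1, 10, now=0.0),  # shard 1 has its own bucket: accepted
    limiter.allow(0, 10, now=1.0),  # one second later shard 0 has refilled to 15
]
```

Timestamps are passed in explicitly to keep the example deterministic; a caller would normally supply `time.monotonic()`.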
Log Data Processing Use Cases
Real‑time query and analysis with SQL, context, Live Tail, LogReduce, anomaly detection, root‑cause analysis.
Visual analytics via custom canvases and interactive dashboards.
Alerting with SMS, email, webhook, DingTalk integration.
Secondary development for tracing, customer service, and other applications.
Integration with Other Systems
Streaming systems: Flink/Blink, Storm, Spark Streaming (provides consumption libraries).
Offline systems: archiving to OSS or MaxCompute for deeper analysis.
DataWorks: real‑time consumption and export to databases like MySQL.
Serverless: Function Compute for custom log processing without managing infrastructure.
The architecture and operational practices described illustrate how Alibaba Cloud Log Service achieves high reliability, scalability, and cost efficiency for massive, heterogeneous log workloads.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
