How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions
Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.
Introduction
In fiscal year 2016, Alibaba’s e‑commerce transaction volume exceeded 3 trillion RMB, with the Double 11 shopping festival generating 120.7 billion RMB in a single day. Data at this volume, arriving at per‑second granularity, requires a robust monitoring solution, which is the focus of this article.
Alibaba Monitoring Landscape
The group‑level monitoring platform is built entirely in‑house, covering about 80% of Alibaba’s monitoring needs. Individual business units also develop specialized systems such as GoldenEye (advertising), Prism (Cainiao), Tianji (Alibaba Cloud), and EagleEye (middleware).
1. SunFire Monitoring Technical Implementation
SunFire is a comprehensive, real‑time log analysis solution that collects data via logs, REST APIs, and shell scripts, offering device‑, application‑, and business‑level monitoring views.
Key Advantages
All‑view real‑time monitoring, with per‑second granularity for critical metrics and per‑minute granularity for normal metrics.
Flexible alarm rules based on business characteristics, time windows, and severity.
Simple management with minute‑level deployment for tens of thousands of devices and automatic fault recovery.
Customizable configuration for alarms and dashboards.
Rich visual dashboards for personalized monitoring panels.
Low resource consumption on host CPU and memory.
1.1 Collection (Agent)
The Agent runs on each application host, performing raw log collection and command execution without any computation logic. It emphasizes low CPU usage by:
Compressing logs to reduce cross‑data‑center bandwidth.
Leveraging zero‑copy file transfer (sendfile) to minimize user‑space copying.
Implementing a binary‑search‑based offset finder (LogFinder) to locate log segments efficiently, keeping CPU usage below 5%.
The Agent also handles log rotation and provides two query modes: first‑query, which discovers the starting offset, and ordinary‑query, which reads sequentially from that point on.
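The first‑query mode can be illustrated with a minimal sketch. It assumes every log line begins with a fixed‑width, lexicographically sortable timestamp and that lines are written in time order. The name find_offset, the TS_LEN constant, and the log format are hypothetical and are not taken from SunFire's actual LogFinder.

```python
import io

TS_LEN = 19  # assumed timestamp width, e.g. b"2016-11-11 00:00:01"


def _line_start_at_or_after(f, pos, size):
    """Return the offset of the first line that starts at or after byte pos."""
    if pos <= 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume up to and including the newline covering byte pos-1
    return min(f.tell(), size)


def find_offset(f, target_ts):
    """Binary-search a time-ordered log (opened in binary mode) for the
    offset of the first line whose timestamp is >= target_ts.
    Returns the file size if no such line exists."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        start = _line_start_at_or_after(f, mid, size)
        f.seek(start)
        if start >= size or f.read(TS_LEN) >= target_ts:
            hi = mid        # the answer is at or before this probe
        else:
            lo = mid + 1    # the probed line is too old; look later
    return _line_start_at_or_after(f, lo, size)
```

Each probe reads at most one line, so locating the start point costs a logarithmic number of small reads instead of a scan of the whole file. Subsequent ordinary queries can then read sequentially from the returned offset.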
1.2 Computation
The computation layer consists of Map and Reduce components that process collected logs. It adopts a fully asynchronous design in which lightweight Akka actors serve as coroutines, avoiding thread‑pool contention and lock‑based bottlenecks.
All I/O is performed with non‑blocking NIO, so no thread sits idle waiting on I/O.
Tasks are grouped per core to maximize CPU utilization while keeping I/O off the critical path.
A plugin‑oriented, period‑driven scheduler generates a topology for each time slice, installing Map and Reduce coroutines across the cluster.
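The asynchronous Map/Reduce flow can be sketched as follows. This is only an illustration: it uses Python's asyncio as a stand‑in for Akka, and pull_logs, map_task, reduce_task, and the sample log lines are all hypothetical.

```python
import asyncio
from collections import Counter


async def pull_logs(host):
    """Stand-in for a non-blocking Agent pull (assumed data, no real network)."""
    await asyncio.sleep(0)  # yield to the event loop instead of blocking a thread
    return [f"{host} status=200", f"{host} status=500", f"{host} status=200"]


async def map_task(host):
    """Map: pull one host's lines and count them by status code."""
    lines = await pull_logs(host)
    return Counter(line.split("status=")[1] for line in lines)


async def reduce_task(hosts):
    """Reduce: run all Map tasks concurrently and merge their partial counts."""
    partials = await asyncio.gather(*(map_task(h) for h in hosts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_period(hosts):
    """Run one Map/Reduce round on a single event loop."""
    return asyncio.run(reduce_task(hosts))
```

Because every wait is an await rather than a blocking call, one thread can interleave many Map tasks, which is the property the design above relies on.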
Period‑Driven Scheduling
Each period (e.g., a minute) creates an isolated task topology: Brain selects a leader, reads user configurations, and builds a topology object containing plugins, input sources, and Map/Reduce counts. Reduce installs tasks on Map nodes, which in turn launch Agent coroutines to pull logs, parse them, and forward results upstream.
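A per‑period topology object might look like the sketch below. The field names, the configuration shape, and the hosts_per_map sizing rule are assumptions made for illustration; the source does not specify how SunFire sizes or represents its topologies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    period_start: int   # epoch seconds, aligned to the period boundary
    plugins: tuple      # plugin names taken from user configurations
    inputs: tuple       # distinct log sources to pull
    map_count: int
    reduce_count: int


def align_to_period(now, period_seconds=60):
    """Round a timestamp down to the start of its period."""
    return now - (now % period_seconds)


def build_topology(now, user_configs, hosts_per_map=2, period_seconds=60):
    """Build an isolated topology for the period containing `now`."""
    inputs = sorted({src for cfg in user_configs for src in cfg["inputs"]})
    plugins = sorted({cfg["plugin"] for cfg in user_configs})
    # Ceiling division: enough Map tasks to cover every input (assumed rule).
    map_count = max(1, -(-len(inputs) // hosts_per_map))
    return Topology(align_to_period(now, period_seconds),
                    tuple(plugins), tuple(inputs), map_count, 1)
```

Making the topology immutable and keyed by its period start keeps each time slice independent, so a slow or failed period does not affect the next one.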
Task Retry and Fault Tolerance
Supervision is built into the topology: each upstream component monitors its downstream peers via Terminated events. If a Brain, Reduce, or Map instance fails, the supervisor recreates the missing component from the stored topology, so a failure may delay results but does not lose data.
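The recreate‑on‑failure idea can be sketched as below, again with asyncio standing in for Akka supervision. The supervise function, the spec dictionary, and the flaky_map component are hypothetical; a Python exception plays the role of a Terminated event.

```python
import asyncio


async def supervise(spec, start_component, max_restarts=3):
    """Run a component; if it fails, recreate it from the stored spec.
    Re-raises the last error once max_restarts is exhausted."""
    for attempt in range(max_restarts + 1):
        try:
            return await start_component(spec, attempt)
        except Exception:
            if attempt == max_restarts:
                raise
            # Otherwise loop again: the spec is unchanged, so the
            # replacement component redoes exactly the same work.


async def flaky_map(spec, attempt):
    """Hypothetical Map component that is lost on its first attempt."""
    if attempt == 0:
        raise RuntimeError("map node lost")
    return f"{spec['input']} processed on attempt {attempt}"


def demo():
    return asyncio.run(supervise({"input": "h1:/app.log"}, flaky_map))
```

Since the replacement is built from the same stored description, the period's result arrives late rather than going missing.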
Input Sharing
When multiple user configurations require the same log source, SunFire shares the input to avoid redundant pulls. This is achieved by:
Analyzing topology during installation to detect shared inputs.
Using consistent hashing to assign the same log source to the same Map node across tasks.
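A minimal consistent‑hash ring shows how the same log source can land on the same Map node regardless of which task asks. The HashRing class, the use of MD5, and the virtual‑node count are assumptions for illustration, not details from SunFire.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points to even out the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, source):
        """Return the node owning the first ring point clockwise from source."""
        idx = bisect.bisect_right(self._keys, _hash(source)) % len(self._keys)
        return self._ring[idx][1]
```

Two tasks that both need, say, h1:/app.log resolve to the same Map node, so the log is pulled once and shared. When a node that does not own a source is removed, that source's assignment is unchanged.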
2. Other Components
Storage
Computation results are persisted in Alibaba’s HBase, which provides horizontal scalability and low‑latency queries. User‑defined storage can also use MongoDB.
Visualization
The front‑end provides customizable dashboards built on a plugin architecture, enabling rapid creation of new monitoring products.
Self‑Management
OPS‑Agent and OPS‑Web automate the deployment, health checking, and capacity monitoring of millions of agents across the platform.
Conclusion
SunFire has been productized to deliver second‑level monitoring capabilities to developers and operations teams, offering a flexible, low‑overhead, and highly reliable solution for Alibaba’s massive online services.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.