
How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba Cloud Developer

Introduction

In fiscal year 2016, Alibaba’s e‑commerce transaction volume exceeded 3 trillion RMB, and the Double 11 shopping festival alone generated 120.7 billion RMB in a single day. Data at that volume, arriving every second, requires a robust monitoring solution, which is the focus of this article.

Alibaba Monitoring Landscape

The group‑level monitoring platform is built entirely in‑house, covering about 80% of Alibaba’s monitoring needs. Individual business units also develop specialized systems such as GoldenEye (advertising), Prism (Cainiao), Tianji (Alibaba Cloud), and EagleEye (middleware).

1. SunFire Monitoring Technical Implementation

SunFire is a comprehensive, real‑time log analysis solution that collects data via logs, REST APIs, and shell scripts, offering device‑, application‑, and business‑level monitoring views.

Key Advantages

Full‑view real‑time monitoring, with critical metrics at one‑second granularity and routine metrics at one‑minute granularity.

Flexible alarm rules based on business characteristics, time windows, and severity.

Simple management: deployment to tens of thousands of devices completes within minutes, and faults recover automatically.

Customizable configuration for alarms and dashboards.

Rich visual dashboards for personalized monitoring panels.

Low resource consumption on host CPU and memory.

1.1 Collection (Agent)

The Agent runs on each application host, performing raw log collection and command execution without any computation logic. It emphasizes low CPU usage by:

Compressing logs to reduce cross‑data‑center bandwidth.

Leveraging zero‑copy file transfer (sendfile) to minimize user‑space copying.

Implementing a binary‑search‑based offset finder (LogFinder) to locate log segments efficiently, keeping CPU usage below 5%.
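
The offset search in the last item can be sketched as follows. This is a minimal illustration, not SunFire's LogFinder: it assumes every log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp and that lines are in ascending time order, and the names `find_offset` and `line_start_at_or_after` are hypothetical.

```python
import io
from datetime import datetime, timedelta

TS_FMT = "%Y-%m-%d %H:%M:%S"   # assumed timestamp prefix on every line
TS_LEN = 19                     # len("2016-11-11 00:00:00")


def line_start_at_or_after(f, pos):
    """Return the offset of the first line that starts at or after pos."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume the rest of the line containing byte pos-1
    return f.tell()


def find_offset(f, target, size):
    """Binary-search a time-ordered log for the first line with timestamp >= target.

    f is a binary file object and size its length in bytes. Returns a byte
    offset, or size when every line is older than target.
    """
    lo, hi = 0, size
    while lo < hi:
        mid = (lo + hi) // 2
        start = line_start_at_or_after(f, mid)
        if start >= size:
            hi = mid                      # no line starts at or after mid
            continue
        f.seek(start)
        ts = datetime.strptime(f.read(TS_LEN).decode("ascii"), TS_FMT)
        if ts >= target:
            hi = mid                      # answer is this line or an earlier one
        else:
            lo = start + 1                # answer lies strictly after this line
    return line_start_at_or_after(f, lo)


if __name__ == "__main__":
    # Hypothetical sample data: one line per second.
    begin = datetime(2016, 11, 11)
    log = b"".join(
        f"{(begin + timedelta(seconds=i)).strftime(TS_FMT)} order={i}\n".encode("ascii")
        for i in range(1000)
    )
    offset = find_offset(io.BytesIO(log), begin + timedelta(seconds=500), len(log))
    print(offset, log[offset:offset + 30])
```

Each probe reads one timestamp rather than scanning the file, so locating a position costs a number of seeks logarithmic in the file size.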

The Agent also handles log rotation and provides two query modes: a first query, which discovers the starting offset, and ordinary queries, which read sequentially from that offset.
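
One way to picture the two query modes together with rotation handling is a cursor that remembers which file it was reading and how far it got. The sketch below is an assumption about how such a reader could work, not the Agent's actual protocol: the `LogCursor` class, its method names, and the inode-based rotation check are all hypothetical.

```python
import os
import tempfile


class LogCursor:
    """Tracks the file (by inode) and byte offset a reader has consumed."""

    def __init__(self, path):
        self.path = path
        self.inode = None
        self.offset = 0

    def first_query(self, offset):
        """Offset discovery: pin the starting position, e.g. from a binary search."""
        self.inode = os.stat(self.path).st_ino
        self.offset = offset

    def ordinary_query(self, max_bytes=1 << 20):
        """Sequential read: return the complete lines appended since the last call."""
        st = os.stat(self.path)
        if st.st_ino != self.inode or st.st_size < self.offset:
            # The file was rotated (new inode) or truncated: restart at byte 0.
            # Simplification: any unread tail of the old file is skipped here.
            self.inode, self.offset = st.st_ino, 0
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            chunk = f.read(max_bytes)
        end = chunk.rfind(b"\n") + 1      # hold back a trailing partial line
        self.offset += end
        return chunk[:end].splitlines()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "app.log")
    with open(path, "wb") as f:
        f.write(b"first\nsecond\n")
    cursor = LogCursor(path)
    cursor.first_query(0)
    print(cursor.ordinary_query())
```

Holding back the trailing partial line means a record that is still being written is never delivered in two pieces.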

1.2 Computation

The computation layer consists of Map and Reduce components that process collected logs. It adopts a fully asynchronous design built on Akka, whose lightweight actors play the role of coroutines, to avoid thread‑pool contention and lock‑based bottlenecks.

All I/O uses non‑blocking NIO, so no thread sits idle waiting on a socket or file.

Tasks are grouped per core to maximize CPU utilization while keeping I/O off the critical path.

A plugin‑oriented, period‑driven scheduler generates a topology for each time slice, installing Map and Reduce coroutines across the cluster.
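
The asynchronous Map/Reduce flow described above can be sketched with Python's `asyncio`, used here as a stand-in for Akka. Everything in the example is illustrative: `pull_logs` fakes a non-blocking pull from an Agent, and the log format and function names are assumptions.

```python
import asyncio
from collections import Counter


async def pull_logs(host):
    """Stand-in for a non-blocking pull from one Agent (hypothetical data)."""
    await asyncio.sleep(0)  # yield to the event loop, as real network I/O would
    return [f"{host} status=200", f"{host} status=500", f"{host} status=200"]


async def map_task(hosts, outbox):
    """Pull and parse logs for a group of hosts, then emit one partial count."""
    partial = Counter()
    batches = await asyncio.gather(*(pull_logs(h) for h in hosts))
    for lines in batches:
        for line in lines:
            partial[line.rsplit("status=", 1)[1]] += 1
    await outbox.put(partial)


async def reduce_task(inbox, n_maps):
    """Merge exactly one partial result from each Map task."""
    total = Counter()
    for _ in range(n_maps):
        total += await inbox.get()
    return total


async def run_period(host_groups):
    """Run one Map task per host group and a single Reduce for this period."""
    inbox = asyncio.Queue()
    maps = [asyncio.create_task(map_task(group, inbox)) for group in host_groups]
    total = await reduce_task(inbox, len(maps))
    await asyncio.gather(*maps)
    return total


if __name__ == "__main__":
    print(asyncio.run(run_period([["host1", "host2"], ["host3"]])))
```

No task holds a thread while it waits: every pull and every queue operation is an `await`, so a single event loop can interleave many Map tasks.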

Period‑Driven Scheduling

Each period (for example, one minute) gets its own isolated task topology. Brain selects a leader, reads the user configurations, and builds a topology object containing the plugins, input sources, and Map/Reduce counts. Reduce then installs tasks on the Map nodes, which in turn launch coroutines that pull logs from the Agents, parse them, and forward the results upstream to Reduce.
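
A topology object of the kind described here might look like the sketch below. The field names, the shape of the configuration dictionaries, and the `inputs_per_map` sizing rule are assumptions made for illustration; the article specifies only that the topology holds plugins, input sources, and Map/Reduce counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    period_start: int      # epoch seconds, aligned to a period boundary
    period_seconds: int
    plugins: tuple
    inputs: tuple
    map_count: int
    reduce_count: int


def align_to_period(now, period_seconds=60):
    """Round an epoch timestamp down to the start of its period."""
    return now - now % period_seconds


def build_topology(configs, now, period_seconds=60, inputs_per_map=2):
    """Build the isolated topology for the period that contains `now`.

    configs is a list of dicts with hypothetical keys "plugin" and "input".
    """
    inputs = tuple(sorted({c["input"] for c in configs}))
    plugins = tuple(sorted({c["plugin"] for c in configs}))
    map_count = max(1, -(-len(inputs) // inputs_per_map))  # ceiling division
    return Topology(
        period_start=align_to_period(now, period_seconds),
        period_seconds=period_seconds,
        plugins=plugins,
        inputs=inputs,
        map_count=map_count,
        reduce_count=1,
    )
```

Because the object is immutable and keyed by its period start, two timestamps in the same minute yield equal topologies, while the next minute gets a fresh one.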

Task Retry and Fault Tolerance

Supervision is built into the topology: each upstream component monitors its downstream peers via Terminated events. If a Brain, Reduce, or Map instance fails, its supervisor recreates the missing component from the stored topology. A failure may therefore delay results, but it does not lose data.

Input Sharing

When multiple user configurations require the same log source, SunFire shares the input to avoid redundant pulls. This is achieved by:

Analyzing topology during installation to detect shared inputs.

Using consistent hashing to assign the same log source to the same Map node across tasks.
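
The two steps above can be sketched as follows: group configurations by log source, then place each source on a hash ring so that it always lands on the same Map node. This is a generic consistent-hashing sketch, not SunFire's code; `HashRing`, `assign_inputs`, the virtual-node count, and the configuration keys are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        """Return the node owning the first ring point clockwise from the key."""
        i = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._ring[i][1]


def assign_inputs(configs, ring):
    """Group configs by log source, then pin each source to one Map node.

    configs is a list of dicts with hypothetical keys "name" and "input".
    Returns {source: (map_node, [config names sharing that source])}.
    """
    by_source = {}
    for cfg in configs:
        by_source.setdefault(cfg["input"], []).append(cfg["name"])
    return {src: (ring.node_for(src), names) for src, names in by_source.items()}
```

With this layout, a source shared by several configurations is pulled once by one Map node, and removing a Map node only moves the sources that node owned.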

2. Other Components

Storage

Computation results are persisted in Alibaba’s HBase, which scales horizontally and serves low‑latency queries. User‑defined storage can also use MongoDB.

Visualization

The front‑end provides customizable dashboards built on a plugin architecture, enabling rapid creation of new monitoring products.

Self‑Management

OPS‑Agent and OPS‑Web automate the deployment, health checking, and capacity monitoring of millions of agents across the platform.

Conclusion

SunFire has been productized to give developers and operations teams monitoring at one‑second granularity, offering a flexible, low‑overhead, and highly reliable solution for Alibaba’s massive online services.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
