Building a High‑Performance Monitoring Alert System with Akka, Dubbo, and Ignite
The article outlines G Bank’s transition from a commercial monitoring suite with a single‑threaded collector to a self‑developed alert system built on open‑source components: Akka for parallel collection, Apache Dubbo for distributed processing, and Apache Ignite for in‑memory storage. The reported results are capacity for tens of millions of active alerts, processing latency under 100 ms, and linear scalability.
Background
Traditional monitoring in the bank used a commercial suite that performed alert collection with a single‑threaded process and stored alerts in an in‑memory database. Under alert storms the collector dropped data, the database blocked, and latency grew to minutes.
Problems
Data loss and processing blockage during high‑volume alert storms; latency up to minutes.
Simple processing logic could not handle complex, high‑concurrency scenarios.
Solution Overview
The new‑generation alert system is built entirely on open‑source components. Its goals are highly concurrent alert handling, flexible rule configuration, and full lifecycle management.
Alert Lifecycle Management
The system follows a closed‑loop lifecycle: generation & ingestion → pre‑processing → storage → notification → post‑recovery closure.
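The closed loop above can be read as a small state machine. The following is a minimal sketch, not the bank's implementation: the class name `AlertLifecycle`, the stage names, and the `advance` method are hypothetical labels chosen for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class AlertLifecycle {

    // Hypothetical stage names mirroring the lifecycle described in the text.
    public enum Stage { INGESTED, PREPROCESSED, STORED, NOTIFIED, CLOSED }

    // Each stage has exactly one successor; CLOSED is terminal and has none.
    private static final Map<Stage, Stage> NEXT = new EnumMap<>(Stage.class);
    static {
        NEXT.put(Stage.INGESTED, Stage.PREPROCESSED);
        NEXT.put(Stage.PREPROCESSED, Stage.STORED);
        NEXT.put(Stage.STORED, Stage.NOTIFIED);
        NEXT.put(Stage.NOTIFIED, Stage.CLOSED);
    }

    // Moves an alert to its next stage, rejecting transitions out of CLOSED.
    public static Stage advance(Stage current) {
        Stage next = NEXT.get(current);
        if (next == null) {
            throw new IllegalStateException("alert is already closed");
        }
        return next;
    }

    public static void main(String[] args) {
        Stage stage = Stage.INGESTED;
        while (stage != Stage.CLOSED) {
            stage = advance(stage);
            System.out.println(stage);
        }
    }
}
```

Keeping the transitions in one table makes it easy to reject out-of-order updates, for example a notification for an alert that was never stored.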
Core Functionalities
Unified ingestion and agile access for heterogeneous sources.
Reduced latency and timely notification.
Root‑cause recommendation and assistance.
Tracking, recovery verification and closure.
Key Architectural Features
The system acts as an alert Manager‑of‑Managers (MOM) and must ingest alerts from infrastructure, middleware, databases, cloud platforms, and business applications.
Technical Design
1. Akka‑Based Parallel Collection
Akka provides a high‑concurrency, distributed, fault‑tolerant runtime based on the Actor model. The collector consists of the following actors:
Data Collection Actor: obtains raw alerts, polling sources that must be queried and passively receiving from sources that push.
Raw Data Dispatch Actor: routes raw alerts to analysis actors and performs overall flow control.
Data Analysis Actor: a configurable pool of actors that execute user‑defined processing logic in parallel.
Persisted Data Dispatch Actor: forwards processed data to persistence actors and applies back‑pressure when the storage layer is slow.
Data Persistence Actor: a configurable set of actors that write alerts to the storage backend.
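The staged pipeline can be approximated without Akka to show how flow control and back‑pressure arise. The sketch below is a stand‑in built only on `java.util.concurrent`, not Akka code and not the article's implementation: bounded queues play the role of the two dispatch actors, and fixed thread pools play the role of the analysis and persistence actor pools. The class name, the `run` method, and the trim/upper‑case "analysis" step are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CollectorPipelineSketch {

    // Pushes raw alerts through an analysis pool and a persistence pool.
    // An empty Optional is the shutdown marker for a worker.
    public static List<String> run(List<String> rawAlerts, int analysisWorkers,
                                   int persistWorkers, int queueCapacity) {
        BlockingQueue<Optional<String>> rawQueue = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<Optional<String>> processedQueue = new ArrayBlockingQueue<>(queueCapacity);
        List<String> store = Collections.synchronizedList(new ArrayList<>());
        ExecutorService analysisPool = Executors.newFixedThreadPool(analysisWorkers);
        ExecutorService persistPool = Executors.newFixedThreadPool(persistWorkers);

        for (int i = 0; i < analysisWorkers; i++) {
            analysisPool.submit(() -> {
                try {
                    Optional<String> msg;
                    while ((msg = rawQueue.take()).isPresent()) {
                        // put() blocks while persistence lags: this is the back-pressure.
                        processedQueue.put(Optional.of(msg.get().trim().toUpperCase(Locale.ROOT)));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (int i = 0; i < persistWorkers; i++) {
            persistPool.submit(() -> {
                try {
                    Optional<String> msg;
                    while ((msg = processedQueue.take()).isPresent()) {
                        store.add(msg.get()); // stand-in for a write to the storage backend
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            // put() blocks while analysis lags: this is the collector-side flow control.
            for (String alert : rawAlerts) rawQueue.put(Optional.of(alert));
            for (int i = 0; i < analysisWorkers; i++) rawQueue.put(Optional.empty());
            analysisPool.shutdown();
            analysisPool.awaitTermination(30, TimeUnit.SECONDS);
            for (int i = 0; i < persistWorkers; i++) processedQueue.put(Optional.empty());
            persistPool.shutdown();
            persistPool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return store;
    }

    public static void main(String[] args) {
        List<String> stored = run(Arrays.asList(" cpu high ", "disk full", "link down"), 2, 2, 1);
        System.out.println(stored.size() + " alerts persisted");
    }
}
```

With a queue capacity of 1 the producer is throttled almost immediately, which makes the blocking behaviour easy to observe. In an actor system the same effect would come from mailbox limits or explicit demand signalling rather than blocking threads.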
2. Apache Dubbo Distributed Framework
Dubbo supplies high‑performance RPC, intelligent fault tolerance, load balancing, and automatic service registration/discovery. Two services are exposed:
Data Processing Service: CRUD APIs called by collectors and other applications, including operations such as alert compression and recovery.
Data Synchronization Service: periodic and incremental backup between the primary and backup clusters.
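Dubbo services are declared as plain Java interfaces, so the Data Processing Service could look roughly like the sketch below. Everything here is hypothetical: the interface `AlertProcessingService`, its method names, and the `Alert` fields are not taken from the article. The sketch also assumes that "compression" means folding repeated reports of the same alert into one record with a counter. The Dubbo registration itself is omitted, and an in‑memory implementation stands in for the real backend.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class AlertServiceSketch {

    // Immutable alert record; field names are illustrative only.
    public static final class Alert {
        public final String key;
        public final String message;
        public final int count;        // how many duplicate reports were compressed into this alert
        public final boolean recovered;

        public Alert(String key, String message, int count, boolean recovered) {
            this.key = key;
            this.message = message;
            this.count = count;
            this.recovered = recovered;
        }
    }

    // Hypothetical service contract that an RPC framework could export.
    public interface AlertProcessingService {
        Alert report(String key, String message); // create, or compress into an existing alert
        Optional<Alert> find(String key);
        boolean recover(String key);              // mark as recovered; false if the key is unknown
        boolean delete(String key);
    }

    // In-memory stand-in for the real implementation.
    public static final class InMemoryAlertProcessingService implements AlertProcessingService {
        private final Map<String, Alert> store = new ConcurrentHashMap<>();

        @Override
        public Alert report(String key, String message) {
            // merge() is atomic per key, so concurrent duplicates are counted correctly.
            return store.merge(key, new Alert(key, message, 1, false),
                    (old, fresh) -> new Alert(key, message, old.count + 1, false));
        }

        @Override
        public Optional<Alert> find(String key) {
            return Optional.ofNullable(store.get(key));
        }

        @Override
        public boolean recover(String key) {
            return store.computeIfPresent(key,
                    (k, a) -> new Alert(a.key, a.message, a.count, true)) != null;
        }

        @Override
        public boolean delete(String key) {
            return store.remove(key) != null;
        }
    }

    public static void main(String[] args) {
        AlertProcessingService svc = new InMemoryAlertProcessingService();
        svc.report("db01/cpu", "CPU above 90%");
        Alert merged = svc.report("db01/cpu", "CPU above 90%");
        System.out.println("count after duplicate: " + merged.count);
        System.out.println("recovered: " + svc.recover("db01/cpu"));
    }
}
```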
3. APP‑Based Processing for High Configurability
Each processing node runs modular APP containers. An APP represents a logical processing unit (e.g., maintenance window handling, enrichment, notification). APPs can be hot‑plugged, developed with scripts or Scala/Java, and support graceful upgrade.
Stream APP: runs on every node and processes real‑time alerts that match its criteria.
Scheduled Batch APP: a single instance, scheduled by the cluster’s scheduler, that processes a batch of alerts at a fixed interval.
Subscription Batch APP: subscribes to the output of Stream or Scheduled Batch APPs for further aggregation.
Broadcast Batch APP: runs on all nodes and processes the data a scheduler assigns to each node, for distributed batch work.
Restful APP: dynamically generates REST endpoints to expose internal APP data.
APP containers support hot‑swap, script‑to‑bytecode compilation via ANTLR and Java dynamic compilation, and a graceful stop‑start in which an APP being updated finishes its in‑flight processing before shutting down.
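One way to get the graceful stop‑start behaviour is to guard the active APP with a read/write lock: each alert is processed under the read lock, and an upgrade takes the write lock, which is only granted once all in‑flight processing has returned. This is a minimal sketch under that assumption. The article does not say how its containers implement the swap, and the `App` interface and `HotSwapAppSlot` class are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class HotSwapAppSlot {

    // Hypothetical APP contract: process one alert, release resources on stop.
    public interface App {
        String process(String alert);
        default void stop() { }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private App current; // guarded by lock

    public HotSwapAppSlot(App initial) {
        this.current = initial;
    }

    // Many alerts may be processed concurrently under the shared read lock.
    public String handle(String alert) {
        lock.readLock().lock();
        try {
            return current.process(alert);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is granted only after every in-flight handle() call has
    // finished, so the old APP is stopped with no work still running in it.
    public void upgrade(App next) {
        lock.writeLock().lock();
        try {
            App old = current;
            current = next;
            old.stop();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        HotSwapAppSlot slot = new HotSwapAppSlot(alert -> "v1:" + alert);
        System.out.println(slot.handle("disk full")); // prints "v1:disk full"
        slot.upgrade(alert -> "v2:" + alert);
        System.out.println(slot.handle("disk full")); // prints "v2:disk full"
    }
}
```

The trade‑off of this design is that new alerts wait briefly while the swap holds the write lock. A container that must not pause could instead route new alerts to the new version immediately and drain the old one in the background.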
4. Apache Ignite Distributed In‑Memory Storage
Ignite provides a partitioned, distributed in‑memory cache across five nodes (each 128 GB). Caches use Ignite's ATOMIC atomicity mode, which forgoes transactional guarantees in exchange for higher throughput.
SQL tables for active alerts, historical alerts, notification archives, and configuration data.
Key‑value caches for lookup data (CMDB, resource metadata) used during enrichment and pre‑processing.
Per‑node memory allocation: 16 GB for active alerts, 8 GB for resource data, 52 GB for history, and 16 GB for notifications.
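The per‑node figures can be checked against the node size with simple arithmetic. The sketch below assumes that "128 GB" refers to each node's physical memory and that the four allocations are the only large consumers. The class and method names are illustrative, and the cluster total ignores any backup copies of partitions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IgniteMemoryBudget {

    public static final int NODES = 5;      // from the article
    public static final int NODE_GB = 128;  // assumed to be physical memory per node

    // Per-node allocations in GB, as listed in the article.
    public static Map<String, Integer> regionsGb() {
        Map<String, Integer> regions = new LinkedHashMap<>();
        regions.put("active", 16);
        regions.put("resource", 8);
        regions.put("history", 52);
        regions.put("notification", 16);
        return regions;
    }

    // Sum of all allocations on one node.
    public static int perNodeGb() {
        int total = 0;
        for (int gb : regionsGb().values()) {
            total += gb;
        }
        return total;
    }

    // Raw allocation across the whole cluster.
    public static int clusterGb() {
        return perNodeGb() * NODES;
    }

    // Memory left on each node for the JVM heap, the OS, and everything else.
    public static int headroomGb() {
        return NODE_GB - perNodeGb();
    }

    public static void main(String[] args) {
        System.out.println("per node: " + perNodeGb() + " GB");   // 92 GB
        System.out.println("cluster:  " + clusterGb() + " GB");   // 460 GB
        System.out.println("headroom: " + headroomGb() + " GB");  // 36 GB
    }
}
```

Under these assumptions each node commits 92 GB of its 128 GB to alert data, leaving 36 GB of headroom, and the five nodes together allocate 460 GB.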
Performance Results
Active alert capacity: tens of millions (≈200× previous system).
Historical storage: billions of records.
Write throughput: 11 653 ops/s (≈10× previous).
Alert processing latency: <100 ms (30‑50× improvement).
Scalability: +2 000 ops/s per additional server.
Future Directions
Micro‑service‑based ingestion with webhook interfaces for easier integration.
AI‑driven root‑cause analysis and alert convergence.
Deeper correlation of alerts with performance, configuration, and KPI data.
dbaplus Community
