
How to Build a High‑Performance Unified Monitoring & Alerting Platform

This article outlines a comprehensive design for a high‑performance, unified operations monitoring platform, detailing a six‑layer architecture, the roles of data collection (using Ganglia), data extraction, and alerting modules (with Centreon), and provides practical integration tips, deployment diagrams, and Q&A for large‑scale environments.

Editor’s Note: We have previously read Mr. Gao Junfeng’s books; today we see his presentation, which is logically rigorous and clearly structured. Starting from the design of a monitoring system, he demonstrates how a high‑performance monitoring platform should be architected and layered. Modern service architecture emphasizes modularity, asynchronous processing, layered design, low coupling, and high cohesion; this article showcases a clear responsibility‑driven design for reference.

Guest Introduction:

Gao Junfeng (South Africa Ant) Senior Linux technology expert, author of best‑selling books “Step‑by‑Step Linux” and “Practical High‑Performance Linux Server Construction”. Formerly at Sina and Wanwang, with extensive experience in automation, operations, Linux, clustering, MySQL, Oracle, system management, performance tuning, planning, and design. Currently focuses on Hadoop data platforms and related ecosystem operations, monitoring, deployment, and optimization.

Preface

Hello, I am Ant from iVei Linux (South Africa Ant). Today I will share how to build a unified operations monitoring platform.

Monitoring is the core of operations. A good monitoring platform greatly assists operations work. The key question is how to construct a complete monitoring platform.

In my view, the core of operations consists of monitoring and fault handling. Accurate, comprehensive monitoring of business systems ensures problems are detected and notified promptly.

Problems are not scary; the real danger is not discovering them for a long time and having customers notice the failure.

Design Concept of a Unified Monitoring & Alert Platform

To build an intelligent operations monitoring platform, we must focus on runtime monitoring and fault alerting, incorporating network, hardware, software, and database resources from all business systems into a single platform.

By eliminating differences in management software and data collection methods, we achieve unified management, standardization, processing, presentation, user login, and permission control, ultimately realizing standardized, automated, and intelligent large‑scale operations management.

Six Layers of an Intelligent Operations Monitoring Platform

The architecture can be divided into six layers and three major modules (see diagram).

Data Collection Layer: Bottom layer that gathers network, business, database, and OS data, normalizes it, and stores it.

Data Presentation Layer: Web interface that visualizes collected data as curves, bar charts, pie charts, etc., helping operators understand system status and trends.

Data Extraction Layer: Filters and processes data from the collection layer, extracting the information needed for the monitoring & alerting modules.

Alert Rule Configuration Layer: Sets alert rules, thresholds, contacts, and notification methods based on extracted data.

Alert Event Generation Layer: Records alert events in real time, stores results in a database, and generates analysis reports to track fault rates and trends.

User Presentation Management Layer: Top‑most web interface that displays monitoring statistics and alerts, supporting multi‑user and multi‑permission management.

Three Major Modules

The six layers are functionally grouped into three modules: data collection, data extraction, and monitoring & alerting.

Data Collection Module: Performs basic data gathering and visualization. Collection methods include SNMP, agents, or custom scripts; this article uses Ganglia.

Data Extraction Module: Filters and extracts needed data from the collection module, either via provided interfaces or custom scripts.

Monitoring & Alerting Module: Configures monitoring scripts, alert rules, thresholds, contacts, and centralizes alert results. Common tools include Nagios and Centreon.

Platform Overview

The platform consists of three main parts: data collection, data extraction, and monitoring & alerting. The extraction module bridges the other two, while one or more collection servers gather metrics and store them. The alerting module consumes extracted data, applies thresholds, and sends notifications via SMS, email, or custom plugins.

Ganglia as the Data Collection Module

Ganglia is a scalable distributed monitoring system designed for HPC clusters. It collects CPU, memory, disk, I/O, and network metrics via the gmond daemon on each node, aggregates them with gmetad, stores data with rrdtool, and presents historical trends via a PHP web interface.

Key Features of Ganglia

Flexible distributed, hierarchical architecture that supports thousands of nodes with stable performance; it can be deployed by region or layer, and nodes can be added or removed dynamically without impact.

Accurate data collection with real‑time graphs and historical statistics, enabling performance tuning, scaling, and capacity planning.

Supports both multicast and unicast transmission, reducing load in large deployments.

Collects metrics for CPU, memory, disk, I/O, processes, and network; provides C and Python APIs for custom metric plugins.

Custom gmond modules are available at https://github.com/ganglia/gmond_python_modules
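As a sketch of what such a custom plugin looks like, here is a minimal gmond Python metric module in the style of the repository above. The metric name and the /proc/net/tcp parsing are illustrative choices, not part of the original talk; gmond's mod_python host calls metric_init() once at startup and then invokes each descriptor's call_back at its collection interval.

```python
# Sketch of a custom gmond Python metric module (illustrative metric).
# gmond's mod_python calls metric_init() once, then the call_back
# for each configured metric at its collection interval.

def tcp_established(name):
    """Count ESTABLISHED TCP connections by reading /proc/net/tcp."""
    count = 0
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header line
        for line in f:
            # field 4 (index 3) is the socket state; '01' means ESTABLISHED
            if line.split()[3] == '01':
                count += 1
    return count

def metric_init(params):
    """Return the metric descriptors this module provides."""
    return [{
        'name': 'tcp_established',
        'call_back': tcp_established,
        'time_max': 90,
        'value_type': 'uint',
        'units': 'connections',
        'slope': 'both',
        'format': '%u',
        'description': 'ESTABLISHED TCP connections',
        'groups': 'network',
    }]

def metric_cleanup():
    """Called once when gmond shuts the module down; nothing to release here."""
    pass
```

Dropped into gmond's Python module directory with a matching .pyconf stanza, the metric then flows through the normal collection path like any built-in one.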

Ganglia is preferred over Cacti/Zabbix for large‑scale data collection due to its accuracy and low overhead. In our production environment monitoring over 10,000 servers across three data centers, alert latency is typically around 10 seconds.

Centreon as the Monitoring & Alerting Module

While Ganglia gathers data, operators need automated alerting. Centreon, built on top of Nagios, provides a powerful distributed IT monitoring system with the following advantages:

Open‑source and free to use.

Deep integration with Nagios; Centreon reads data from Nagios‑written databases and displays it via a web UI.

Provides a web‑based configuration interface for managing Nagios settings.

Supports numerous plugins (NRPE, SNMP, NSClient, etc.) to build a distributed monitoring and alerting system.

Centreon writes collected metrics to a database, offers a web UI for host/service configuration, supports multiple notification channels (SMS, email, custom scripts), and stores alert history for analysis.
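To make the "custom scripts" notification channel concrete, here is a hedged sketch of a script that a Centreon/Nagios notification command could invoke. The argument ordering, sender address, and local MTA are assumptions for illustration; in practice these come from the notification command definition and site mail setup.

```python
#!/usr/bin/env python
# Hedged sketch of a custom notification script for a Centreon/Nagios
# notification command. Argument order, sender address, and the local
# SMTP relay are assumptions, not part of Centreon itself.
import smtplib
import sys
from email.mime.text import MIMEText

def build_message(contact, host, service, state, output):
    """Render a plain-text alert email from Nagios-style macro values."""
    body = ("Host: %s\nService: %s\nState: %s\nDetail: %s\n"
            % (host, service, state, output))
    msg = MIMEText(body)
    msg['Subject'] = '[%s] %s/%s' % (state, host, service)
    msg['To'] = contact
    msg['From'] = 'monitor@example.com'   # assumed sender address
    return msg

def main(argv):
    # Expected call: notify.py CONTACTEMAIL HOSTNAME SERVICEDESC SERVICESTATE SERVICEOUTPUT
    contact, host, service, state, output = argv[1:6]
    msg = build_message(contact, host, service, state, output)
    with smtplib.SMTP('localhost') as smtp:  # assumed local MTA
        smtp.send_message(msg)

if __name__ == '__main__' and len(sys.argv) >= 6:
    main(sys.argv)
```

The same structure works for SMS gateways: only the delivery step in main() changes, while the macro-to-message mapping stays identical.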

Seamless Integration of Ganglia and Centreon

Ganglia excels at data collection, while Nagios (and thus Centreon) excels at alerting. Combining them leverages their strengths:

Ganglia lacks built‑in alert notifications; Nagios provides them.

Nagios lacks distributed data collection; Ganglia provides it.

Nagios has limited reporting; Ganglia offers rich graphs.

Ganglia’s APIs allow its metrics to be fed into Nagios.

The chosen solution uses Ganglia for collection and Centreon for alerting. The remaining challenge is transferring collected data to the alerting module, which is the role of the data extraction module.

Functions of the Data Extraction Module

The module periodically pulls specified metrics from the collection module, compares them against configured thresholds, and triggers alerts via the monitoring & alerting module when thresholds are breached.

All configuration (collection intervals, thresholds, notification methods, contacts) resides in the monitoring & alerting module; the extraction module merely bridges data.

Implementation typically involves custom development on top of Ganglia, often using a Python script. Ready‑made scripts are available:

PHP version: http://www.iivey.com/ganglia/check_ganglia_metric.php.txt
Python version: http://www.iivey.com/ganglia/check_ganglia_metric.py.txt
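A minimal Python sketch in the same spirit as those scripts: pull the XML dump that gmetad serves on its TCP port (8651 by default), locate one host's metric, and map the value to a Nagios exit code. The threshold semantics ("higher is worse") and the omission of option parsing are simplifications for illustration.

```python
# Sketch of a check_ganglia_metric-style extraction check: query gmetad's
# XML port, find one host's metric, return a Nagios status code.
import socket
import sys
import xml.etree.ElementTree as ET

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def fetch_xml(server, port=8651, timeout=10):
    """Read the full XML dump gmetad serves on its TCP port."""
    chunks = []
    with socket.create_connection((server, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks)

def metric_value(xml_bytes, hostname, metric):
    """Find VAL for one metric on one host in the gmetad XML tree."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter('HOST'):
        if host.get('NAME') == hostname:
            for m in host.iter('METRIC'):
                if m.get('NAME') == metric:
                    return float(m.get('VAL'))
    return None  # host or metric not present

def check(value, warn, crit):
    """Map a numeric value to a Nagios state (assumes higher is worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK
```

Nagios/Centreon would run this as an ordinary check command and consume its exit code, so all thresholds and contacts stay configured on the alerting side, exactly as the extraction-module design above requires.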

Unified Monitoring System Architecture Diagram

Each data center (Cluster1‑N) runs a Gmond daemon on its nodes, aggregating metrics to a Ganglia proxy where gmetad stores data. Both proxy and nodes can load custom C/Python plugins. A Manager Server collects data from all centers, runs the extraction module to integrate Ganglia and Centreon, and provides a high‑availability backup.

Ganglia Data Flow Diagram

Data flow steps:

gmond collects local metrics and exchanges them with other gmond instances via UDP (XDR format).

gmond supports both unicast and multicast transmission.

gmetad periodically pulls XML data from gmond or other gmetad nodes over TCP.

gmetad can also receive XML from peer gmetad instances.

gmetad stores the data in RRD databases.

Nagios (via the extraction module) monitors Ganglia data and generates alerts.

The web UI retrieves data from gmetad and renders graphs from the RRD files.

With this architecture, a complete operations monitoring platform has been running stably for over three years, monitoring more than 12,000 servers.

Q&A

Is gmetad a single‑node component? What configuration is needed for monitoring 10,000 hosts?

gmetad is the core data collector and should be deployed in at least a primary‑secondary pair. For 10,000 hosts across three data centers, a distributed gmetad setup is recommended; Ganglia itself supports hierarchical data aggregation.
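As an illustration of that hierarchical setup, the top-level gmetad.conf on the Manager Server might list each data center's own gmetad as a data_source (hostnames and polling intervals here are examples, not the production values):

```
# gmetad.conf on the top-level Manager Server (hostnames illustrative).
# Each data_source points at one data center's gmetad XML port (8651),
# so the top level aggregates pre-summarized data instead of polling
# every gmond directly.
data_source "dc1" 60 gmetad-dc1.example.com:8651
data_source "dc2" 60 gmetad-dc2.example.com:8651
data_source "dc3" 60 gmetad-dc3.example.com:8651
```

Listing two hosts per data_source line gives a failover source for each data center, which pairs naturally with the primary‑secondary gmetad deployment recommended above.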

Are alert policies limited to greater‑than or less‑than thresholds?

Centreon provides flexible policies: greater‑than, less‑than, equal, retry counts, intervals, etc., all configurable to suit specific needs.

What monitoring items are needed for middleware and databases?

Monitoring items are customized per business requirements and typically fall into three categories: underlying system data, business‑logic‑independent data, and business‑logic‑related data.

How does gmond transmit data in XDR format?

gmond supports both unicast (reporting to a designated parent node) and multicast (peer‑to‑peer exchange). Metric packets travel over UDP in XDR format; the aggregated state can be viewed as XML by connecting to gmond's TCP port (8649 by default) with telnet.

Does the platform support both Windows and Linux?

Yes, it supports full‑platform monitoring for various operating systems, network devices, and switches.

Is there a dedicated team maintaining the monitoring system?

Our monitoring platform is maintained by an operations team that also performs secondary development.

Does the platform support monitoring of metric fluctuations?

Yes; Ganglia’s UDP‑based data collection provides rapid updates, making it suitable for distributed multi‑data‑center monitoring where network fluctuations can be accounted for via parameter tuning.

How many monitoring items and alerts are generated for over 10,000 machines?

We monitor over 10,000 items; each full metric dump (XML) is about 20‑30 MB. For Hadoop‑centric workloads we recommend unicast collection to reduce network load.

How is alert latency kept around 10 seconds, and what about resource‑heavy monitoring points?

gmond agents have minimal overhead. The bottleneck is at aggregation points, which require high‑performance CPUs and disks. Time‑out mechanisms can be set for slow‑responding checks, especially across data‑center links.

How many people are in the monitoring and development teams?

Six people: 2 in development/operations, 3 in system (business) operations, and 1 in network operations.

Tags: Monitoring, Operations, Alerting, Infrastructure, Centreon, Ganglia
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
