Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading
This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.
Background: Enterprise‑grade data clusters often handle petabytes of data and thousands of heterogeneous jobs, making maintenance challenging; SRE must implement comprehensive monitoring, risk prediction, and proactive issue analysis.
01 Big Data Monitoring System – Monitoring is positioned as a bridge across the entire stack, covering not only traditional infrastructure metrics but also data quality, task status, component performance, trend forecasting, and comparative analysis.
The monitoring system is divided into seven dimensions, each with distinct metrics and collection methods; collection generally follows one of three patterns:
Continuous state reading (e.g., CPU, memory) with threshold‑based alerts.
Time‑series aggregation (e.g., HDFS storage increment) triggering alerts on abnormal slopes.
Correlated monitoring where multiple conditions must be met before alerting.
Open‑source tools alone cannot cover all needs, so custom components and scripts are integrated.
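The three collection patterns above can be sketched as custom scripts of the kind the article describes. The example below shows the slope-based pattern: fit a linear trend to a window of (timestamp, value) samples and alert on abnormal growth. The 500 GB/hour threshold is a hypothetical value for illustration; the article does not specify one.

```python
# Slope-based alerting on a storage time series (e.g., HDFS usage).
# The threshold below is a hypothetical example value.

def slope(points):
    """Least-squares slope of (timestamp_sec, value) pairs, in value units per second."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den

def abnormal_growth(points, max_gb_per_hour=500):
    """Fire when usage (in GB) grows faster than max_gb_per_hour."""
    gb_per_hour = slope(points) * 3600
    return gb_per_hour > max_gb_per_hour
```

The same shape covers the other two patterns: continuous-state checks compare the latest sample to a threshold, and correlated checks AND several such conditions together before alerting.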
02 Basic Monitoring
Basic monitoring consists of four parts: low‑level infrastructure, service‑status, component‑performance, and runtime monitoring. The architecture is illustrated below.
Key practices include grouping hosts, reusing alert templates across departments, and adjusting thresholds per workload (e.g., CPU 65 % for business servers vs. 90 % for big‑data nodes).
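Per-group thresholds like those above can be kept in a small lookup reused across alert templates. The 65 %/90 % figures come from the article; the group names and default value below are assumptions for illustration.

```python
# Hypothetical per-group CPU thresholds; 65%/90% are the article's
# example values, the group names and default are invented.
CPU_THRESHOLDS = {
    "business": 65.0,
    "bigdata": 90.0,
}

def cpu_alert(host_group, cpu_percent, default=80.0):
    """Fire when CPU usage exceeds the host group's threshold."""
    return cpu_percent > CPU_THRESHOLDS.get(host_group, default)
```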
02.2 Cluster Performance Monitoring
Build a Flask exporter to pull component metrics via APIs.
Define alert rules in Prometheus and route the resulting notifications through Alertmanager.
Deploy a custom notification service ("ZhiYinLou") for phone alerts.
Connect Prometheus as a data source to Grafana for dashboards.
Resulting dashboards provide visual insight into cluster health.
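The exporter step above boils down to rendering collected component metrics in the Prometheus text exposition format for Prometheus to scrape. The sketch below shows only that rendering, kept dependency-free; in the article's setup a Flask route would serve this output at `/metrics`. The metric names are hypothetical.

```python
# Minimal sketch of the pull-style exporter's output: Prometheus text
# exposition format. Metric names are hypothetical examples; in the
# article this string would be returned from a Flask /metrics route.

def render_metrics(metrics):
    """Render {name: (help_text, value)} as Prometheus gauges in text format."""
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

A Flask app would wrap this as `Response(render_metrics(fetch()), mimetype="text/plain")`, where `fetch()` pulls current values from the component APIs.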
02.3 Runtime Monitoring
A lightweight Python client periodically interacts with core services (HDFS, Hive, etc.) performing basic CRUD operations to verify core functionality; it acts as an independent health‑check loop that catches failures missed by other monitors.
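One pass of that health-check loop might look like the sketch below. The client interface and probe are assumptions: a real check would issue WebHDFS or HiveServer2 calls and page on failure or timeout, but the write/read/delete round trip is the core idea.

```python
# Sketch of one pass of the runtime health-check loop. The client
# interface (write/read/delete) and the probe path are hypothetical.

def check_hdfs_rw(client):
    """Write a small marker file, read it back, then clean up."""
    path = "/tmp/_healthcheck"
    client.write(path, b"ok")
    data = client.read(path)
    client.delete(path)
    return data == b"ok"

def run_checks(checks, clients):
    """Run every probe once; a raised exception counts as unhealthy."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = check(clients[name])
        except Exception:
            results[name] = False
    return results
```

Because the loop exercises real service calls end to end, it catches failures (e.g., a NameNode that answers RPCs but cannot write) that threshold-based monitors miss.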
03 Monitoring Upgrade
Beyond basic health checks, upgraded monitoring adds cluster‑level metrics, trend forecasting, and task‑level analysis, requiring data collection, aggregation, and analytics using ES, Kafka, Logstash, Flink, Hive, and custom Flask processing.
Outputs include weekly CPU/memory trends, storage‑capacity forecasts, and per‑task resource usage analyses.
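A storage-capacity forecast of the kind listed above can be as simple as extrapolating a linear trend to the capacity line. The article does not specify its forecasting method, so the least-squares model and sample figures below are illustrative only.

```python
# Illustrative storage-capacity forecast: fit a linear trend to daily
# usage samples and extrapolate to the capacity line. The model choice
# is an assumption; the article does not describe its method.

def days_until_full(daily_usage_tb, capacity_tb):
    """Estimate days until capacity from a least-squares linear trend."""
    n = len(daily_usage_tb)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_tb) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_tb))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den  # TB per day
    if slope <= 0:
        return None  # usage flat or shrinking; no exhaustion forecast
    return (capacity_tb - daily_usage_tb[-1]) / slope
```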
04 Alarm Convergence and Grading
An effective alert should demand immediate attention; anything else should be suppressed or converged. The current practice groups alerts by host clusters and templates, achieving roughly 60 % convergence, with plans to incorporate AIOps techniques for deeper analysis.
Alert grading distinguishes P0 (critical, phone + immediate notification) from P1 (core group notification) and P2+ (aiming for ≥80 % convergence), reducing noise and focusing on actionable incidents.
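The grading and convergence scheme can be sketched as two small functions: one routing an alert by priority, one collapsing duplicates that share a grouping key. The P0/P1/P2 levels come from the article; the routing channels and the (cluster, template) grouping key are assumptions based on the practices it describes.

```python
# Illustrative alert grading and convergence. P0/P1/P2 levels are from
# the article; channel names and the grouping key are assumptions.

def grade(alert):
    """Map an alert's priority level to a notification channel."""
    routes = {"P0": "phone", "P1": "core-group", "P2": "digest"}
    return routes.get(alert["level"], "digest")

def converge(alerts):
    """Collapse alerts sharing (cluster, template) into one group each."""
    groups = {}
    for a in alerts:
        groups.setdefault((a["cluster"], a["template"]), []).append(a)
    return groups
```

Grouping before notification is what turns a flood of per-host alerts into one actionable incident per cluster and template.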
The article concludes that a robust monitoring and alerting framework is essential for large‑scale big‑data operations and hints at future deep‑dive posts on upgraded monitoring implementations.
TAL Education Technology
TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.