
Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

TAL Education Technology

Background: Enterprise‑grade data clusters often handle petabytes of data and thousands of heterogeneous jobs, making maintenance challenging; SRE must implement comprehensive monitoring, risk prediction, and proactive issue analysis.

01 Big Data Monitoring System

Monitoring is positioned as a bridge across the entire stack, covering not only traditional infrastructure metrics but also data quality, task status, component performance, trend forecasting, and comparative analysis.

The monitoring system is divided into seven dimensions, each with distinct metrics and collection methods. Alerting on those metrics follows three patterns:

Continuous state reading (e.g., CPU, memory) with threshold-based alerts.

Time-series aggregation (e.g., HDFS storage increment) that triggers alerts on abnormal slopes.

Correlated monitoring, where multiple conditions must hold simultaneously before an alert fires.

Open‑source tools alone cannot cover all needs, so custom components and scripts are integrated.
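The three alerting patterns above can be sketched as simple predicates; the thresholds, window sizes, and function names here are illustrative assumptions, not production values.

```python
def threshold_alert(cpu_pct, limit=90.0):
    # Continuous state reading: fire when the latest sample crosses a limit.
    return cpu_pct > limit

def slope_alert(hdfs_used_tb, max_daily_growth_tb=5.0):
    # Time-series aggregation: fire on an abnormal slope, here the average
    # daily increment over the sampled window.
    if len(hdfs_used_tb) < 2:
        return False
    growth = (hdfs_used_tb[-1] - hdfs_used_tb[0]) / (len(hdfs_used_tb) - 1)
    return growth > max_daily_growth_tb

def correlated_alert(conditions):
    # Correlated monitoring: fire only when every condition holds at once,
    # which suppresses alerts from any single noisy signal.
    return all(conditions)
```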

02 Basic Monitoring

Basic monitoring consists of four parts: low‑level infrastructure, service‑status, component‑performance, and runtime monitoring. The architecture is illustrated below.

Key practices include grouping hosts, reusing alert templates across departments, and adjusting thresholds per workload (e.g., CPU 65 % for business servers vs. 90 % for big‑data nodes).
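The template-reuse idea can be sketched as one shared alert rule parameterized per host group; the group names and the fallback threshold are assumptions for illustration.

```python
# Per-group CPU thresholds from the text: 65% for business servers,
# 90% for big-data nodes. The fallback value is an assumption.
CPU_THRESHOLDS = {"business": 65.0, "bigdata": 90.0}

def cpu_alerting(host_group, cpu_pct, thresholds=CPU_THRESHOLDS):
    # One alert template applies everywhere; only the threshold
    # looked up for the host's group differs.
    return cpu_pct > thresholds.get(host_group, 80.0)
```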

02.2 Cluster Performance Monitoring

Build a Flask exporter to pull component metrics via APIs.

Define alert rules in Prometheus and route the resulting notifications through Alertmanager.

Deploy a custom notification service ("ZhiYinLou") for phone alerts.

Connect Prometheus as a data source to Grafana for dashboards.

Resulting dashboards provide visual insight into cluster health.
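The exporter step above can be sketched as follows. The article's exporter is built on Flask; to keep this sketch self-contained it uses only the standard library, and the NameNode JMX URL, metric names, and port are illustrative assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Illustrative JMX endpoint; a real deployment would use its own NameNode host.
NAMENODE_JMX = "http://namenode:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"

def scrape_namenode(jmx_url=NAMENODE_JMX):
    # Pull component metrics via the component's own HTTP API.
    with urlopen(jmx_url, timeout=5) as resp:
        return json.load(resp)

def to_prometheus(jmx_json):
    # Render selected JMX fields in the Prometheus text exposition format.
    bean = jmx_json["beans"][0]
    lines = [
        f"hdfs_capacity_used_bytes {bean['CapacityUsed']}",
        f"hdfs_capacity_total_bytes {bean['CapacityTotal']}",
    ]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = to_prometheus(scrape_namenode()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

def main(port=9200):
    # Prometheus then scrapes http://<host>:9200/ as a target.
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```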

02.3 Runtime Monitoring

A lightweight Python client periodically interacts with core services (HDFS, Hive, etc.) performing basic CRUD operations to verify core functionality; it acts as an independent health‑check loop that catches failures missed by other monitors.
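A minimal sketch of such a health-check loop is shown below, assuming the checks shell out to service CLIs; the specific commands, paths, interval, and alert hook are illustrative assumptions, not the article's actual client.

```python
import subprocess
import time

CHECKS = {
    # Each check is a shell command that must exit 0 for the service to pass.
    "hdfs_write": "hdfs dfs -touchz /tmp/healthcheck && hdfs dfs -rm -skipTrash /tmp/healthcheck",
    "hive_query": "hive -e 'SELECT 1'",
}

def run_check(cmd, timeout=60):
    # A check fails if the command exits non-zero or hangs past the timeout.
    try:
        subprocess.run(cmd, shell=True, check=True, timeout=timeout,
                       capture_output=True)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

def summarize(results):
    # Return the names of failed checks so the caller can page on any failure.
    return [name for name, ok in results.items() if not ok]

def loop(interval=300):
    # Independent loop: it exercises core functionality end to end rather
    # than reading metrics, catching failures other monitors miss.
    while True:
        results = {name: run_check(cmd) for name, cmd in CHECKS.items()}
        for name in summarize(results):
            print(f"ALERT runtime check failed: {name}")  # notification hook
        time.sleep(interval)
```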

03 Monitoring Upgrade

Beyond basic health checks, upgraded monitoring adds cluster-level metrics, trend forecasting, and task-level analysis. This requires collecting, aggregating, and analyzing data with Elasticsearch, Kafka, Logstash, Flink, Hive, and custom Flask-based processing services.

Outputs include weekly CPU/memory trends, storage‑capacity forecasts, and per‑task resource usage analyses.
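A storage-capacity forecast of the kind described can be sketched as a least-squares linear fit over daily usage samples; the input shape and the assumption of linear growth are simplifications for illustration.

```python
def fit_line(xs, ys):
    # Ordinary least-squares fit: returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def days_until_full(daily_used_tb, capacity_tb):
    # Fit the growth trend and project when usage crosses total capacity.
    xs = list(range(len(daily_used_tb)))
    slope, intercept = fit_line(xs, daily_used_tb)
    if slope <= 0:
        return None  # usage is flat or shrinking; no exhaustion forecast
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - xs[-1])
```

For example, with usage growing 10 TB/day from 10 TB toward a 100 TB cluster, the forecast gives seven days of headroom from the last sample.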

04 Alarm Convergence and Grading

An effective alert should either demand immediate attention or be suppressed entirely; anything in between is noise. Current practice groups alerts by host cluster and alert template, achieving roughly 60% convergence, with plans to incorporate AIOps techniques for deeper analysis.

Alert grading distinguishes P0 (critical: phone call plus immediate notification) from P1 (notification to the core on-call group) and P2 and below (targeted for ≥80% convergence), reducing noise and keeping attention on actionable incidents.
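The convergence-plus-grading flow above can be sketched as grouping raw alerts by (cluster, template) and routing each collapsed incident by its worst severity; the grouping key, routing channels, and field names are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative routing per grade, following the P0/P1/P2 split in the text.
ROUTES = {"P0": ["phone", "im"], "P1": ["im_core_group"], "P2": ["daily_digest"]}

def converge(alerts):
    # Collapse alerts sharing a host cluster and alert template into one
    # incident, keeping the worst severity ("P0" sorts before "P1").
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["cluster"], a["template"])].append(a)
    return [
        {"cluster": c, "template": t, "count": len(v),
         "severity": min(a["severity"] for a in v)}
        for (c, t), v in groups.items()
    ]

def route(incident):
    # P0 pages by phone immediately; unknown grades fall back to the digest.
    return ROUTES.get(incident["severity"], ["daily_digest"])
```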

The article concludes that a robust monitoring and alerting framework is essential for large‑scale big‑data operations and hints at future deep‑dive posts on upgraded monitoring implementations.

monitoring · operations · SRE · alerting · Prometheus · Grafana
Written by

TAL Education Technology

TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.
