
Design and Implementation of a Comprehensive Monitoring System for a Big Data Platform

This article describes the end‑to‑end design of a monitoring system for a large‑scale big‑data platform: its metric hierarchy, data collection methods, visualization dashboards, and alerting mechanisms. Coverage spans physical hosts, Hadoop components, business services, and data layers, built with tools such as Telegraf, Prometheus, and Grafana.

YunZhu Net Technology Team

1 Background

The YunZhu big‑data platform initially relied on scattered monitoring approaches: data collection was inconsistent, metric coverage was incomplete, and there was no unified dashboard. As services grew, this fragmentation hindered platform stability.

2 Overall Design

A data‑warehouse‑style architecture is adopted, separating real‑time dashboards from offline alert analysis. The design includes four layers: physical host, big‑data component, business service, and business data.

2.1 Layered Metric Hierarchy

Physical host layer – CPU, memory, I/O, disk.

Big‑data component layer – HDFS, YARN, Zookeeper, Kafka, ClickHouse, Hive, Trino, etc.

Business service layer – custom services (edata, master data, AI inference).

Business data layer – Hive tables, ClickHouse tables, Elasticsearch indices.
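The four layers above can be modeled as a simple registry. This is an illustrative Python sketch, not the platform's actual metric catalog; the target names are taken from the examples in the text.

```python
from typing import Optional

# Illustrative registry of the four monitoring layers; target names
# mirror the examples above and are not an exhaustive inventory.
METRIC_LAYERS = {
    "physical_host": ["cpu", "memory", "io", "disk"],
    "bigdata_component": ["HDFS", "YARN", "Zookeeper", "Kafka",
                          "ClickHouse", "Hive", "Trino"],
    "business_service": ["edata", "master_data", "ai_inference"],
    "business_data": ["hive_tables", "clickhouse_tables", "es_indices"],
}

def layer_of(target: str) -> Optional[str]:
    """Return the layer a monitored target belongs to, or None."""
    for layer, targets in METRIC_LAYERS.items():
        if target in targets:
            return layer
    return None
```

Keeping the layer assignment in one place makes it easy to route a metric to the right dashboard or alert policy later.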

2.2 Metric Examples

Each component defines specific monitoring items (e.g., HDFS total capacity, YARN running tasks, Kafka broker memory) with severity levels p0, p1, p2.
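A minimal sketch of what such a monitoring-item definition might look like; the metric names and the severity assignments here are assumptions based on the examples just listed, not the platform's real configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricItem:
    """One monitoring item: which component, which metric, how severe."""
    component: str
    name: str
    severity: str  # "p0" critical, "p1" warning, "p2" informational

# Hypothetical items mirroring the examples in the text.
ITEMS = [
    MetricItem("HDFS", "total_capacity_used_pct", "p0"),
    MetricItem("YARN", "running_tasks", "p1"),
    MetricItem("Kafka", "broker_memory_used_pct", "p1"),
]

def items_by_severity(items, severity):
    """Filter the catalog down to one severity level."""
    return [i for i in items if i.severity == severity]
```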

3 Data Collection

3.1 Physical Host Collection

Telegraf agents collect time‑series metrics (CPU, memory, disk, network) from all hosts.
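Telegraf itself is configured declaratively, but the kind of per-host sample it emits can be sketched with the Python standard library. This toy collector only illustrates the shape of the data; it is not how the agents are implemented.

```python
import os
import shutil

def sample_host_metrics(path: str = "/") -> dict:
    """Gather a host sample roughly analogous to a Telegraf report."""
    usage = shutil.disk_usage(path)         # disk totals in bytes
    load1, load5, _ = os.getloadavg()       # 1/5-min CPU load (Unix only)
    return {
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "load_1m": load1,
        "load_5m": load5,
    }
```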

3.2 Big‑Data Suite Collection

Prometheus scrapes JMX exporters for the Hadoop‑ecosystem components. The deployment uses the core Prometheus building blocks: the server, exporters, the Pushgateway, and the client SDK.
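What Prometheus scrapes from an exporter is plain text in the exposition format. The minimal parser below is a sketch (it deliberately skips labeled series and ignores HELP/TYPE metadata) just to show the shape of the data a JMX exporter returns; it is not a complete implementation of the format.

```python
def parse_exposition(text: str) -> dict:
    """Parse unlabeled 'metric_name value' lines from exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        if name and "{" not in name:  # skip labeled series for brevity
            samples[name] = float(value)
    return samples
```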

3.3 Business Service Collection

Spring Boot applications integrate the Prometheus client library to expose service‑level metrics.

3.4 Business Data Collection

Python scripts extract metadata from Hive Metastore (MySQL), ClickHouse system tables, and Elasticsearch APIs, synchronizing them into Hive for reporting.
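The reshaping step of such a sync script can be sketched as a pure function. The input column layout (database, table, size, row count) and the output field names are assumptions for illustration; the real Metastore schema involves joins over its internal tables.

```python
def to_report_rows(metastore_rows, dt):
    """Reshape Metastore query results into records for a Hive
    reporting table partitioned by date.

    metastore_rows: iterable of (db, table, total_size_bytes, num_rows)
    """
    return [
        {
            "dt": dt,
            "db_table": f"{db}.{tbl}",
            "size_gb": round(size / 1024 ** 3, 2),
            "num_rows": rows,
        }
        for db, tbl, size, rows in metastore_rows
    ]
```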

4 Monitoring Visualization

Grafana dashboards aggregate metrics from Prometheus and Elasticsearch, providing real‑time views for operations and weekly business reports via an internal reporting platform.

5 Alerting

5.1 Alert Levels

Alerts are classified by severity: p0 (critical, delivered by phone call plus DingTalk) and p1 (warning, delivered via DingTalk only), with lower levels handled through less intrusive channels. Typical triggers include low free memory or a node going down.
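A severity-to-channel routing table keeps this policy in one place. The channel names below are illustrative, matching the levels described above rather than the team's actual configuration.

```python
# Hypothetical routing: p0 pages by phone and DingTalk, p1 by DingTalk
# only; unknown levels get no channels rather than raising.
ROUTES = {
    "p0": ("phone", "dingtalk"),
    "p1": ("dingtalk",),
}

def channels_for(level: str) -> tuple:
    """Look up the notification channels for an alert severity."""
    return ROUTES.get(level, ())
```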

5.2 Alert Convergence

Efforts are underway to reduce noise by auto‑remediating issues and consolidating alerts.
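One common convergence tactic is to suppress repeats of the same alert fingerprint inside a quiet window, so a flapping check fires once instead of dozens of times. This is a generic sketch of that idea, not the team's actual rule set.

```python
def converge(alerts, window=300):
    """Deduplicate alerts within a quiet window.

    alerts: list of (timestamp_seconds, fingerprint) pairs.
    Returns only the alerts that should actually be sent.
    """
    last_sent = {}
    out = []
    for ts, fp in sorted(alerts):
        # Send if never seen, or if the quiet window has elapsed.
        if fp not in last_sent or ts - last_sent[fp] >= window:
            out.append((ts, fp))
            last_sent[fp] = ts
    return out
```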

5.3 Alert Implementation

Host‑level alerts via Zabbix triggers.

Component alerts via Prometheus Alertmanager.

Business data alerts via custom Python scripts.

Future plans include a unified Kafka‑Flink pipeline for rule‑based alert processing.
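The rule evaluation at the heart of such a pipeline can be sketched independently of Kafka or Flink. The rule fields and operators below are assumptions about a design the article describes as future work.

```python
import operator

# Supported comparison operators for threshold rules (illustrative).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge}

def evaluate(rule, sample):
    """Evaluate one threshold rule against one metric sample.

    rule: {'metric', 'op', 'threshold', 'level'}
    sample: {metric_name: value}
    Returns the alert level if the rule fires, else None.
    """
    value = sample.get(rule["metric"])
    if value is None:
        return None  # metric absent from this sample
    if OPS[rule["op"]](value, rule["threshold"]):
        return rule["level"]
    return None
```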

Tags: monitoring, big data, data collection, alerting, Prometheus, Grafana, Telegraf