
How Prometheus Powers Scalable Monitoring for Massive Big Data Clusters

Facing thousands of nodes in expanding big‑data clusters, the author evaluates legacy monitoring stacks, selects Prometheus + Alertmanager + Grafana, and details its architecture, custom exporters, real‑time alerts, self‑healing mechanisms, and visual dashboards that now support ten large clusters and dozens of services.

dbaplus Community

Background

As the company's business grew, the scale of its big‑data clusters expanded to thousands of physical nodes, making a robust monitoring system essential for rapid fault detection and resolution. After several evaluations, the author chose Prometheus as the core of a new monitoring platform.

Initial Monitoring Stack Evaluation

The first solutions used were Nagios+Ganglia and Zabbix+Grafana. Both had limitations: Nagios required extensive custom development and lacked historical data storage, while Zabbix+Grafana hit storage‑backend bottlenecks and slow queries as the cluster grew.

Optimized Platform Selection

To address these issues, the author adopted a Prometheus + Alertmanager + Grafana stack, citing four main advantages:

Built‑in high‑performance TSDB that handles massive concurrent queries.

PromQL, an expressive query language for selecting and aggregating metrics, which Grafana uses directly to build visualisations.

Go‑based implementation offers excellent runtime efficiency.

Active GitHub community provides rich client libraries.

The architecture is illustrated below:

Key Platform Features

Leverages existing exporters (e.g., Telegraf, mysql_exporter) for out‑of‑the‑box monitoring.

Custom exporters were developed for Hadoop, Yarn, and HBase. They expose metrics such as RPC connection counts, HDFS space usage, Yarn queue performance, and HBase compression stats.

Real‑time messages sent through a DingTalk webhook provide one‑off notifications for operational events.

Self‑healing capabilities automatically detect host connectivity, suppress night‑time alerts, and execute predefined remediation steps for common failures (e.g., Datanode offline, disk errors, NTP anomalies).
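The DingTalk notification feature listed above can be sketched roughly as follows. This is a minimal illustration rather than the author's code: it assumes a DingTalk custom robot that accepts a JSON text message, and the access token, function names, and message wording are placeholders.

```python
import json
from urllib.request import Request, urlopen

# Placeholder: the real token comes from the DingTalk robot settings.
WEBHOOK_URL = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"


def build_text_message(content, at_mobiles=None):
    """Build the JSON body for a plain-text robot message."""
    return {
        "msgtype": "text",
        "text": {"content": content},
        "at": {"atMobiles": list(at_mobiles or []), "isAtAll": False},
    }


def send_message(payload, url=WEBHOOK_URL, timeout=5):
    """POST the payload to the webhook and return the decoded JSON reply."""
    body = json.dumps(payload).encode("utf-8")
    req = Request(url, data=body, headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A call such as send_message(build_text_message("Datanode restarted on host-01")) would deliver one message. Building the payload separately from sending it keeps the message format easy to check without network access.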

Implementation Examples

1. Namenode RPC Open Connections – The metric is fetched from

http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020

and transformed into a Prometheus‑compatible format.
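The transformation step might look like the sketch below. It is an assumption about how such an exporter could work, not the author's implementation: it treats every numeric attribute of the returned JMX beans (for example NumOpenConnections) as a gauge, and the namenode_rpc prefix and function names are made up for illustration.

```python
import json
import re
from urllib.request import urlopen

# URL from the article; host and port depend on the deployment.
JMX_URL = ("http://localhost:50070/jmx"
           "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020")


def to_metric_suffix(name):
    """Turn a JMX attribute like NumOpenConnections into num_open_connections."""
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
    # Replace characters that are not legal in a Prometheus metric name.
    return re.sub(r"[^a-z0-9_]", "_", snake)


def jmx_to_prometheus(jmx_doc, prefix="namenode_rpc"):
    """Render numeric bean attributes in the Prometheus text format."""
    lines = []
    for bean in jmx_doc.get("beans", []):
        for key, value in bean.items():
            # Skip strings (bean name, tags) and booleans.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            metric = f"{prefix}_{to_metric_suffix(key)}"
            lines.append(f"# TYPE {metric} gauge")
            lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


def scrape(url=JMX_URL, timeout=5):
    """Fetch the JMX JSON from the NameNode and return exposition text."""
    with urlopen(url, timeout=timeout) as resp:
        return jmx_to_prometheus(json.load(resp))
```

An exporter built this way would serve the output of scrape() on its own HTTP endpoint for Prometheus to pull.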

2. Yarn Queue Usage – Queue resource usage is obtained via http://localhost:8088/ws/v1/cluster/scheduler and visualised in Grafana.
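As a rough sketch of the queue-usage collection, the code below flattens the scheduler response into one value per queue. It assumes the Capacity Scheduler response layout (scheduler, schedulerInfo, nested queues.queue lists with queueName and usedCapacity); the metric name yarn_queue_used_capacity and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# URL from the article; host and port depend on the deployment.
SCHEDULER_URL = "http://localhost:8088/ws/v1/cluster/scheduler"


def walk_queues(queue_list, parent):
    """Yield (queue_path, used_capacity) for every queue, depth first."""
    for q in queue_list:
        path = f"{parent}.{q['queueName']}"
        yield path, float(q.get("usedCapacity", 0.0))
        children = (q.get("queues") or {}).get("queue", [])
        yield from walk_queues(children, path)


def queue_usage(scheduler_doc):
    """Map each queue path below the root queue to its used capacity."""
    info = scheduler_doc["scheduler"]["schedulerInfo"]
    top = (info.get("queues") or {}).get("queue", [])
    return dict(walk_queues(top, info.get("queueName", "root")))


def render(usage):
    """Render the usage map as labelled gauge samples."""
    return "".join(
        f'yarn_queue_used_capacity{{queue="{path}"}} {value}\n'
        for path, value in sorted(usage.items())
    )


def scrape(url=SCHEDULER_URL, timeout=5):
    """Fetch the scheduler JSON and return exposition text."""
    with urlopen(url, timeout=timeout) as resp:
        return render(queue_usage(json.load(resp)))
```

Using the queue path as a label rather than part of the metric name lets a single Grafana panel plot all queues or filter to one tenant.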

Self‑Healing Workflow

The self‑healing system detects failures, attempts automatic remediation, and notifies operators when manual intervention is required. Notification examples include:

Self‑healing alerts.

Ping‑able but SSH‑unreachable host alerts.

Completely unreachable host alerts.
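The decision logic behind these notification types could be sketched as below. The article does not describe the actual rules, so everything here is an assumption: the quiet-hours window, the remediation table, and the rule that only fully unreachable hosts break through night-time suppression are illustrative choices.

```python
from datetime import time
from enum import Enum


class HostState(Enum):
    REACHABLE = "reachable"
    SSH_UNREACHABLE = "ping ok, ssh unreachable"
    UNREACHABLE = "completely unreachable"


# Hypothetical mapping from alert name to a predefined remediation step.
REMEDIATIONS = {
    "datanode_offline": "restart_datanode",
    "ntp_anomaly": "resync_ntp",
}


def classify_host(ping_ok, ssh_ok):
    """Classify a host from the results of a ping probe and an SSH probe."""
    if ping_ok and ssh_ok:
        return HostState.REACHABLE
    if ping_ok:
        return HostState.SSH_UNREACHABLE
    return HostState.UNREACHABLE


def in_quiet_hours(now, start=time(23, 0), end=time(7, 0)):
    """True when now falls in the quiet window, which may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def decide(alert_name, ping_ok, ssh_ok, now):
    """Return (action, detail): remediate, suppress, or notify."""
    state = classify_host(ping_ok, ssh_ok)
    if state is HostState.REACHABLE and alert_name in REMEDIATIONS:
        return "remediate", REMEDIATIONS[alert_name]
    if in_quiet_hours(now) and state is not HostState.UNREACHABLE:
        return "suppress", None
    return "notify", state.value
```

Separating classification from the decision keeps each probe result testable on its own, and the remediation table can grow without touching the control flow.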

Monitoring Effectiveness

To provide a real‑time overview, a large‑screen dashboard visualises key HDFS, Yarn, and database metrics, as well as alert notifications. Screenshots demonstrate capacity, health, RPC load, and tenant‑specific usage.

Conclusion and Outlook

The platform now monitors ten large big‑data clusters, over 50 databases, and numerous middleware services, ingesting roughly 50,000 data points per second. Precise alert analysis helps operators pre‑empt issues and mitigate risks early. Future work will focus on smarter alert routing, longer‑term storage optimisation, high‑availability enhancements, and extending the solution to additional business domains.
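To put the ingestion figure in context for the storage work mentioned above, here is a back-of-the-envelope estimate. It assumes that "data points" correspond to Prometheus samples, and it uses the sizing rule from the Prometheus storage documentation (retention seconds × samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average). The 15-day retention is the Prometheus default, not a figure from the article.

```python
INGEST_SAMPLES_PER_SECOND = 50_000  # figure quoted in the article
SECONDS_PER_DAY = 86_400


def disk_bytes(retention_days, bytes_per_sample):
    """Estimate local TSDB disk usage for a given retention period."""
    return (retention_days * SECONDS_PER_DAY
            * INGEST_SAMPLES_PER_SECOND * bytes_per_sample)


if __name__ == "__main__":
    for bps in (1, 2):
        gb = disk_bytes(15, bps) / 1e9
        print(f"15 days at {bps} byte(s)/sample: about {gb:.1f} GB")
```

Under these assumptions, 15 days of retention needs on the order of 65 to 130 GB of local disk, which suggests why longer retention would call for a dedicated storage strategy.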
