Building ESPaaS: Real‑Time Elasticsearch Monitoring and Alerting at Scale

Zhongtong’s ESPaaS platform automates deployment, unified monitoring, real‑time alerting, and diagnostic analysis for over 40 Elasticsearch clusters, leveraging custom exporters, Prometheus, Grafana, and DingTalk integrations to track resource, cluster, and node metrics, reduce noise, and prevent production incidents.

dbaplus Community

Feb 24, 2021

Background

Zhongtong has been researching and running Elasticsearch in production since 2015. By 2020 the deployment had grown to more than 40 clusters and over 500 nodes, ingesting roughly 600 billion documents per day, growing by more than 100 TB per day, and holding over 6 PB in total.

Platform Evolution

The growing number and version diversity of clusters made unified management a priority, leading to the development of the ESPaaS (Elasticsearch Platform as a Service) operations platform, which provides automated deployment, centralized monitoring, real‑time alerting, and index management.

Architecture Design

Prometheus was chosen as the core monitoring engine. A custom exporter collects key metrics from each ES cluster via a REST API that returns data per cluster name. Prometheus pulls these metrics, Grafana visualizes them, and alert rules evaluate PromQL expressions to detect anomalies.
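As a rough illustration of what such an exporter does, the sketch below renders one cluster's `_cluster/health` response in the Prometheus text exposition format. The metric names (`es_cluster_status` and so on), the `cluster` label, and the numeric status encoding are assumptions made for this example, not the names ESPaaS uses.

```python
# Hypothetical sketch: turn a _cluster/health document into Prometheus text format.
STATUS_CODES = {"green": 0, "yellow": 1, "red": 2}  # assumed encoding

# (metric name, help text, key in the _cluster/health response)
GAUGES = [
    ("es_cluster_nodes", "Number of nodes in the cluster", "number_of_nodes"),
    ("es_cluster_active_shards", "Number of active shards", "active_shards"),
    ("es_cluster_unassigned_shards", "Number of unassigned shards", "unassigned_shards"),
]


def render_cluster_health(health: dict) -> str:
    """Return exposition text for one cluster's health document."""
    # Note: label values are not escaped here; cluster names are assumed to be plain.
    label = '{cluster="%s"}' % health["cluster_name"]
    lines = [
        "# HELP es_cluster_status Cluster health (0=green, 1=yellow, 2=red)",
        "# TYPE es_cluster_status gauge",
        "es_cluster_status%s %d" % (label, STATUS_CODES[health["status"]]),
    ]
    for name, help_text, key in GAUGES:
        lines.append("# HELP %s %s" % (name, help_text))
        lines.append("# TYPE %s gauge" % name)
        lines.append("%s%s %d" % (name, label, health[key]))
    return "\n".join(lines) + "\n"


sample = {
    "cluster_name": "logs-prod",  # made-up cluster name
    "status": "yellow",
    "number_of_nodes": 12,
    "active_shards": 840,
    "unassigned_shards": 3,
}
print(render_cluster_health(sample))
```

Labelling every sample with the cluster name lets a single set of PromQL rules and Grafana panels cover all clusters, with the cluster chosen by a label filter.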

Alert notifications are sent to DingTalk groups, with priority levels, delayed alerts to suppress transient spikes, and a diagnostic module that combines real‑time data with historical trends.
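A notification of this kind could be assembled roughly as follows. This is a sketch that assumes a DingTalk custom robot webhook accepting a `text` message; the priority labels (`P0`, `P1`), the message wording, and the function names are hypothetical.

```python
import json
import urllib.request


def build_dingtalk_text(cluster: str, priority: str, summary: str, at_all: bool = False) -> dict:
    """Build the JSON body of a DingTalk custom-robot text message."""
    content = "[%s] Elasticsearch alert on %s: %s" % (priority, cluster, summary)
    return {
        "msgtype": "text",
        "text": {"content": content},
        "at": {"isAtAll": at_all},  # mention everyone only for the highest priorities
    }


def build_request(webhook_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the HTTP POST for the webhook."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_dingtalk_text("logs-prod", "P1", "cluster status is red", at_all=True)
print(json.dumps(payload, indent=2))
# To send: urllib.request.urlopen(build_request(webhook_url, payload), timeout=5)
```

Putting the priority at the front of the message keeps it visible in the group chat preview, which is one simple way to let on-call staff triage at a glance.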

Monitoring & Alerting Features

Real‑time cluster monitoring

Alert output

Cluster diagnosis

Metrics Collected

Resource metrics: CPU, memory, network, disk of ES nodes

Cluster‑level health status

Node‑level JVM and thread‑pool statistics
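For the node-level metrics, a collector might flatten the `_nodes/stats` response into one row per node, as sketched below. The selection of fields and the output layout are assumptions for illustration. The `write` thread pool name applies to recent Elasticsearch versions; older releases called it `bulk`, which matters in a fleet with mixed versions.

```python
def summarize_node_stats(nodes_stats: dict) -> list:
    """Flatten a _nodes/stats response into one summary row per node."""
    rows = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        pools = node.get("thread_pool", {})
        rows.append({
            "node": node.get("name", node_id),
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "write_rejected": pools.get("write", {}).get("rejected", 0),
            "search_queue": pools.get("search", {}).get("queue", 0),
        })
    # Highest heap usage first, so likely hot nodes appear at the top.
    return sorted(rows, key=lambda r: r["heap_used_percent"], reverse=True)


# Made-up sample in the shape of a _nodes/stats response (trimmed to the fields used).
sample = {
    "nodes": {
        "a1": {
            "name": "es-data-01",
            "jvm": {"mem": {"heap_used_percent": 58}},
            "thread_pool": {"write": {"rejected": 0}, "search": {"queue": 2}},
        },
        "b2": {
            "name": "es-data-02",
            "jvm": {"mem": {"heap_used_percent": 83}},
            "thread_pool": {"write": {"rejected": 17}, "search": {"queue": 40}},
        },
    }
}
for row in summarize_node_stats(sample):
    print(row)
```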

Dashboard Examples

(Images omitted for brevity)

Diagnostic Module

The diagnostic component quantifies core indicators and standardizes troubleshooting procedures across five dimensions (illustrated in the original diagram). It surfaces potential issues early, for example by informing capacity planning and prompting index design adjustments, which helps prevent production failures.

Practical Experience

Common issues and solutions include:

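Write failures due to read‑only indices – caused by nodes with disk usage > 90 %. When a node crosses the flood‑stage disk watermark (95 % by default, configurable), Elasticsearch sets a read_only_allow_delete block on indices that have shards on that node. Free disk space first, because the block is re-applied while the node stays above the watermark, then clear the block: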

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete":"false"
}
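Before clearing the block, it helps to confirm which nodes are still above the threshold. The sketch below filters rows from `GET _cat/allocation?format=json`; the function name and the sample node names are hypothetical, and the 90 % default simply mirrors the figure quoted above.

```python
def nodes_over_disk_threshold(allocation_rows: list, threshold_percent: int = 90) -> list:
    """Return (node, disk_percent) pairs above the threshold, fullest first."""
    flagged = []
    for row in allocation_rows:
        raw = row.get("disk.percent")
        if raw is None:  # the UNASSIGNED row carries no disk figures
            continue
        percent = int(raw)  # the cat API returns numbers as strings in JSON
        if percent > threshold_percent:
            flagged.append((row["node"], percent))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up rows in the shape returned by _cat/allocation?format=json.
rows = [
    {"node": "es-data-01", "disk.percent": "71"},
    {"node": "es-data-03", "disk.percent": "92"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_over_disk_threshold(rows))  # -> [('es-data-03', 92)]
```

If the list is not empty, freeing space on those nodes comes before the settings update shown above.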

Interpreting monitoring dashboards – metrics reveal cluster health (green), shard distribution, GC activity during peak hours, heap usage, and potential hot nodes.

Handling frequent alerts – initial one‑minute checks caused alert storms; introducing escalating alert intervals reduced noise.

Delayed alerts for self‑recovering issues – short‑lived red/yellow states from index reopening are silenced if the condition resolves within a configurable window.
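The last two items, escalating intervals and delayed alerts, can be combined in one small gate in front of the notifier. The following is a minimal sketch, not the ESPaaS implementation: the class name, the 180-second pending window, and the repeat intervals are all assumed values. (Prometheus alert rules offer a `for` clause that covers the delay part natively.)

```python
class AlertGate:
    """Decide when a persistent condition should actually notify."""

    def __init__(self, pending_seconds=180, repeat_intervals=(300, 900, 3600)):
        self.pending_seconds = pending_seconds    # how long a condition must persist
        self.repeat_intervals = repeat_intervals  # growing gaps between repeats
        self.first_seen = None  # when the condition started
        self.last_sent = None   # when we last notified
        self.sent_count = 0

    def observe(self, firing: bool, now: float) -> bool:
        """Record one check result; return True if a notification should go out."""
        if not firing:
            # Condition cleared: reset, so a short blip never notifies.
            self.first_seen = self.last_sent = None
            self.sent_count = 0
            return False
        if self.first_seen is None:
            self.first_seen = now
        if self.last_sent is None:
            due = now - self.first_seen >= self.pending_seconds
        else:
            # Wait longer after each notification; reuse the last interval once exhausted.
            idx = min(self.sent_count - 1, len(self.repeat_intervals) - 1)
            due = now - self.last_sent >= self.repeat_intervals[idx]
        if due:
            self.last_sent = now
            self.sent_count += 1
        return due


# A condition that stays red for 30 minutes, checked once a minute.
gate = AlertGate(pending_seconds=180, repeat_intervals=(300, 900))
sent_at = [t for t in range(0, 1801, 60) if gate.observe(True, t)]
print(sent_at)  # -> [180, 480, 1380]
```

With one-minute checks this sends three messages in half an hour instead of thirty, and a red or yellow state that clears within three minutes sends none.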

Future Directions

Containerizing Elasticsearch with Kubernetes for stateful service deployment.

Providing an ES‑Proxy to expose search capabilities as a standardized service, abstracting cluster details from users.

