
How Zhongtong Scaled Elasticsearch Monitoring with ESPaaS: Architecture, Alerts, and Diagnosis

Zhongtong built the ESPaaS platform to automate deployment, unify monitoring, and provide real‑time alerting and diagnosis for more than 40 Elasticsearch clusters holding petabytes of data. This article describes how the platform combines Prometheus, Grafana, and DingTalk, and shares practical lessons learned.

dbaplus Community

Aug 24, 2020

Background

Since 2015 Zhongtong has operated Elasticsearch in production. By July 2020 the fleet comprised more than 40 clusters and 500+ nodes, ingesting ~600 billion documents per day and storing over 6 PB of data. The scale and version diversity required a unified management platform, ESPaaS (Elasticsearch Platform as a Service).

Architecture Design

The monitoring core uses Prometheus with a custom exporter that exposes key ES metrics via a REST endpoint keyed by cluster name. Prometheus scrapes the exporter, stores time‑series data, and Grafana visualises the metrics. Alert rules written in PromQL evaluate anomalies per cluster. Alerts are sent to DingTalk with priority handling and a delayed‑alert mechanism that suppresses transient spikes. A diagnosis module merges real‑time data with historical trends to surface potential issues.

Monitoring Dimensions

Resource level – CPU, memory, network, and disk usage of ES host machines.

Cluster level – overall health status of each Elasticsearch cluster.

Node level – JVM heap, thread‑pool statistics, and other per‑node indicators that affect cluster health.
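As a rough illustration of the node‑level dimension, the sketch below scans a `_nodes/stats` response for nodes whose JVM heap usage is at or above a limit and reports their write thread‑pool rejections. The 75 % limit and the function name are assumptions for the example; the `write` thread pool is named `bulk` or `index` on older Elasticsearch versions, so a mixed‑version fleet would need to account for that.

```python
def node_pressure(nodes_stats: dict, heap_limit: int = 75) -> list:
    """Return (node_name, heap_used_percent, write_rejections) for nodes
    whose heap usage is at or above heap_limit, highest heap first."""
    flagged = []
    for node in nodes_stats["nodes"].values():
        heap = node["jvm"]["mem"]["heap_used_percent"]
        # The "write" pool may be absent on older versions; default to 0.
        rejected = node["thread_pool"].get("write", {}).get("rejected", 0)
        if heap >= heap_limit:
            flagged.append((node["name"], heap, rejected))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```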

Dashboard Overview

Grafana dashboards display real‑time cluster health, shard distribution, garbage‑collection activity, and resource utilisation, enabling operators to quickly assess performance and locate bottlenecks.

Cluster Diagnosis

Diagnosis quantifies core metrics into five dimensions: capacity, performance, stability, resource usage, and error patterns. Standardised troubleshooting steps allow the platform to predict disk‑usage trends, recommend pre‑emptive scaling, and avoid production incidents.
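The article does not say how ESPaaS predicts disk‑usage trends. One simple way to approximate the idea is a least‑squares line through daily usage samples, extrapolated to a threshold. The sketch below makes that assumption; the 85 % default mirrors Elasticsearch's default low disk watermark, and the function name is hypothetical.

```python
def days_until_threshold(daily_usage_pct, threshold_pct=85.0):
    """Fit a least-squares line to daily disk-usage samples (percent) and
    return the number of days after the latest sample until the line reaches
    threshold_pct. Returns None when usage is flat or falling."""
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # The latest sample sits at x = n - 1; measure the distance from there.
    crossing = (threshold_pct - intercept) / slope
    return max(0.0, crossing - (n - 1))
```

A result of a few days would be a signal to scale out or clean up before the watermark is reached; a linear fit is only a first approximation and will lag behind sudden ingestion changes.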

Practical Experience

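Write failures caused by read‑only indices – Once a node's disk usage crosses the flood‑stage watermark (95 % by default, configurable via cluster.routing.allocation.disk.watermark.flood_stage; the original write‑up cites usage above 90 %), Elasticsearch sets index.blocks.read_only_allow_delete on every index that has a shard on that node, and writes to those indices are rejected. Elasticsearch 7.4 and later release the block automatically once usage falls below the high watermark; on earlier versions, free disk space first and then reset the flag with the following request: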

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": "false"
}

Interpreting dashboard metrics – Example: a 1‑master + 13‑data node cluster shows green status, low shard count, peak GC during business hours, and heap usage at 50‑70 %, indicating no immediate resource pressure.

Handling frequent alerts – Initial one‑minute alert intervals caused alert storms. Introducing escalating intervals (e.g., repeat after 30 minutes, then longer) reduces noise.
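The escalating schedule can be sketched as a small function. The source only states that the first repeat comes after 30 minutes and later ones take longer; the doubling factor and the four‑hour cap below are assumptions chosen for illustration.

```python
def repeat_delay_minutes(repeat_count: int, base: int = 30,
                         factor: int = 2, cap: int = 240) -> int:
    """Minutes to wait before re-sending an alert that is still firing.

    repeat_count is the number of repeats already sent (0 for the first).
    """
    if repeat_count < 0:
        raise ValueError("repeat_count must be non-negative")
    return min(base * factor ** repeat_count, cap)


if __name__ == "__main__":
    # Delay before the first five repeats of a still-firing alert.
    print([repeat_delay_minutes(i) for i in range(5)])
```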

Delayed‑alert design – Short‑lived red/yellow states (e.g., during index reopening) are filtered by a configurable delay; if the metric recovers within the window, no alert is emitted.
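A minimal sketch of the delay filter follows, assuming health samples arrive with a timestamp in seconds. The class and method names are hypothetical; Prometheus alerting rules offer a comparable mechanism through the `for` clause, and the article does not say which approach ESPaaS uses.

```python
class DelayedAlert:
    """Emit an alert only if a non-green state persists for delay_seconds."""

    def __init__(self, delay_seconds: float):
        self.delay_seconds = delay_seconds
        self.bad_since = None   # time the current bad streak started
        self.fired = False      # whether this streak has already alerted

    def observe(self, status: str, now: float) -> bool:
        """Feed one health sample; return True exactly when an alert is due."""
        if status == "green":
            # Recovery inside the window: reset and stay silent.
            self.bad_since = None
            self.fired = False
            return False
        if self.bad_since is None:
            self.bad_since = now
        if not self.fired and now - self.bad_since >= self.delay_seconds:
            self.fired = True
            return True
        return False
```

For example, with a 120‑second delay, a yellow state that recovers after 60 seconds produces no alert, while a red state that is still present 120 seconds after it began produces exactly one.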

Future Directions

Containerising Elasticsearch with Kubernetes to improve stateful service management.

Developing an ES‑Proxy layer that exposes search capabilities as a standardized service, abstracting cluster topology from end users.
