
How We Built a Scalable Multi‑Dimensional Monitoring Platform with Prometheus and M3DB

This article details the redesign of an internal monitoring system, explaining why the original zzmonitor fell short, how Prometheus and its ecosystem were selected, the architecture that integrates remote storage with M3DB, performance benchmarks, Grafana visualisation, and a custom alerting solution.

ITPUB

Background

zzmonitor, the early in-house monitoring system, supported only four aggregation functions (SUM, MAX, MIN, AVG) and stored its data in MySQL, spread across 128 tables and retained for at most seven days. As business volume grew, this design led to limited functionality, inflexible APIs, poor time-series storage performance, and high maintenance cost.

Research and Selection

We evaluated three open-source monitoring projects (Cat, Nightingale, and Prometheus) against criteria such as community activity, data reporting model, and storage options. We chose Prometheus for its flexible pull model, rich exporter ecosystem, and active community.

Prometheus Capabilities

3.1 Ecosystem

Prometheus ships with a single-node local TSDB, pulls (scrapes) metrics over HTTP, and finds its targets through service discovery. Its extensive exporter library allows rapid integration of most middleware.

3.2 Counter

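Counters are monotonically increasing values (e.g., request count). QPS is derived from how fast a counter grows over a time window. A counter returns to zero when the process restarts; PromQL functions such as rate() and increase() treat any drop in value as a reset and compensate by adding the pre-reset value to the post-reset samples.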

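import io.prometheus.client.Counter;
// Register once (typically as a static field), then increment on each upload.
Counter counter = Counter.build().name("upload_picture_total").help("Upload picture count").register();
counter.inc();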
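The reset handling described above can be illustrated with a small, library-free sketch. It is a simplified model of reset-aware counter arithmetic, not Prometheus's actual implementation (which also extrapolates to the window boundaries); the class and method names are hypothetical.

```java
/** Illustrative only: simplified reset-aware arithmetic over counter samples. */
final class CounterIncrease {
    private CounterIncrease() {}

    /**
     * Total increase across samples ordered by time. A sample lower than its
     * predecessor is treated as a counter reset (for example a process restart),
     * so the counter is assumed to have restarted from zero.
     */
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double curr = samples[i];
            total += (curr >= prev) ? (curr - prev) : curr;
        }
        return total;
    }

    /** Average per-second rate (QPS) over a window of the given length. */
    static double ratePerSecond(double[] samples, double windowSeconds) {
        return increase(samples) / windowSeconds;
    }
}
```

For the samples 100, 130, 160, 5, 25 the increase is 30 + 30 + 5 + 20 = 85: the drop from 160 to 5 is counted as a restart from zero rather than as a negative delta.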

3.3 Gauge

Gauges represent values that can go up or down (e.g., memory usage, active threads). The current value is plotted as-is; no rate calculation is needed.

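import io.prometheus.client.Gauge;
// Register once, then overwrite with the latest observed value.
Gauge gauge = Gauge.build().name("active_thread_num").help("Active thread count").register();
gauge.set(20);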

3.4 Histogram

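Histograms sort observations into configurable buckets to show their distribution. Buckets are cumulative: each one counts the observations less than or equal to its upper bound, and the client adds a +Inf bucket automatically. The histogram also tracks the sum and the total count of observations, which together give the average latency. Example bucket configuration and observation: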

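import io.prometheus.client.Histogram;
// Bucket upper bounds are 10, 20, 30, 40; a +Inf bucket is added automatically.
Histogram histogram = Histogram.build().name("http_request_cost").help("HTTP request latency").buckets(10,20,30,40).register();
histogram.observe(20);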

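By querying how the bucket counts change over a time window, we can estimate percentile values such as TP99 (for example with PromQL's histogram_quantile()). The result is an approximation whose accuracy depends on how well the bucket boundaries match the real latency distribution.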
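To make the percentile estimate concrete, here is a minimal sketch of quantile estimation from cumulative buckets, using linear interpolation inside the bucket that contains the target rank. It follows the general idea behind histogram_quantile() but is a simplified illustration; the class name, method name, and sample counts are hypothetical.

```java
/** Illustrative only: estimates a quantile from cumulative histogram buckets. */
final class HistogramQuantile {
    private HistogramQuantile() {}

    /**
     * @param q                quantile in [0, 1], e.g. 0.99 for TP99
     * @param upperBounds      ascending upper bounds; the last must be +Infinity
     *                         and at least one finite bound must precede it
     * @param cumulativeCounts observations less than or equal to each bound
     */
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        long total = cumulativeCounts[last];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls in the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        long below = (i == 0) ? 0L : cumulativeCounts[i - 1];
        long inBucket = cumulativeCounts[i] - below;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
}
```

Because the estimate assumes an even spread inside each bucket, a TP99 that lands in a wide bucket can be off by a large share of that bucket's width. This is why bucket boundaries should be chosen around the latencies that matter.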

3.5 Multi‑Dimensional Labels

Metrics can carry arbitrary label dimensions. For instance, a counter with method and uri labels allows aggregation per HTTP method or per endpoint.

import io.prometheus.client.Counter;
// Label values are passed in the same order as labelNames.
Counter counter = Counter.build().name("http_request").labelNames("method","uri").help("HTTP request count").register();
counter.labels("POST","/addGoods").inc();
counter.labels("GET","/getGoods").inc();
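Each distinct combination of label values is its own time series, and aggregation simply sums series that share the chosen label. The sketch below models that in plain Java, roughly what a PromQL `sum by (method)` does at query time; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: sums labelled series by one label dimension. */
final class LabelAggregation {
    private LabelAggregation() {}

    /**
     * Each key in {@code series} holds the label values of one time series,
     * in the same order as the declared label names (here: method, uri).
     *
     * @param labelIndex position of the label to group by
     */
    static Map<String, Double> sumBy(Map<List<String>, Double> series, int labelIndex) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Map.Entry<List<String>, Double> entry : series.entrySet()) {
            String groupKey = entry.getKey().get(labelIndex);
            totals.merge(groupKey, entry.getValue(), Double::sum);
        }
        return totals;
    }
}
```

Calling `sumBy(series, 0)` groups by method and `sumBy(series, 1)` groups by uri. Since every label combination creates a separate series, high-cardinality labels such as user IDs multiply storage cost.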

3.6 Accuracy Trade‑offs

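Prometheus deliberately sacrifices a small amount of precision (e.g., functions such as rate() extrapolate linearly where samples are missing at the edges of a window) to gain reliability, simplicity, and lower operational overhead. It is well suited to trends and alerting, but not to exact accounting.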

Architecture Design

4.1 Remote Storage

Prometheus' built-in TSDB is a single-node store and is not suited to long-term retention, so we adopted M3DB, a distributed time-series database that can serve as Prometheus remote storage through the remote-write protocol. M3DB provides high compression and horizontal scalability.

4.2 Official Path

In the standard setup, business services register with a service registry; Prometheus discovers them, pulls their metrics, and forwards the samples to M3DB via the remote-write protocol.

4.3 Client Design

We built a lightweight client that implements the Prometheus remote‑write protocol (ProtoBuf + HTTP) and pushes metrics asynchronously in batches directly to M3DB, bypassing the pull step.
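The asynchronous, batched push could be structured roughly as follows. This is a hypothetical sketch, not the team's actual client: the class names, buffer capacity, and batch size are assumptions, and the transport is abstracted behind a `Consumer`. In a real client that sender would encode the batch as a remote-write ProtoBuf message and POST it over HTTP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch of an asynchronous, batching push client. */
final class BatchingPusher {
    /** One sample: metric name, value, and timestamp in milliseconds. */
    static final class Sample {
        final String name;
        final double value;
        final long timestampMs;

        Sample(String name, double value, long timestampMs) {
            this.name = name;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    private final BlockingQueue<Sample> buffer;
    private final int batchSize;
    private final Consumer<List<Sample>> sender; // stands in for the HTTP transport

    BatchingPusher(int capacity, int batchSize, Consumer<List<Sample>> sender) {
        this.buffer = new LinkedBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Non-blocking enqueue from business threads; returns false if the buffer is full. */
    boolean record(Sample sample) {
        return buffer.offer(sample);
    }

    /** Drains up to batchSize samples and hands them to the sender; returns the count sent. */
    int flushOnce() {
        List<Sample> batch = new ArrayList<>(batchSize);
        buffer.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            sender.accept(batch);
        }
        return batch.size();
    }
}
```

A background thread (for example a `ScheduledExecutorService`) would call `flushOnce()` periodically. Using `offer` rather than `put` means a full buffer drops samples instead of blocking business threads, which trades a little monitoring data for application safety.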

4.4 Final Architecture

Business services use the custom client to push metrics; middleware still uses exporters. This eliminates the need for a separate service registry and reduces the number of Prometheus instances.

4.5 Performance Test

Benchmarks show that the client sustains tens of millions of operations per second on a single thread, which corresponds to a per-operation cost in the tens of nanoseconds. Single-thread results: Counter ≈ 43 M ops/s (about 23 ns per operation), Gauge ≈ 41 M ops/s (about 24 ns), Histogram ≈ 26 M ops/s (about 38 ns). Memory usage grows with label count (e.g., 500 labels ≈ 381 KB for Histogram).

Implementation

5.1 Grafana Planning

All environments share a single Grafana instance. Dashboards are organized into four dimensions: Business Overview, Business Services, Architecture Components, and Operations Components.

5.2 Data Integration

Sensitive business metrics are integrated with internal authentication systems (user, service, and permission services) to enforce access control.

5.3 Enterprise WeChat Authentication

Grafana sits behind an Nginx reverse proxy that injects a user header after successful Enterprise WeChat QR-code authentication, giving internal users seamless SSO.

5.4 Panel Auto‑Initialization

Dashboard JSON templates are pre‑defined; a single line of JSON inserts a new panel for a given metric type.

{"panels":[{"title":"Business Metrics","panels":[]},{"title":"JVM","panels":[]},{"title":"Log Monitoring","panels":[]}]}

5.5 Automatic Graph Generation

For each metric type the system automatically creates appropriate Grafana panels: Counter → QPS, Increment, Interval Increment; Gauge → Raw points; Histogram → Average, TP99, QPS, Increment, Interval Increment, Distribution, Line Chart, Heatmap.
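The mapping from metric type to panels can be sketched as a simple lookup. The panel names come from the list above; the PromQL strings and the 1m window are assumptions, showing common ways to express QPS, average, and TP99 rather than the queries the platform actually generates. The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: picks panels and example queries for a metric type. */
final class PanelPlanner {
    enum MetricType { COUNTER, GAUGE, HISTOGRAM }

    private PanelPlanner() {}

    /** Panel names generated for each metric type, as listed in the article. */
    static List<String> panelsFor(MetricType type) {
        switch (type) {
            case COUNTER:
                return Arrays.asList("QPS", "Increment", "Interval Increment");
            case GAUGE:
                return Arrays.asList("Raw points");
            case HISTOGRAM:
                return Arrays.asList("Average", "TP99", "QPS", "Increment",
                        "Interval Increment", "Distribution", "Line Chart", "Heatmap");
            default:
                throw new IllegalArgumentException("Unknown metric type: " + type);
        }
    }

    /** Assumed QPS query: per-second growth of a counter series. */
    static String qpsQuery(String series) {
        return "sum(rate(" + series + "[1m]))";
    }

    /** Assumed average query: growth of the sum divided by growth of the count. */
    static String averageQuery(String histogram) {
        return "sum(rate(" + histogram + "_sum[1m])) / sum(rate(" + histogram + "_count[1m]))";
    }

    /** Assumed TP99 query: quantile estimated from per-bucket rates. */
    static String tp99Query(String histogram) {
        return "histogram_quantile(0.99, sum(rate(" + histogram + "_bucket[1m])) by (le))";
    }
}
```

For the `http_request_cost` histogram shown earlier, `tp99Query` yields `histogram_quantile(0.99, sum(rate(http_request_cost_bucket[1m])) by (le))`.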

5.6 Template Room

A “template room” provides ready‑made panels that can be copied with minimal adjustments to cover most business monitoring needs.

Alerting System

6.1 Background

Grafana’s built‑in alerting (pre‑8.0) was limited, and the newer ngalert showed performance bottlenecks when handling thousands of alerts.

6.2 Design

We built a custom alert engine that stores generated PromQL statements in MySQL, schedules evaluation via xxl‑job shards, and queries M3DB to determine alert firing.
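One way to picture the sharded evaluation is shown below. It is a sketch under several assumptions: rules are assigned to shards by rule ID modulo the shard total, an alert fires when the query result exceeds a threshold, and the M3DB query is abstracted as a function from PromQL text to a number. The class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of shard-based alert rule evaluation. */
final class AlertEvaluator {
    /** One stored alert rule: generated PromQL plus a firing threshold. */
    static final class Rule {
        final long id;
        final String promql;
        final double threshold;

        Rule(long id, String promql, double threshold) {
            this.id = id;
            this.promql = promql;
            this.threshold = threshold;
        }
    }

    private AlertEvaluator() {}

    /** Rules owned by one shard: rule id modulo shardTotal equals shardIndex. */
    static List<Rule> rulesForShard(List<Rule> all, int shardIndex, int shardTotal) {
        List<Rule> mine = new ArrayList<>();
        for (Rule rule : all) {
            if (Math.floorMod(rule.id, (long) shardTotal) == shardIndex) {
                mine.add(rule);
            }
        }
        return mine;
    }

    /** Runs each rule's query and returns the rules whose value exceeds the threshold. */
    static List<Rule> firing(List<Rule> rules, ToDoubleFunction<String> query) {
        List<Rule> result = new ArrayList<>();
        for (Rule rule : rules) {
            if (query.applyAsDouble(rule.promql) > rule.threshold) {
                result.add(rule);
            }
        }
        return result;
    }
}
```

Each scheduled job instance would load the rules from MySQL, keep only its own shard, and evaluate them. Adding executors then spreads thousands of rules across instances without any coordination beyond the shard index and total.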

6.3 Result

Business users can configure alerts with a few clicks; the system automatically generates the required PromQL and evaluates it efficiently.

Final Effect

7.1 Business Service View

All‑in‑one dashboards show service‑level metrics such as request latency, QPS, and resource usage.

7.2 Architecture Component View

Component-level panels monitor architecture components such as thread pools, logs, and Redis connections.

7.3 Operations Component View

Exporter‑based panels expose infrastructure health for Nginx, MySQL, and host machines.

7.4 Business Dashboard

A global business overview aggregates key indicators across all services.

Conclusion

By leveraging open‑source projects and extending them for internal needs, we delivered a unified, extensible monitoring platform that is simple to use, low‑maintenance, and continuously improvable. Since launch, the system has been adopted across all business lines and received positive feedback.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
