How We Built a Scalable 3‑Layer Monitoring Platform with Prometheus, M3DB, and Grafana

This article details the design and implementation of a three‑dimensional monitoring system that replaces an outdated custom solution with Prometheus, M3DB remote storage, and Grafana, covering data model choices, metric types, architecture, performance testing, automatic dashboard generation, and a custom alerting service.

dbaplus Community

Background

ZZMonitor, our earlier in-house monitoring system, supported only four aggregation functions (SUM, MAX, MIN, AVG) and stored data in MySQL, spread across 128 tables and retained for at most seven days. The result was limited functionality, an inflexible API, a poor architecture, and high maintenance cost.

[Figure: ZZMonitor monitoring system]

Investigation and Selection

We evaluated Cat, Nightingale, and Prometheus. Cat excels at tracing, Nightingale resembles a simplified Prometheus+Grafana, while Prometheus offers a flexible query language (PromQL) and a rich exporter ecosystem. We selected Prometheus as the core monitoring engine.

[Figure: Prometheus selection]

Prometheus Capabilities

1. Ecosystem

Prometheus ships with a single‑node TSDB and uses a pull model to scrape metrics exposed over HTTP. Exporters exist for most middleware, enabling rapid monitoring setup. Grafana provides powerful visualization.

[Figure: Prometheus architecture]

2. Counter

A monotonically increasing metric used for request counts, GC cycles, etc. Prometheus stores the current total; QPS and increments are derived from differences.

Counter counter = Counter.build().name("upload_picture_total").help("Number of uploaded pictures").register();
counter.inc();

When a server restarts, the counter resets to zero. PromQL functions such as rate() and increase() detect the drop and compensate by adding the pre-reset value to the post-reset samples, so the computed increments stay correct.
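The reset handling can be illustrated with a small sketch. It is not Prometheus source code: the class and method names are made up for illustration, and it assumes that any drop between two consecutive samples means the process restarted from zero.

```java
// Simplified sketch of reset-aware counter increase; not Prometheus source code.
final class CounterIncrease {

    /**
     * Sums the increase across consecutive samples of one counter series.
     * A drop between two samples is treated as a restart from zero, so the
     * whole post-reset value counts as new increase.
     */
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double curr = samples[i];
            total += (curr >= prev) ? (curr - prev) : curr;
        }
        return total;
    }

    public static void main(String[] args) {
        // The counter reaches 120, the process restarts, then it counts up to 15.
        double[] samples = {100, 110, 120, 5, 15};
        System.out.println(increase(samples)); // 35.0
    }
}
```

A naive "last minus first" would give -85 for this series, while the reset-aware sum gives 35.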

[Figure: Counter example]

3. Gauge

A metric that can go up or down, suitable for memory usage, active threads, etc. Grafana typically displays the raw value.

Gauge gauge = Gauge.build().name("active_thread_num").help("Number of active threads").register();
gauge.set(20);

[Figure: Gauge example]

4. Histogram

Used for distribution statistics. Buckets must be defined; each observation increments the appropriate bucket and updates sum and count.

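Histogram histogram = Histogram.build().name("http_request_cost").help("HTTP request latency").buckets(10, 20, 30, 40).register();
histogram.observe(20);

// Simplified view of what observe() does inside the client library.
// A final +Inf bucket is appended automatically, so every value matches a bucket.
public void observe(double value) {
    for (int i = 0; i < bucket.length; ++i) {
        // Only the first bucket whose upper bound (le) covers the value is incremented;
        // cumulative bucket counts and the total count are derived when metrics are exported.
        if (value <= bucket[i].le) {
            bucket[i].add(1);
            break;
        }
    }
    sum.add(value);
}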

Aggregating bucket counts over a time window yields distribution charts, TP values, and average latency.
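As a rough illustration of how a TP value and an average fall out of bucket data, the sketch below estimates a quantile by linear interpolation inside the matching bucket, which is the general approach PromQL's histogram_quantile takes. It is a simplified stand-in rather than the real implementation, and the bucket counts are hypothetical.

```java
// Simplified sketch of bucket-based quantile estimation; not Prometheus source code.
final class HistogramStats {

    /**
     * Estimates quantile q (0 < q <= 1) from cumulative bucket counts.
     * upperBounds must be ascending, hold at least two entries, and end with
     * Double.POSITIVE_INFINITY; cumulative[i] counts observations <= upperBounds[i].
     */
    static double quantile(double q, double[] upperBounds, double[] cumulative) {
        int last = upperBounds.length - 1;
        double total = cumulative[last];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        int b = 0;
        while (b < last && cumulative[b] < rank) {
            b++;
        }
        if (b == last) {
            // The rank falls into the +Inf bucket: report the highest finite bound.
            return upperBounds[last - 1];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double below = (b == 0) ? 0.0 : cumulative[b - 1];
        double inBucket = cumulative[b] - below;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }

    /** Average latency is the summed latency divided by the observation count. */
    static double average(double sum, double count) {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        double[] le = {10, 20, 30, 40, Double.POSITIVE_INFINITY};
        double[] cumulative = {50, 80, 95, 99, 100}; // hypothetical counts
        System.out.println(quantile(0.5, le, cumulative)); // 10.0
        System.out.println(average(1500, 100));            // 15.0
    }
}
```

Because the estimate interpolates within a bucket, its accuracy depends on how well the bucket boundaries match the real latency distribution.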

[Figure: Histogram buckets]

5. Multi‑dimensional Labels

Metrics can carry arbitrary label sets, enabling flexible queries and aggregations.

Counter counter = Counter.build().name("http_request").labelNames("method", "uri").help("Number of HTTP requests").register();
counter.labels("POST", "/addGoods").inc();
counter.labels("GET", "/getGoods").inc();

[Figure: Labelled counter table]
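To make label-based aggregation concrete, here is a minimal sketch of what a query like "sum by (method)" does conceptually. The Series type, the sumBy helper, and the /listGoods series are hypothetical and exist only for this example; a real query runs inside the query engine, not in client code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of "sum by (label)"; the Series type is hypothetical.
final class LabelAggregation {

    /** One time series: a label set plus its current value. */
    static final class Series {
        final Map<String, String> labels = new HashMap<>();
        final double value;

        /** labelPairs alternates label names and values, e.g. "method", "GET". */
        Series(double value, String... labelPairs) {
            this.value = value;
            for (int i = 0; i + 1 < labelPairs.length; i += 2) {
                labels.put(labelPairs[i], labelPairs[i + 1]);
            }
        }
    }

    /** Groups series by one label and sums their values, dropping all other labels. */
    static Map<String, Double> sumBy(String labelName, Series... series) {
        Map<String, Double> totals = new TreeMap<>();
        for (Series s : series) {
            String key = s.labels.getOrDefault(labelName, "");
            totals.merge(key, s.value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Double> byMethod = sumBy("method",
                new Series(3, "method", "POST", "uri", "/addGoods"),
                new Series(5, "method", "GET", "uri", "/getGoods"),
                new Series(2, "method", "GET", "uri", "/listGoods")); // hypothetical series
        System.out.println(byMethod); // {GET=7.0, POST=3.0}
    }
}
```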

6. Accuracy Trade‑off

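Prometheus deliberately trades a small amount of data precision for higher reliability and simpler operations. For example, when calculating rates and increments over a time window, it linearly extrapolates from the samples inside the window to the window boundaries, so the result is an estimate rather than an exact count.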
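The effect of extrapolation can be sketched as follows. This is a deliberately simplified model: it scales the observed increase to the full window and leaves out the boundary heuristics and counter-reset handling that Prometheus applies, and all names are illustrative.

```java
// Simplified model of window extrapolation; omits Prometheus's boundary heuristics.
final class ExtrapolatedIncrease {

    /**
     * Estimates the increase over a whole window from the samples inside it.
     * timesSec and values describe the samples in ascending time order; at least
     * two samples with distinct timestamps and no counter resets are assumed.
     */
    static double increase(double[] timesSec, double[] values, double windowSec) {
        int n = values.length;
        if (n < 2) {
            return Double.NaN; // a single sample carries no slope information
        }
        double sampledIncrease = values[n - 1] - values[0];
        double coveredSec = timesSec[n - 1] - timesSec[0];
        // Stretch the observed slope over the full window length.
        return sampledIncrease * windowSec / coveredSec;
    }

    public static void main(String[] args) {
        // Four samples cover 45 s of a 60 s window; the raw difference is 9.
        double[] times = {5, 20, 35, 50};
        double[] values = {10, 13, 16, 19};
        double estimated = increase(times, values, 60);
        System.out.println(estimated);      // 12.0
        System.out.println(estimated / 60); // 0.2 per second
    }
}
```

This is why an "increase" panel can show a value that the counter never actually stepped through, or a fractional value for an integer counter.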

[Figure: Linear extrapolation]

Architecture Design

Remote Storage

Prometheus's built-in TSDB runs on a single node and cannot meet our long-term storage needs, so we adopted M3DB, an open-source time-series database from Uber, as remote storage. The data flow passes through the following components:

M3 Coordinator: stateless bridge between Prometheus and M3DB.

M3DB: distributed TSDB providing scalable storage.

M3 Query: Prometheus-compatible query engine.

M3 Aggregator: aggregates metrics for down-sampling, guaranteeing that each aggregation is emitted at least once, and persists the results.

[Figure: M3DB architecture]

Official Deployment Flow

Services register their addresses in a service registry; Prometheus discovers them, scrapes metrics, and forwards data to M3DB via the remote‑write protocol.

[Figure: Official deployment diagram]

Client Redesign

We built a lightweight client that pushes metrics directly to M3DB over the Prometheus remote-write protocol (Protobuf over HTTP), asynchronously and in batches, so business services no longer need a Prometheus server to scrape them.
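The asynchronous, batch-oriented part of such a client might look roughly like the sketch below. It is not the authors' implementation: the class name, the bounded queue, and the drop-on-overflow policy are assumptions, and samples are plain strings standing in for real time-series samples. A real sender would encode each batch as a Protobuf remote-write request and POST it over HTTP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the buffering half of an async, batching metrics client.
final class BatchingBuffer {
    private final BlockingQueue<String> queue;
    private final int batchSize;

    BatchingBuffer(int capacity, int batchSize) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.batchSize = batchSize;
    }

    /**
     * Called on the business thread. Never blocks: when the buffer is full the
     * sample is dropped and false is returned (an assumed overflow policy).
     */
    boolean offer(String sample) {
        return queue.offer(sample);
    }

    /** Called by a background sender thread; removes and returns up to batchSize samples. */
    List<String> drainBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        BatchingBuffer buffer = new BatchingBuffer(4, 3);
        int dropped = 0;
        for (int i = 0; i < 5; i++) {
            if (!buffer.offer("sample-" + i)) {
                dropped++;
            }
        }
        System.out.println(dropped);                    // 1
        System.out.println(buffer.drainBatch().size()); // 3
        System.out.println(buffer.drainBatch().size()); // 1
    }
}
```

Keeping offer() non-blocking means a slow or unavailable storage backend costs the business thread at most a dropped sample, never a stalled request.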

[Figure: Client design]

Final Architecture

Business services push metrics via the new client; middleware continues to use exporters. Grafana visualizes both streams.

[Figure: Final system diagram]

Performance Testing

In stress tests, we found no throughput bottleneck (tens of millions of operations per second), latency in the nanosecond range, and memory usage that grows in proportion to the number of labels.

[Figure: Performance chart]

Implementation Details

Grafana Planning

We use one unified Grafana across all environments (online, sandbox, test) and organize dashboards into four dimensions: Business Overview, Service-level, Architecture-Component, and Operations-Component.

[Figure: Three-dimensional monitoring]

Data Integration

Login goes through Enterprise WeChat QR-code authentication: Nginx enforces the login and injects a header identifying the user, and Grafana trusts this header through its Auth Proxy feature.

[Figure: Authentication flow]

Automatic Panel Generation

Based on metric type, the system auto‑creates Grafana panels:

Counter → QPS, Increment, Interval Increment.

Gauge → Raw value (e.g., 15‑second points).

Histogram → Average, TP99, QPS, Increment, Interval Increment, Distribution table, Distribution line chart, Heatmap.

{
    "panels": [
        {"title": "Business Metrics", "panels": []},
        {"title": "JVM", "panels": []},
        {"title": "Log Monitoring", "panels": []}
    ]
}

[Figure: Counter visualization]
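One way to picture type-driven panel generation is a lookup from metric type to panel titles and PromQL expressions, as in the sketch below. The class, the 1m window, and the exact expressions are assumptions for illustration, not the templates the platform actually uses, and only a subset of the listed panels is covered.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive panel titles and PromQL from the metric type.
final class PanelTemplates {

    enum MetricType { COUNTER, GAUGE, HISTOGRAM }

    /** Returns panel title -> PromQL expression, in display order. */
    static Map<String, String> panelsFor(MetricType type, String metric) {
        Map<String, String> panels = new LinkedHashMap<>();
        switch (type) {
            case COUNTER:
                panels.put("QPS", "sum(rate(" + metric + "[1m]))");
                panels.put("Increment", "sum(increase(" + metric + "[1m]))");
                break;
            case GAUGE:
                // Gauges are shown as-is, without rate or increase.
                panels.put("Raw value", metric);
                break;
            case HISTOGRAM:
                panels.put("Average", "sum(rate(" + metric + "_sum[1m])) / sum(rate("
                        + metric + "_count[1m]))");
                panels.put("TP99", "histogram_quantile(0.99, sum(rate("
                        + metric + "_bucket[1m])) by (le))");
                panels.put("QPS", "sum(rate(" + metric + "_count[1m]))");
                break;
        }
        return panels;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> p
                : panelsFor(MetricType.HISTOGRAM, "http_request_cost").entrySet()) {
            System.out.println(p.getKey() + ": " + p.getValue());
        }
    }
}
```

Each generated entry would then be wrapped in a Grafana panel definition and placed under the matching row of the dashboard JSON.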

Alerting System

Grafana’s built‑in alerting (pre‑8.0) had usability and performance issues. We built a custom alert service inspired by Nightingale: user‑defined alerts are stored as PromQL in MySQL, scheduled via XXL‑Job shards, and evaluated against M3DB.
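Sharded scheduling can be sketched as each executor picking only the rules that belong to its shard. The modulo assignment below is an assumption about how rules might be split, not a description of the actual service; shardIndex and shardTotal stand for the values a sharded XXL-Job run hands to each executor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of splitting alert rules across job shards.
final class AlertSharding {

    /** Returns the rule ids this shard should evaluate, using id modulo shardTotal. */
    static List<Long> rulesForShard(List<Long> ruleIds, int shardIndex, int shardTotal) {
        List<Long> mine = new ArrayList<>();
        for (long id : ruleIds) {
            if (Math.floorMod(id, (long) shardTotal) == shardIndex) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<Long> ruleIds = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L);
        System.out.println(rulesForShard(ruleIds, 0, 3)); // [3, 6]
        System.out.println(rulesForShard(ruleIds, 1, 3)); // [1, 4, 7]
        System.out.println(rulesForShard(ruleIds, 2, 3)); // [2, 5]
    }
}
```

Every rule lands in exactly one shard, so adding executors spreads the evaluation load without evaluating any rule twice.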

[Figure: Alert workflow]

For custom business metrics, users simply set a threshold; the system generates the corresponding PromQL automatically. Built‑in middleware metrics require only a threshold value.
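Generating PromQL from a threshold could be as simple as filling in a template. The sketch below assumes a QPS-style rule on a counter; the method name, parameters, and template are illustrative and not taken from the alert service itself.

```java
// Hypothetical sketch: turn a user-supplied threshold into a PromQL alert expression.
final class ThresholdRule {

    /**
     * Builds an expression that returns series only while the per-second rate of
     * the metric, summed over all label values, violates the threshold.
     */
    static String toPromQl(String metric, String window, String comparator, double threshold) {
        return "sum(rate(" + metric + "[" + window + "])) " + comparator + " " + threshold;
    }

    public static void main(String[] args) {
        System.out.println(toPromQl("http_request", "1m", ">", 100));
        // sum(rate(http_request[1m])) > 100.0
    }
}
```

The scheduler would store the resulting string and treat a non-empty query result as a firing alert.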

[Figure: Alert UI]

Conclusion

By building on open-source components and extending them for our needs, we delivered a unified, three-dimensional monitoring platform. It simplifies metric collection (push + pull), auto-generates dashboards, provides a custom alerting service, and integrates with internal identity and service systems. It has been positively received across the company.
