
How We Built a Scalable 3‑Layer Monitoring Platform with Prometheus, M3DB, and Grafana

This article details the design and implementation of a three‑dimensional monitoring system that replaces an outdated custom solution with Prometheus, M3DB remote storage, and Grafana, covering data model choices, metric types, architecture, performance testing, automatic dashboard generation, and a custom alerting service.

dbaplus Community

Oct 16, 2022

Background

ZZMonitor, our early in-house monitoring system, supported only four aggregation functions (SUM, MAX, MIN, AVG) and stored its data across 128 MySQL tables with a maximum retention of seven days. The result was limited functionality, an inflexible API, a poor architecture, and a high maintenance cost.

Investigation and Selection

We evaluated Cat, Nightingale, and Prometheus. Cat excels at tracing, Nightingale resembles a simplified Prometheus+Grafana, while Prometheus offers a flexible query language (PromQL) and a rich exporter ecosystem. We selected Prometheus as the core monitoring engine.

Prometheus Capabilities

1. Ecosystem

Prometheus ships with a single‑node TSDB and uses a pull model to scrape metrics exposed over HTTP. Exporters exist for most middleware, enabling rapid monitoring setup. Grafana provides powerful visualization.

2. Counter

A monotonically increasing metric used for request counts, GC cycles, etc. Prometheus stores the current total; QPS and increments are derived from differences.

import io.prometheus.client.Counter;

Counter counter = Counter.build().name("upload_picture_total").help("Number of uploaded pictures").register();
counter.inc();

When a server restarts, the counter resets to zero. PromQL functions such as rate() and increase() detect the drop at query time and add the pre-reset value back, so increments stay correct across restarts.
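The reset handling described above can be illustrated with a small sketch. It is not Prometheus's actual implementation (which also extrapolates to the window edges); the class and method names are hypothetical, and it only shows the idea of treating any drop between two samples as a restart from zero.

```java
final class CounterIncrease {
    /**
     * Total increase across consecutive counter samples.
     * Any drop between two samples is treated as a restart from zero.
     */
    static double increase(double[] samples) {
        double total = 0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double cur = samples[i];
            if (cur >= prev) {
                total += cur - prev; // normal case: plain difference
            } else {
                total += cur;        // reset: everything counted since the restart is new
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // The process restarts between the second and third scrape.
        double[] samples = {100, 130, 5, 25};
        System.out.println(increase(samples)); // prints 55.0
    }
}
```

Increments that happen after the last scrape but before the restart are never observed, so they cannot be recovered by this kind of compensation.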

3. Gauge

A metric that can go up or down, suitable for memory usage, active threads, etc. Grafana typically displays the raw value.

import io.prometheus.client.Gauge;

Gauge gauge = Gauge.build().name("active_thread_num").help("Number of active threads").register();
gauge.set(20);

4. Histogram

Used for distribution statistics. Buckets must be defined; each observation increments the appropriate bucket and updates sum and count.

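import io.prometheus.client.Histogram;

Histogram histogram = Histogram.build().name("http_request_cost").help("HTTP request latency").buckets(10, 20, 30, 40).register();
histogram.observe(20);

// Simplified view of what observe() does inside the client.
public void observe(double value) {
    for (int i = 0; i < bucket.length; ++i) {
        // Only the first bucket whose upper bound (le) covers the value is incremented;
        // the cumulative "le" counts are computed when the metrics are exported.
        if (value <= bucket[i].le) {
            bucket[i].add(1);
            break;
        }
    }
    sum.add(value);
}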

Aggregating bucket counts over a time window yields distribution charts, TP values, and average latency.
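As a rough illustration of how TP values and average latency fall out of bucket data, the sketch below estimates a quantile by linear interpolation inside the matching bucket, which is the same general idea PromQL's histogram_quantile() uses. The class name, method names, and sample numbers are hypothetical, and it assumes every observation falls inside a finite bucket (no +Inf handling).

```java
final class HistogramStats {
    /**
     * Estimates a quantile from per-bucket counts.
     * upperBounds is sorted ascending; counts[i] is the number of observations
     * that fell into the bucket ending at upperBounds[i] within the time window.
     */
    static double quantile(double q, double[] upperBounds, long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            long before = cumulative;
            cumulative += counts[i];
            if (counts[i] > 0 && cumulative >= rank) {
                double lower = (i == 0) ? 0 : upperBounds[i - 1];
                // Assume observations are spread evenly inside the bucket.
                double fraction = (rank - before) / counts[i];
                return lower + (upperBounds[i] - lower) * fraction;
            }
        }
        return upperBounds[upperBounds.length - 1];
    }

    /** Average latency is simply the summed values divided by the observation count. */
    static double average(double sum, long count) {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        double[] bounds = {10, 20, 30, 40};
        long[] counts = {50, 30, 15, 5};
        System.out.println(quantile(0.99, bounds, counts));
        System.out.println(average(1500, 100));
    }
}
```

Because the estimate interpolates within a bucket, its accuracy depends on how well the bucket boundaries match the real latency distribution.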

5. Multi‑dimensional Labels

Metrics can carry arbitrary label sets, enabling flexible queries and aggregations.

import io.prometheus.client.Counter;

Counter counter = Counter.build().name("http_request").labelNames("method", "uri").help("Number of HTTP requests").register();
counter.labels("POST", "/addGoods").inc();
counter.labels("GET", "/getGoods").inc();
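To make the data model concrete, here is a minimal sketch of what labels imply: every distinct combination of label values is its own time series, and aggregation is a sum over the series that match. This is not the Prometheus client's code; LabeledCounter and its methods are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class LabeledCounter {
    private final List<String> labelNames;
    // One independent counter per distinct combination of label values.
    private final Map<List<String>, LongAdder> series = new ConcurrentHashMap<>();

    LabeledCounter(String... labelNames) {
        this.labelNames = List.of(labelNames);
    }

    void inc(String... labelValues) {
        if (labelValues.length != labelNames.size()) {
            throw new IllegalArgumentException("expected " + labelNames.size() + " label values");
        }
        series.computeIfAbsent(List.of(labelValues), k -> new LongAdder()).increment();
    }

    long get(String... labelValues) {
        LongAdder adder = series.get(List.of(labelValues));
        return adder == null ? 0 : adder.sum();
    }

    /** Sums every series whose value for labelName equals labelValue. */
    long sumWhere(String labelName, String labelValue) {
        int index = labelNames.indexOf(labelName);
        if (index < 0) {
            throw new IllegalArgumentException("unknown label: " + labelName);
        }
        long total = 0;
        for (Map.Entry<List<String>, LongAdder> e : series.entrySet()) {
            if (e.getKey().get(index).equals(labelValue)) {
                total += e.getValue().sum();
            }
        }
        return total;
    }

    int seriesCount() {
        return series.size();
    }

    public static void main(String[] args) {
        LabeledCounter requests = new LabeledCounter("method", "uri");
        requests.inc("POST", "/addGoods");
        requests.inc("GET", "/getGoods");
        requests.inc("GET", "/getGoods");
        System.out.println(requests.sumWhere("method", "GET")); // prints 2
        System.out.println(requests.seriesCount());             // prints 2
    }
}
```

The series count grows with the number of label value combinations, which is why high-cardinality labels drive up memory usage.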

6. Accuracy Trade‑off

Prometheus deliberately sacrifices a small amount of data precision for higher reliability and simpler operations. For example, rate() and increase() extrapolate linearly to the edges of the query window, so an increase computed over integer counters can come out fractional.

Architecture Design

Remote Storage

Prometheus's built-in TSDB cannot meet long-term storage needs, so we adopted M3DB (a time-series database open-sourced by Uber) as remote storage. The data flow passes through a set of components:

M3 Coordinator: stateless bridge between Prometheus and M3DB.

M3DB: distributed TSDB providing scalable storage.

M3 Query: Prometheus-compatible query engine.

M3 Aggregator: guarantees that metrics are aggregated at least once and persists the results, which is what makes down-sampling possible.

Official Deployment Flow

Services register their addresses in a service registry; Prometheus discovers them, scrapes metrics, and forwards data to M3DB via the remote‑write protocol.

Client Redesign

We built a lightweight client that pushes metrics directly to M3DB using Prometheus remote‑write (ProtoBuf + HTTP) in an asynchronous, batch‑oriented manner, eliminating the need for a pull‑only Prometheus server for business services.
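The asynchronous, batch-oriented design can be sketched as follows. This is not the actual client: the class, the Sample type, and the sender hook are hypothetical, and the real transport step (ProtoBuf encoding plus an HTTP POST in remote-write format) is deliberately left behind a callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

final class BatchingMetricClient {
    /** One data point; a real remote-write sample would carry a full label set. */
    static final class Sample {
        final String series;
        final long timestampMs;
        final double value;

        Sample(String series, long timestampMs, double value) {
            this.series = series;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    private final BlockingQueue<Sample> buffer;
    private final int batchSize;
    private final Consumer<List<Sample>> sender; // hypothetical transport hook

    BatchingMetricClient(int capacity, int batchSize, Consumer<List<Sample>> sender) {
        this.buffer = new LinkedBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Never blocks the business thread; returns false if the buffer is full and the sample is dropped. */
    boolean record(String series, double value) {
        return buffer.offer(new Sample(series, System.currentTimeMillis(), value));
    }

    /** Drains the buffer in batches of at most batchSize; returns the number of samples handed to the sender. */
    int flush() {
        int sent = 0;
        while (true) {
            List<Sample> batch = new ArrayList<>(batchSize);
            if (buffer.drainTo(batch, batchSize) == 0) {
                return sent;
            }
            sender.accept(batch);
            sent += batch.size();
        }
    }

    public static void main(String[] args) {
        BatchingMetricClient client = new BatchingMetricClient(1000, 2,
                batch -> System.out.println("sending " + batch.size() + " samples"));
        for (int i = 0; i < 5; i++) {
            client.record("upload_picture_total", i);
        }
        client.flush(); // sends batches of 2, 2 and 1
    }
}
```

In a real client, flush() would typically run on a background thread, for example on a fixed delay via a ScheduledExecutorService, so that request threads only pay for an in-memory enqueue.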

Final Architecture

Business services push metrics via the new client; middleware continues to use exporters. Grafana visualizes both streams.

Performance Testing

Stress tests show no QPS bottleneck (tens of millions per second), nanosecond‑level latency, and memory usage proportional to label count.

Implementation Details

Grafana Planning

We run a single, unified Grafana across all environments (online, sandbox, test), with dashboards organized into four dimensions: Business Overview, Service-level, Architecture-Component, and Operations-Component.

Data Integration

Enterprise WeChat QR‑code authentication is enforced via Nginx injecting a user header; Grafana trusts this header (Auth Proxy).

Automatic Panel Generation

Based on metric type, the system auto‑creates Grafana panels:

Counter → QPS, Increment, Interval Increment.

Gauge → Raw value (e.g., 15‑second points).

Histogram → Average, TP99, QPS, Increment, Interval Increment, Distribution table, Distribution line chart, Heatmap.

{
    "panels": [
        {"title": "Business Metrics", "panels": []},
        {"title": "JVM", "panels": []},
        {"title": "Log Monitoring", "panels": []}
    ]
}
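One plausible way to drive this generation is to map each metric type to a set of PromQL expressions and then wrap each expression in a panel definition. The sketch below covers only a subset of the panels listed above; the class name, panel titles, and the choice of a caller-supplied window are assumptions, while the PromQL functions used (rate, increase, histogram_quantile) are standard.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PanelQueries {
    enum MetricType { COUNTER, GAUGE, HISTOGRAM }

    /** Returns panel title -> PromQL expression for one metric, in display order. */
    static Map<String, String> forMetric(String name, MetricType type, String window) {
        Map<String, String> panels = new LinkedHashMap<>();
        String range = "[" + window + "]";
        switch (type) {
            case COUNTER:
                panels.put("QPS", "sum(rate(" + name + range + "))");
                panels.put("Increment", "sum(increase(" + name + range + "))");
                break;
            case GAUGE:
                // Gauges are plotted as stored, without any rate calculation.
                panels.put("Raw value", name);
                break;
            case HISTOGRAM:
                panels.put("Average",
                        "sum(rate(" + name + "_sum" + range + ")) / sum(rate(" + name + "_count" + range + "))");
                panels.put("TP99",
                        "histogram_quantile(0.99, sum by (le) (rate(" + name + "_bucket" + range + ")))");
                panels.put("QPS", "sum(rate(" + name + "_count" + range + "))");
                break;
        }
        return panels;
    }

    public static void main(String[] args) {
        forMetric("http_request_cost", MetricType.HISTOGRAM, "1m")
                .forEach((title, expr) -> System.out.println(title + ": " + expr));
    }
}
```

Each generated expression would then become the query of one panel in a dashboard row.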

Alerting System

Grafana’s built‑in alerting (pre‑8.0) had usability and performance issues. We built a custom alert service inspired by Nightingale: user‑defined alerts are stored as PromQL in MySQL, scheduled via XXL‑Job shards, and evaluated against M3DB.

For custom business metrics, users simply set a threshold; the system generates the corresponding PromQL automatically. Built‑in middleware metrics require only a threshold value.
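A threshold rule of this kind could be turned into PromQL and split across scheduler shards roughly as follows. The class and field names are hypothetical, and the sharding scheme (rule id modulo shard total) is an assumption about how work might be divided when each job instance is handed a shard index and a shard total.

```java
final class ThresholdAlert {
    final long id;
    final String metricExpr;
    final String operator;
    final double threshold;

    ThresholdAlert(long id, String metricExpr, String operator, double threshold) {
        if (!operator.matches(">|>=|<|<=|==|!=")) {
            throw new IllegalArgumentException("unsupported operator: " + operator);
        }
        this.id = id;
        this.metricExpr = metricExpr;
        this.operator = operator;
        this.threshold = threshold;
    }

    /** A PromQL comparison acts as a filter: the query returns series only while the condition holds. */
    String toPromQl() {
        return "(" + metricExpr + ") " + operator + " " + threshold;
    }

    /** True if this rule should be evaluated by the given shard. */
    boolean ownedBy(int shardIndex, int shardTotal) {
        return Math.floorMod(id, (long) shardTotal) == shardIndex;
    }

    public static void main(String[] args) {
        ThresholdAlert alert = new ThresholdAlert(7, "sum(rate(http_request[1m]))", ">", 500);
        System.out.println(alert.toPromQl());    // prints (sum(rate(http_request[1m]))) > 500.0
        System.out.println(alert.ownedBy(1, 3)); // prints true
    }
}
```

With this shape, the evaluator only needs to run the stored expression and check whether the result is non-empty to decide if the alert fires.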

Conclusion

By leveraging open‑source components and extending them for our specific needs, we delivered a unified, three‑dimensional monitoring platform that simplifies metric collection (push + pull), auto‑generates dashboards, provides a custom alerting service, integrates with internal identity and service systems, and has been positively received across the company.
