Design and Implementation of a Multi‑Dimensional Monitoring Platform Based on Prometheus and M3DB

This article details the background, research, architecture, performance testing, and deployment of a comprehensive monitoring system that leverages Prometheus, Grafana, and M3DB to provide flexible metric collection, automatic dashboard generation, and a custom alerting service for large‑scale business services.

Zhuanzhuan Tech

Background: the early zzmonitor system supported only four aggregation functions (SUM, MAX, MIN, AVG) and stored its data in MySQL with limited retention. The result was functional gaps, an inflexible API, poor time‑series storage performance, and high maintenance costs.

Research & selection: after evaluating Cat, Nightingale, and Prometheus, the team chose Prometheus for its flexible PromQL, rich exporter ecosystem, and active community.

Prometheus capabilities: a built‑in single‑node TSDB with pull‑based metric collection, support for Counter, Gauge, Histogram, and multi‑dimensional labels, and a design that tolerates minor data errors through linear extrapolation.
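To make the extrapolation point concrete, here is a minimal sketch of scaling a counter's observed increase up to the full query window. It is a simplification: Prometheus applies additional bounds on how far it extrapolates and also handles counter resets, both of which are omitted. The class and method names are hypothetical.

```java
// Simplified illustration of linear extrapolation over a query window.
// Not Prometheus's exact algorithm: extrapolation limits and counter resets are ignored.
final class ExtrapolationSketch {

    /**
     * Scales the increase seen between the first and last sample inside the
     * window to the length of the whole window.
     *
     * @param firstTs       timestamp (seconds) of the first sample in the window
     * @param firstVal      counter value at firstTs
     * @param lastTs        timestamp (seconds) of the last sample in the window
     * @param lastVal       counter value at lastTs
     * @param windowSeconds length of the query window in seconds
     */
    static double extrapolatedIncrease(double firstTs, double firstVal,
                                       double lastTs, double lastVal,
                                       double windowSeconds) {
        double sampledInterval = lastTs - firstTs;
        if (sampledInterval <= 0) {
            throw new IllegalArgumentException("need two samples with increasing timestamps");
        }
        double observedIncrease = lastVal - firstVal; // assumes no counter reset
        return observedIncrease * (windowSeconds / sampledInterval);
    }
}
```

With samples 50 seconds apart inside a 60 second window, an observed increase of 3 is reported as roughly 3.6, which is why an increase over an integer counter can come back as a non-integer.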

Architecture design: remote storage is built on M3 (M3 Coordinator, M3DB, M3 Query, and M3 Aggregator). The client speaks the Prometheus remote‑write protocol (Protobuf over HTTP) and pushes metrics asynchronously, in batches, directly into the M3 cluster, so no separate Prometheus server is needed.
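The asynchronous, batched push can be sketched as a bounded buffer that business threads write to without blocking and that a background sender drains in fixed-size batches. This is an assumption about how such a client might be structured, not the team's actual implementation. `SampleBatcher` and `Sample` are hypothetical names, and the remote-write encoding (a Protobuf payload sent over HTTP) is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical buffer between business threads and a background sender.
final class SampleBatcher {

    // One data point; labels are omitted to keep the sketch short.
    static final class Sample {
        final String metric;
        final double value;
        final long timestampMs;

        Sample(String metric, double value, long timestampMs) {
            this.metric = metric;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    private final BlockingQueue<Sample> queue;
    private final int maxBatchSize;

    SampleBatcher(int capacity, int maxBatchSize) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.maxBatchSize = maxBatchSize;
    }

    // Called on the business thread. Never blocks; returns false when the
    // buffer is full and the sample is dropped.
    boolean offer(Sample sample) {
        return queue.offer(sample);
    }

    // Called by the sender thread. Removes up to maxBatchSize samples;
    // returns an empty list when nothing is pending.
    List<Sample> nextBatch() {
        List<Sample> batch = new ArrayList<>(maxBatchSize);
        queue.drainTo(batch, maxBatchSize);
        return batch;
    }
}
```

Dropping samples when the buffer is full keeps monitoring from slowing down business code, at the cost of losing data under sustained overload. Whether the real client makes the same trade-off is not stated in the source.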

Performance testing: throughput reaches tens of millions of operations per second (for example, 43 M QPS for a single‑threaded Counter), with latency in the 20‑40 ns range per operation. Memory usage grows with the number of labels (for example, 381 KB for a Histogram with 500 labels).
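As a sanity check on these figures, throughput and per-operation latency on a single thread are reciprocals of each other. The small helper below converts between them and also derives an average memory cost per label. It assumes that 1 KB means 1024 bytes and that "500 labels" means 500 distinct labeled entries. Neither assumption is confirmed by the source.

```java
// Back-of-the-envelope arithmetic for the reported benchmark figures.
final class BenchmarkMath {

    // Average time per operation, in nanoseconds, for a single-threaded throughput.
    static double nanosPerOp(double opsPerSecond) {
        return 1_000_000_000.0 / opsPerSecond;
    }

    // Average bytes per entry, assuming 1 KB = 1024 bytes.
    static double bytesPerEntry(double totalKb, int entries) {
        return totalKb * 1024.0 / entries;
    }
}
```

For 43 M operations per second this gives about 23 ns per operation, which falls inside the quoted 20‑40 ns range. Under the stated assumptions, 381 KB across 500 entries works out to roughly 780 bytes each.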

Implementation details: a single Grafana instance serves all environments; dashboards are generated from JSON templates; authentication goes through an Enterprise WeChat proxy; and automatic panel initialization creates a suitable visualization for each Counter, Gauge, and Histogram metric. The snippets below contrast the legacy ZMonitor API with the Prometheus‑style metric types.

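// Legacy zzmonitor API: only the four aggregation functions are available.
public void test() {
    long start = System.currentTimeMillis();
    // do something
    long cost = System.currentTimeMillis() - start;
    ZMonitor.sum("execution count", 1);
    ZMonitor.max("max elapsed time", cost);
    ZMonitor.min("min elapsed time", cost);
    ZMonitor.avg("average elapsed time", cost);
}

// Prometheus-style metric types: register once, then update.
// Counter: a total that only ever increases.
Counter counter = Counter.build().name("upload_picture_total").help("Number of uploaded pictures").register();
counter.inc();

// Gauge: a value that can go up or down.
Gauge gauge = Gauge.build().name("active_thread_num").help("Number of active threads").register();
gauge.set(20);

// Histogram: observations are counted into buckets with upper bounds 10, 20, 30, and 40.
Histogram histogram = Histogram.build().name("http_request_cost").help("HTTP request latency").buckets(10,20,30,40).register();
histogram.observe(20);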
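
For the automatic panel initialization described above, one plausible approach is to pick a default PromQL expression per metric type and substitute it into a panel template. The sketch below assumes a 1 minute rate window, a 0.99 quantile, and a heavily trimmed Grafana panel fragment. `PanelQueries` and its methods are hypothetical, and the source does not show the team's actual templates.

```java
// Hypothetical mapping from metric type to a default PromQL expression and panel JSON.
final class PanelQueries {

    enum MetricType { COUNTER, GAUGE, HISTOGRAM }

    // Chooses a default query for a newly registered metric.
    static String defaultQuery(MetricType type, String metricName) {
        switch (type) {
            case COUNTER:
                // Counters are usually graphed as a per-second rate.
                return "sum(rate(" + metricName + "[1m]))";
            case GAUGE:
                // Gauges are graphed as-is.
                return metricName;
            case HISTOGRAM:
                // Histograms expose <name>_bucket series keyed by the "le" label.
                return "histogram_quantile(0.99, sum(rate(" + metricName + "_bucket[1m])) by (le))";
            default:
                throw new IllegalArgumentException("unknown metric type: " + type);
        }
    }

    // Fills a minimal panel template; real dashboards carry many more fields.
    static String panelJson(String title, String expr) {
        String template = "{\"title\":\"%s\",\"type\":\"timeseries\",\"targets\":[{\"expr\":\"%s\"}]}";
        return String.format(template, escape(title), escape(expr));
    }

    // Escapes backslashes and double quotes for embedding in a JSON string.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Applied to the metrics above, `upload_picture_total` would get a rate panel, `active_thread_num` a plain value panel, and `http_request_cost` a quantile panel.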

Alerting system: a custom alert service generates PromQL statements stored in MySQL, schedules checks via XXL‑Job, and evaluates conditions against M3DB, allowing users to configure alerts with simple threshold inputs.
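
Turning a simple threshold input into PromQL could look like the following sketch. The rule fields (metric, window, comparator, threshold) and the choice of `sum(rate(...))` as the aggregation are assumptions made for illustration. The MySQL storage and XXL‑Job scheduling steps are not shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical translation of a user-entered threshold rule into PromQL.
final class AlertRuleSketch {

    private static final List<String> COMPARATORS =
            Arrays.asList(">", ">=", "<", "<=", "==", "!=");

    // Builds the expression that would be stored and later evaluated on a schedule.
    static String toPromQl(String metric, int windowMinutes, String comparator, double threshold) {
        if (!COMPARATORS.contains(comparator)) {
            throw new IllegalArgumentException("unsupported comparator: " + comparator);
        }
        if (windowMinutes <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return String.format(Locale.ROOT, "sum(rate(%s[%dm])) %s %s",
                metric, windowMinutes, comparator, threshold);
    }

    // Local check of a queried value against the same rule.
    static boolean breaches(double value, String comparator, double threshold) {
        switch (comparator) {
            case ">":  return value > threshold;
            case ">=": return value >= threshold;
            case "<":  return value < threshold;
            case "<=": return value <= threshold;
            case "==": return value == threshold;
            case "!=": return value != threshold;
            default:
                throw new IllegalArgumentException("unsupported comparator: " + comparator);
        }
    }
}
```

A rule such as "alert when uploads exceed 100 per second over 5 minutes" then becomes a single stored expression, and the user never has to write PromQL by hand.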

Final outcome: the platform provides business‑service, architecture‑component, and operations‑component dashboards, a unified monitoring view, low maintenance overhead, and extensibility through open‑source contributions, receiving positive feedback across business lines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

