How We Built a Scalable Multi‑Dimensional Monitoring Platform with Prometheus and M3DB
This article details the redesign of an internal monitoring system, explaining why the original zzmonitor fell short, how Prometheus and its ecosystem were selected, the architecture that integrates remote storage with M3DB, performance benchmarks, Grafana visualisation, and a custom alerting solution.
Background
zzmonitor, the early in‑house monitoring system, supported only four aggregation functions (SUM, MAX, MIN, AVG) and stored its data in MySQL across 128 tables, retaining it for at most seven days. As business volume grew, this design led to limited functionality, inflexible APIs, poor time‑series storage performance, and high maintenance cost.
Research and Selection
We evaluated three open‑source monitoring projects—Cat, Nightingale, and Prometheus—against criteria such as community activity, data reporting model, and storage options. Prometheus was chosen for its flexible pull model, rich exporter ecosystem, and active community.
Prometheus Capabilities
3.1 Ecosystem
Prometheus ships with a single‑node TSDB, pulls metrics via HTTP, and discovers targets through service registration. Its extensive exporter library allows rapid integration of most middleware.
3.2 Counter
Counters are monotonically increasing values (e.g., request count). They support QPS calculation and handle resets by adding pre‑reset values to post‑reset samples.
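The reset handling described above can be sketched in plain Java. This is a minimal illustration, not Prometheus' actual implementation: the class CounterMath and its methods are hypothetical names, and the boundary extrapolation that PromQL's rate() performs is deliberately left out.

```java
// Hypothetical helper (not part of any Prometheus library): derives the
// increase and per-second rate of a counter from raw scraped samples.
final class CounterMath {

    /** Total increase across the samples, treating any drop as a counter reset. */
    static double increase(double[] values) {
        double total = 0;
        for (int i = 1; i < values.length; i++) {
            double prev = values[i - 1];
            double curr = values[i];
            // After a reset the counter restarts from zero, so the whole
            // current value counts as new increments. This is equivalent to
            // adding the pre-reset value to every post-reset sample.
            total += (curr >= prev) ? curr - prev : curr;
        }
        return total;
    }

    /** Average per-second rate (QPS) over the time span covered by the samples. */
    static double ratePerSecond(long[] timestampsSeconds, double[] values) {
        if (values.length < 2) {
            return 0;
        }
        long span = timestampsSeconds[values.length - 1] - timestampsSeconds[0];
        return span <= 0 ? 0 : increase(values) / span;
    }
}
```

With samples 100, 160, 20, 80, the drop from 160 to 20 is read as a reset, so the increase is 60 + 20 + 60 rather than a negative number.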
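import io.prometheus.client.Counter;
// Register the counter once, then increment it on every upload.
Counter counter = Counter.build().name("upload_picture_total").help("Upload picture count").register();
counter.inc();
3.3 Gauge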
Gauges represent values that can go up or down (e.g., memory usage, active threads). They are displayed directly without further aggregation.
import io.prometheus.client.Gauge;
// Register the gauge once, then set it to the current value.
Gauge gauge = Gauge.build().name("active_thread_num").help("Active thread count").register();
gauge.set(20);
3.4 Histogram
Histograms bucket observations to show their distribution. Each bucket counts the observations at or below its upper bound (the counts are cumulative), while the sum and total count allow the average latency to be calculated. Example bucket configuration and observation:
import io.prometheus.client.Histogram;
// Upper bounds of the finite buckets; a +Inf bucket is added automatically.
Histogram histogram = Histogram.build().name("http_request_cost").help("HTTP request latency").buckets(10,20,30,40).register();
histogram.observe(20);
By querying bucket counts over a time window, we can derive percentile values such as TP99.
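To make the percentile derivation concrete, here is a minimal Java sketch of estimating a quantile from cumulative bucket counts by linear interpolation inside the matching bucket, which is the general approach PromQL's histogram_quantile() takes. The class HistogramMath is a hypothetical name, and the sketch assumes the lowest bucket starts at 0.

```java
// Hypothetical helper (not part of any Prometheus library).
final class HistogramMath {

    /**
     * Estimates quantile q (between 0 and 1) from cumulative bucket counts.
     * upperBounds must be ascending, contain at least one finite bound and
     * end with Double.POSITIVE_INFINITY; cumulativeCounts[i] is the number
     * of observations less than or equal to upperBounds[i].
     */
    static double quantile(double q, double[] upperBounds, long[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        long total = cumulativeCounts[last];
        if (last < 1 || total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < last && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == last) {
            // The rank falls into the +Inf bucket: the best available
            // answer is the highest finite bound.
            return upperBounds[last - 1];
        }
        double lower = (i == 0) ? 0 : upperBounds[i - 1];
        long below = (i == 0) ? 0 : cumulativeCounts[i - 1];
        long inBucket = cumulativeCounts[i] - below;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }

    /** Average observation, from the histogram's sum and total count. */
    static double average(double sum, long count) {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

Because the estimate assumes an even spread within a bucket, its precision depends on how well the bucket boundaries match the real latency distribution.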
3.5 Multi‑Dimensional Labels
Metrics can carry arbitrary label dimensions. For instance, a counter with method and uri labels allows aggregation per HTTP method or per endpoint.
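Aggregating per label can be pictured as grouping series by one label value and summing the rest, similar in spirit to a PromQL sum by (...) query. The Java sketch below is illustrative only; LabelAggregation and sumBy are hypothetical names, and each series is modelled simply as a list of label values mapped to its current count.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of any Prometheus library).
final class LabelAggregation {

    /**
     * Sums series values grouped by the label at position groupIndex.
     * Each key of the input map holds the label values of one series,
     * in the same order as the metric's label names.
     */
    static Map<String, Double> sumBy(Map<List<String>, Double> series, int groupIndex) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Map.Entry<List<String>, Double> entry : series.entrySet()) {
            String groupValue = entry.getKey().get(groupIndex);
            totals.merge(groupValue, entry.getValue(), Double::sum);
        }
        return totals;
    }
}
```

With label names (method, uri), groupIndex 0 yields totals per HTTP method and groupIndex 1 yields totals per endpoint.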
import io.prometheus.client.Counter;
// Label values are passed in the same order as the label names.
Counter counter = Counter.build().name("http_request").labelNames("method","uri").help("HTTP request count").register();
counter.labels("POST","/addGoods").inc();
counter.labels("GET","/getGoods").inc();
3.6 Accuracy Trade‑offs
Prometheus deliberately sacrifices a small amount of precision (e.g., linear extrapolation for missing samples) to gain reliability, simplicity, and lower operational overhead.
Architecture Design
4.1 Remote Storage
Prometheus' built‑in single‑node TSDB is insufficient for long‑term retention, so we adopted M3DB, a distributed time‑series database that can act as remote storage for Prometheus through the remote‑write protocol. M3DB provides high compression and scalability.
4.2 Official Path
The production architecture registers business services with a service registry; Prometheus discovers them, pulls metrics, and forwards them to M3DB via the remote‑write protocol.
4.3 Client Design
We built a lightweight client that implements the Prometheus remote‑write protocol (ProtoBuf + HTTP) and pushes metrics asynchronously in batches directly to M3DB, bypassing the pull step.
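The asynchronous batching idea can be sketched as a bounded queue that business threads write to without blocking, plus a flush step that drains it in fixed-size batches. Everything here is an assumption made for illustration: BatchingPusher, its parameters, and the use of String as a stand-in for an encoded sample are hypothetical, and the sender callback is where a real client would serialize the batch as a remote-write ProtoBuf message and POST it over HTTP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical sketch of the batching layer; not the actual client.
final class BatchingPusher {
    private final BlockingQueue<String> queue;
    private final int maxBatchSize;
    private final Consumer<List<String>> sender;

    BatchingPusher(int capacity, int maxBatchSize, Consumer<List<String>> sender) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.maxBatchSize = maxBatchSize;
        this.sender = sender;
    }

    /** Called on business threads: never blocks, returns false when the buffer is full. */
    boolean offer(String sample) {
        return queue.offer(sample);
    }

    /** Drains the buffer in batches and hands each batch to the sender. */
    int flush() {
        int batches = 0;
        List<String> batch = new ArrayList<>(maxBatchSize);
        while (queue.drainTo(batch, maxBatchSize) > 0) {
            // Copy the batch so the sender can keep it after we reuse the list.
            sender.accept(new ArrayList<>(batch));
            batch.clear();
            batches++;
        }
        return batches;
    }
}
```

In this sketch flush() would be invoked periodically from a single background thread, for example via a ScheduledExecutorService, so that a slow or unavailable storage backend never stalls request handling. Dropping samples when the buffer is full is one possible policy; the source does not say which one the real client uses.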
4.4 Final Architecture
Business services use the custom client to push metrics; middleware still uses exporters. This eliminates the need for a separate service registry and reduces the number of Prometheus instances.
4.5 Performance Test
Benchmarks show that the client can sustain tens of millions of operations per second with nanosecond‑level latency. Example results (single‑thread): Counter ≈ 43 M ops/s, Gauge ≈ 41 M ops/s, Histogram ≈ 26 M ops/s. Memory usage grows with label count (e.g., 500 labels ≈ 381 KB for Histogram).
Implementation
5.1 Grafana Planning
All environments share a single Grafana instance. Dashboards are organized into four dimensions: Business Overview, Business Services, Architecture Components, and Operations Components.
5.2 Data Integration
Sensitive business metrics are integrated with internal authentication systems (user, service, and permission services) to enforce access control.
5.3 Enterprise WeChat Authentication
Grafana is fronted by Nginx that injects a user header after successful WeChat QR‑code authentication, allowing seamless SSO for internal users.
5.4 Panel Auto‑Initialization
Dashboard JSON templates are pre‑defined; a single line of JSON inserts a new panel for a given metric type.
{"panels":[{"title":"Business Metrics","panels":[]},{"title":"JVM","panels":[]},{"title":"Log Monitoring","panels":[]}]}
5.5 Automatic Graph Generation
For each metric type the system automatically creates appropriate Grafana panels: Counter → QPS, Increment, Interval Increment; Gauge → Raw points; Histogram → Average, TP99, QPS, Increment, Interval Increment, Distribution, Line Chart, Heatmap.
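One way to picture this mapping is a function from metric type to a set of panel titles and PromQL expressions. The sketch below covers only a subset of the panels listed above, and the queries are common PromQL idioms chosen for illustration; the source does not show the expressions the platform actually generates. PanelQueries and panelsFor are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: maps a metric type to panel titles and PromQL queries.
final class PanelQueries {
    enum MetricType { COUNTER, GAUGE, HISTOGRAM }

    /**
     * Returns panel title -> PromQL expression for the given metric.
     * metric is the full series name; window is a PromQL range such as "1m".
     */
    static Map<String, String> panelsFor(MetricType type, String metric, String window) {
        String range = "[" + window + "]";
        Map<String, String> panels = new LinkedHashMap<>();
        switch (type) {
            case COUNTER:
                panels.put("QPS", "sum(rate(" + metric + range + "))");
                panels.put("Increment", "sum(increase(" + metric + range + "))");
                break;
            case GAUGE:
                // Gauges are plotted as raw points, without aggregation.
                panels.put("Raw points", metric);
                break;
            case HISTOGRAM:
                panels.put("Average", "sum(rate(" + metric + "_sum" + range + ")) / sum(rate("
                        + metric + "_count" + range + "))");
                panels.put("TP99", "histogram_quantile(0.99, sum(rate(" + metric + "_bucket"
                        + range + ")) by (le))");
                panels.put("QPS", "sum(rate(" + metric + "_count" + range + "))");
                break;
        }
        return panels;
    }
}
```

Each resulting expression would then be placed into the target of a pre-defined panel template.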
5.6 Template Room
A “template room” provides ready‑made panels that can be copied with minimal adjustments to cover most business monitoring needs.
Alerting System
6.1 Background
Grafana’s built‑in alerting (pre‑8.0) was limited, and the newer ngalert showed performance bottlenecks when handling thousands of alerts.
6.2 Design
We built a custom alert engine that stores generated PromQL statements in MySQL, schedules evaluation via xxl‑job shards, and queries M3DB to determine alert firing.
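A rough Java sketch of how sharded evaluation could work: each scheduler shard picks the rules whose ID maps to it, runs the stored PromQL, and compares the result with a threshold. The names AlertEvaluator, Rule, rulesForShard and firing are hypothetical, the modulo-based assignment and the single numeric threshold are assumptions, and the query function stands in for an HTTP call to the storage query endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of sharded alert evaluation; not the actual engine.
final class AlertEvaluator {

    /** One alert rule as it might be loaded from the rule table. */
    static final class Rule {
        final long id;
        final String promql;
        final double threshold;

        Rule(long id, String promql, double threshold) {
            this.id = id;
            this.promql = promql;
            this.threshold = threshold;
        }
    }

    /** Selects the rules owned by this shard (assumed rule: id modulo shard total). */
    static List<Rule> rulesForShard(List<Rule> all, int shardIndex, int shardTotal) {
        List<Rule> mine = new ArrayList<>();
        for (Rule rule : all) {
            if (Math.floorMod(rule.id, (long) shardTotal) == shardIndex) {
                mine.add(rule);
            }
        }
        return mine;
    }

    /** Runs each rule's query and returns the IDs of rules above their threshold. */
    static List<Long> firing(List<Rule> rules, ToDoubleFunction<String> query) {
        List<Long> result = new ArrayList<>();
        for (Rule rule : rules) {
            double value = query.applyAsDouble(rule.promql);
            if (value > rule.threshold) {
                result.add(rule.id);
            }
        }
        return result;
    }
}
```

Splitting rules by ID keeps every rule on exactly one shard, so adding shards spreads the query load without evaluating any rule twice.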
6.3 Result
Business users can configure alerts with a few clicks; the system automatically generates the required PromQL and evaluates it efficiently.
Final Effect
7.1 Business Service View
All‑in‑one dashboards show service‑level metrics such as request latency, QPS, and resource usage.
7.2 Architecture Component View
Component‑level panels monitor architecture components such as thread pools, logs, and Redis connections.
7.3 Operations Component View
Exporter‑based panels expose infrastructure health for Nginx, MySQL, and host machines.
7.4 Business Dashboard
A global business overview aggregates key indicators across all services.
Conclusion
By leveraging open‑source projects and extending them for internal needs, we delivered a unified, extensible monitoring platform that is simple to use, low‑maintenance, and continuously improvable. Since launch, the system has been adopted across all business lines and received positive feedback.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
