
Design and Implementation of an Integrated Monitoring System at Zhuanzhuan Using Prometheus, Grafana, and M3DB

This article describes how Zhuanzhuan consolidated ten separate legacy monitoring systems into a single, all-in-one observability platform built on Prometheus + Grafana. The team extended the Prometheus client to push metrics to M3DB, automated Grafana dashboard creation, and built a custom alerting service. The goals were lower operational complexity and better visibility across business, middleware, and infrastructure services.

Zhuanzhuan Tech

The monitoring system, often called the "third eye," is essential for operations, yet integrating multiple open‑source tools into a unified view for different roles is challenging.

Initially, Zhuanzhuan used a self-built tool, ZZMonitor, that collected JVM metrics and exposed four aggregation functions (SUM, MAX, MIN, AVG). Data was stored in MySQL tables with a 7-day retention period. The result was limited query flexibility, high maintenance cost, and scattered dashboards.

Other internal tools included an RPC-based service management platform, Docker-focused cloud monitoring, Open-Falcon, Nightingale, Redis monitoring, MySQL monitoring, Zabbix, and Prometheus for TiDB and Nginx. In total, the team was running ten separate monitoring systems.

After evaluating options, the team chose Prometheus + Grafana for the flexible PromQL query language, the rich exporter ecosystem, and powerful visualization, while acknowledging the stack's architectural complexity, learning curve, and alerting limitations.

To address these issues, the architecture was redesigned:

Prometheus pulls metrics from services that expose HTTP endpoints; services register their addresses in a CMDB for discovery.
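
The article does not say how CMDB entries reach Prometheus. One possible bridge is Prometheus's file-based service discovery, where a job periodically rewrites a JSON file of target groups. The sketch below renders that file under that assumption; `CmdbEntry`, the `service` label, and the addresses are hypothetical, and values are assumed to contain no characters that need JSON escaping.

```java
import java.util.List;
import java.util.stream.Collectors;

class CmdbFileSd {
    // Hypothetical CMDB record: a service name plus one scrape address ("host:port").
    record CmdbEntry(String service, String address) {}

    // Renders the entries as a file_sd document: a JSON list of target groups,
    // each with a "targets" array and a "labels" object.
    static String toFileSdJson(List<CmdbEntry> entries) {
        return entries.stream()
            .map(e -> String.format(
                "{\"targets\":[\"%s\"],\"labels\":{\"service\":\"%s\"}}",
                e.address(), e.service()))
            .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        List<CmdbEntry> entries = List.of(
            new CmdbEntry("order-service", "10.0.0.1:9090"),
            new CmdbEntry("order-service", "10.0.0.2:9090"));
        // The output would be written to a file listed under file_sd_configs.
        System.out.println(toFileSdJson(entries));
    }
}
```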

Business services push their metrics directly to M3DB (Uber's open-source time-series database) through the extended Prometheus client, using the Prometheus remote-write protocol. This removes the need for a separate Prometheus server for business services.

Grafana dashboards are generated automatically per service, with rows for JVM, logs, thread pools, DB pools, Redis, MQ, RPC, container, and host metrics.

RBAC is integrated with Zhuanzhuan's SSO (WeChat Enterprise) via Grafana Auth Proxy, so that only authorized users can view or edit a given dashboard.

Dashboard creation is automated: when a metric is defined, the SDK registers it with a target row and help text, and the system inserts the corresponding JSON panel into the Grafana dashboard.
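
As a rough illustration of the panel insertion step, the sketch below turns a metric definition into a Grafana panel JSON fragment, plus the row panel it belongs under. The `MetricDef` record, the choice of the `timeseries` panel type, and the `rate(...)` query are assumptions made for the example, not details from the article. Strings are assumed to need no JSON escaping.

```java
class PanelSketch {
    // Hypothetical mirror of what the SDK registers: metric name, target row, help text.
    record MetricDef(String name, String row, String help) {}

    // A Grafana row panel, used to group the panels that follow it.
    static String rowJson(String title) {
        return String.format("{\"type\":\"row\",\"title\":\"%s\"}", title);
    }

    // A single graph panel whose query is the given PromQL expression.
    static String panelJson(MetricDef m, String expr) {
        return String.format(
            "{\"type\":\"timeseries\",\"title\":\"%s\",\"description\":\"%s\","
                + "\"targets\":[{\"expr\":\"%s\"}]}",
            m.name(), m.help(), expr);
    }

    public static void main(String[] args) {
        MetricDef m = new MetricDef(
            "upload_picture_count", "Core Business Monitoring", "Number of uploaded pictures");
        System.out.println(rowJson(m.row()));
        // For a counter, a per-second rate is a common default query.
        System.out.println(panelJson(m, "rate(" + m.name() + "[1m])"));
    }
}
```

A real generator would also need to assign panel IDs and grid positions and to merge the fragment into the existing dashboard JSON before saving it through Grafana.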

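Metrics are declared through the extended client as follows, with one example per metric type. The .row(...) call is the SDK extension that names the dashboard row for the panel; it is not part of the stock Prometheus Java client.

// Counter: a value that only increases, such as a number of events.
Counter counter = Counter.build()
    .name("upload_picture_count")
    .row("Core Business Monitoring")
    .help("Number of uploaded pictures")
    .register();

// Gauge: a value that can go up and down, such as a current size.
Gauge gauge = Gauge.build()
    .name("active_thread_size")
    .row("Thread Pool Monitoring")
    .help("Active thread count")
    .register();

// Histogram: observations counted into buckets by upper bound.
Histogram histogram = Histogram.build()
    .name("age_distribution")
    .row("User Monitoring")
    .help("User age distribution")
    .buckets(10, 20, 30, 40, 50, 60, 70)
    .register();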
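
To make the histogram example more concrete, the following sketch shows how observations map onto Prometheus-style buckets. Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts everything. The class and method names are illustrative and are not part of the SDK.

```java
import java.util.Arrays;

class HistogramSketch {
    // Returns cumulative bucket counts for the given sorted upper bounds.
    // The result has one extra trailing slot for the implicit +Inf bucket.
    static long[] cumulativeCounts(double[] upperBounds, double[] observations) {
        long[] perBucket = new long[upperBounds.length + 1];
        for (double v : observations) {
            int i = 0;
            // Find the first bucket whose upper bound is >= v (the "le" semantics).
            while (i < upperBounds.length && v > upperBounds[i]) {
                i++;
            }
            perBucket[i]++;
        }
        long[] cumulative = new long[perBucket.length];
        long running = 0;
        for (int i = 0; i < perBucket.length; i++) {
            running += perBucket[i];
            cumulative[i] = running;
        }
        return cumulative;
    }

    public static void main(String[] args) {
        double[] bounds = {10, 20, 30, 40, 50, 60, 70};
        double[] ages = {5, 25, 25, 80};
        // prints [1, 1, 3, 3, 3, 3, 3, 4]
        System.out.println(Arrays.toString(cumulativeCounts(bounds, ages)));
    }
}
```

Because the counts are cumulative, the number of users aged over 20 and up to 30 is the difference between the `le="30"` and `le="20"` buckets, here 3 - 1 = 2.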

Alerting was initially handled by Grafana's ngalert, which required users to write PromQL by hand and suffered from performance issues at scale. The team then built a custom alert engine that stores alert definitions in MySQL, translates thresholds configured in the UI into PromQL, and evaluates the rules across shards, with support for deduplication, silencing, and alert history.
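
The sketch below illustrates three of the ideas above in miniature: translating a UI-configured threshold into PromQL, assigning a rule to an evaluator shard, and suppressing duplicate notifications. The `AlertRule` fields, the `service` label name, and the modulo sharding scheme are assumptions for illustration; the article does not describe the engine's actual schema or algorithms.

```java
import java.util.HashSet;
import java.util.Set;

class AlertRuleSketch {
    // Comparison operators a user could pick in the UI (hypothetical set).
    enum Op {
        GT(">"), LT("<");
        final String symbol;
        Op(String symbol) { this.symbol = symbol; }
    }

    // Hypothetical shape of one alert definition loaded from MySQL.
    record AlertRule(long id, String service, String metric,
                     String aggregation, Op op, double threshold) {}

    // Builds a PromQL expression that returns a result only while the rule is violated.
    static String toPromQl(AlertRule r) {
        return r.aggregation() + "(" + r.metric() + "{service=\"" + r.service() + "\"}) "
            + r.op().symbol + " " + r.threshold();
    }

    // Assigns the rule to one of shardCount evaluator instances by rule id.
    static int shardOf(AlertRule r, int shardCount) {
        return (int) Math.floorMod(r.id(), (long) shardCount);
    }

    // Fingerprints of alerts that are currently firing and already notified.
    private final Set<String> firing = new HashSet<>();

    // True only the first time a fingerprint is seen while it stays firing.
    boolean shouldNotify(String fingerprint) {
        return firing.add(fingerprint);
    }

    // Clears the fingerprint so a later recurrence notifies again.
    void resolve(String fingerprint) {
        firing.remove(fingerprint);
    }

    public static void main(String[] args) {
        AlertRule rule = new AlertRule(
            42L, "order-service", "active_thread_size", "max", Op.GT, 100);
        System.out.println(toPromQl(rule));
        System.out.println("shard " + shardOf(rule, 4));
    }
}
```

Sharding by rule id keeps every evaluation of a rule on the same instance, which lets the deduplication state stay local in this simplified version.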

The final system delivers:

Unified, low‑maintenance architecture without a separate service registry for business services.

Consistent dashboard style, automatic row creation, and per‑service permissions.

Zero‑dependency Prometheus client for business services that pushes directly to M3DB.

Simple, UI‑driven alert configuration that hides PromQL complexity.

Since launch, the platform has been adopted across all business modules, middleware SDKs, and infrastructure components, receiving positive feedback for its ease of use, rich feature set, and reduced operational overhead.
