Building an Integrated Monitoring Platform: Architecture, Implementation, and Lessons from ZhaiZhai
This article presents a detailed case study of how ZhaiZhai designed and implemented a unified monitoring platform that combines business services, middleware, and operations resources. The team selected Prometheus and M3DB, automated Grafana dashboard creation, built a low‑noise alerting system, and achieved large‑scale observability with significant cost and efficiency gains.
Facing dual demands of cost reduction and efficiency improvement, ZhaiZhai built a unified monitoring platform that integrates business services, middleware, and operations resources.
The platform consolidates numerous existing monitoring tools and per-component monitoring setups (Cat, Nightingale, Prometheus, TiDB, Redis, etc.) and addresses the problems they caused: a high learning cost, fragmented systems, and a heavy maintenance burden.
After evaluating options, the team chose to develop a new system from scratch, selecting Prometheus for metric collection, M3DB as remote storage, and etcd for metadata.
Key architectural decisions include a push model for business services, service discovery via a CMDB, and a retained pull mode for middleware: a single Prometheus server pulls middleware metrics, while business service clients push aggregated data to M3DB.
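As a rough illustration of CMDB-driven service discovery for the pull side, the sketch below turns instance records into target groups in Prometheus's file-based service discovery (`file_sd`) JSON format. The article does not say which discovery mechanism the team used, so `file_sd` is a swapped-in choice, and the CMDB record fields (`service`, `env`, `ip`, `metrics_port`, `status`) are hypothetical.

```python
import json
from collections import defaultdict


def build_file_sd(cmdb_records):
    """Group CMDB instance records into Prometheus file_sd target groups.

    The record fields used here are assumptions, not the article's CMDB schema.
    """
    groups = defaultdict(list)
    for rec in cmdb_records:
        if rec.get("status") != "online":
            continue  # skip instances the (hypothetical) CMDB marks as not serving
        key = (rec["service"], rec["env"])
        groups[key].append(f'{rec["ip"]}:{rec["metrics_port"]}')
    # file_sd expects a list of {"targets": [...], "labels": {...}} objects.
    return [
        {"targets": sorted(targets), "labels": {"service": service, "env": env}}
        for (service, env), targets in sorted(groups.items())
    ]


# Hypothetical CMDB export: two Redis exporters online, one MySQL exporter offline.
records = [
    {"service": "redis-cache", "env": "prod", "ip": "10.0.0.2",
     "metrics_port": 9121, "status": "online"},
    {"service": "redis-cache", "env": "prod", "ip": "10.0.0.1",
     "metrics_port": 9121, "status": "online"},
    {"service": "mysql-main", "env": "prod", "ip": "10.0.1.1",
     "metrics_port": 9104, "status": "offline"},
]
target_groups = build_file_sd(records)
print(json.dumps(target_groups, indent=2))
```

Writing this output to a file referenced by a `file_sd_configs` entry would let Prometheus pick up target changes without a restart; a real integration would also need a periodic sync from the CMDB.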
Dashboard automation was implemented by generating Grafana dashboards programmatically, supporting built‑in, business, and custom metrics, and providing role‑based access control and SSO integration.
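To make the idea of programmatic dashboard generation concrete, here is a minimal sketch that builds a request body for Grafana's dashboard HTTP API (`POST /api/dashboards/db`). It is not the team's generator: the panel templates, metric names, and the `svc-` UID scheme are all hypothetical, and only the built-in metrics case is shown.

```python
import json

# Hypothetical built-in panels: (title, PromQL template). Metric names are assumptions.
BUILTIN_PANELS = [
    ("QPS", 'sum(rate(http_requests_total{{service="{svc}"}}[1m]))'),
    ("Error rate", 'sum(rate(http_requests_total{{service="{svc}",status=~"5.."}}[1m]))'),
    ("P99 latency", 'histogram_quantile(0.99, sum by (le) '
                    '(rate(http_request_duration_seconds_bucket{{service="{svc}"}}[1m])))'),
]


def build_dashboard(service, panels_per_row=2):
    """Return a payload for Grafana's POST /api/dashboards/db endpoint."""
    width = 24 // panels_per_row  # Grafana's layout grid is 24 columns wide
    height = 8
    panels = []
    for i, (title, template) in enumerate(BUILTIN_PANELS):
        panels.append({
            "id": i + 1,
            "type": "timeseries",
            "title": title,
            "gridPos": {
                "h": height,
                "w": width,
                "x": (i % panels_per_row) * width,
                "y": (i // panels_per_row) * height,
            },
            "targets": [{"refId": "A", "expr": template.format(svc=service)}],
        })
    return {
        "dashboard": {
            "id": None,  # null id asks Grafana to create the dashboard
            "uid": f"svc-{service}",  # stable UID so reruns update the same dashboard
            "title": f"{service} overview",
            "tags": ["auto-generated"],
            "panels": panels,
        },
        "overwrite": True,
    }


payload = build_dashboard("order-api")
print(json.dumps(payload, indent=2))
```

Keeping the UID deterministic per service means regenerating a dashboard overwrites the previous version instead of creating a duplicate. Business and custom metrics could be appended to the same panel list, and access control would be handled separately through folder permissions.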
A custom alerting system was built to reduce noise, featuring grouping, hierarchical severity (P0‑P5), suppression, and multi‑channel notifications, achieving a 98% reduction in alert volume.
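The grouping and suppression ideas can be sketched in a few lines. The example below is an illustration under stated assumptions, not the platform's actual logic: it assumes P0 is the most severe level, groups alerts by service and rule within a fixed window, and suppresses a group when a more severe group for the same service started within that window. The `Alert` type, the window length, and the rule names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    rule: str
    severity: int  # 0 = P0 ... 5 = P5; assumes P0 is the most severe
    ts: float      # event time in seconds


def reduce_noise(alerts, window=300):
    """Collapse raw alerts into notifications via grouping, then suppression."""
    # Grouping: alerts with the same (service, rule) inside the window become one group.
    groups, open_groups = [], {}
    for a in sorted(alerts, key=lambda a: a.ts):
        key = (a.service, a.rule)
        g = open_groups.get(key)
        if g is not None and a.ts - g["first_ts"] <= window:
            g["count"] += 1
            g["severity"] = min(g["severity"], a.severity)  # keep the worst level seen
        else:
            g = {"service": a.service, "rule": a.rule,
                 "severity": a.severity, "first_ts": a.ts, "count": 1}
            open_groups[key] = g
            groups.append(g)
    # Suppression: drop a group if a more severe group on the same service is nearby in time.
    return [
        g for g in groups
        if not any(
            o is not g
            and o["service"] == g["service"]
            and o["severity"] < g["severity"]
            and abs(o["first_ts"] - g["first_ts"]) <= window
            for o in groups
        )
    ]


raw = [
    Alert("order-api", "error_rate", 1, 0), Alert("order-api", "error_rate", 1, 30),
    Alert("order-api", "error_rate", 1, 60), Alert("order-api", "latency", 3, 10),
    Alert("order-api", "latency", 3, 40), Alert("pay-api", "latency", 3, 20),
]
notifications = reduce_noise(raw)
for n in notifications:
    print(f'P{n["severity"]} {n["service"]}/{n["rule"]} x{n["count"]}')
```

In this toy input, six raw alerts become two notifications: the three `error_rate` alerts are merged, and the lower-severity `latency` group on the same service is suppressed. The numbers here are illustrative only and are unrelated to the 98% figure reported above.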
In production, the platform monitors over 1,000 services, 4,000 instances, and 1,000 physical machines, stores 13 TB of data, and handles 750 write QPS and 580,000 samples per second.
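For a sense of scale, the reported throughput figures can be combined with simple arithmetic. This assumes the 750 QPS figure counts write requests and the 580,000 figure counts individual samples ingested per second; the article does not define either term, so the derived values are only indicative.

```python
# Figures reported in the article.
WRITE_QPS = 750
SAMPLES_PER_SECOND = 580_000
SECONDS_PER_DAY = 86_400

# Assumption: each write request carries a batch of samples.
samples_per_write = SAMPLES_PER_SECOND / WRITE_QPS
samples_per_day = SAMPLES_PER_SECOND * SECONDS_PER_DAY

print(f"average samples per write request: {samples_per_write:.0f}")
print(f"samples ingested per day: {samples_per_day:,}")
```

Under that reading, each write carries several hundred samples on average, which would be consistent with clients batching aggregated data before pushing it.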
The result is a scalable, open‑source‑friendly monitoring solution that improves stability, efficiency, and response capability for the business.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.