From Zero to Scalable Monitoring: Lessons from Building a 200‑Service Platform
Over two years, we built a monitoring system covering 200+ services and 700+ instances, evolving from ad‑hoc Nginx logs to a Prometheus‑based observability platform with unified dashboards, automated alerts, and lessons on metric selection, alert fatigue, and root‑cause analysis.
Background
In the past two years we started from scratch to build a monitoring system for an entire business group. The system now monitors over 200 services and 700 instances, collecting tens of thousands of metrics. This article summarizes the journey, the pitfalls we fell into, and the lessons we learned.
Starting from Scratch
When the business first launched, there was no monitoring at all. We relied on Nginx metrics from the ingress layer, and only discovered failures after users reported them. Troubleshooting depended on log analysis, tracing from upstream services layer by layer. Each deployment felt like a drumbeat of anxiety.
Constantly Firefighting
After the first major version went live, we began integrating monitoring using the open-source Open-Falcon stack with Grafana dashboards. Based on our understanding of holistic monitoring, we tried to build multi-dimensional dashboards:
Service dimension: Provide client and server views covering status, performance, quality, and capacity, and use that framework to decide which metrics belong on the dashboards.
Business dimension: Focus on key business paths and build a monitoring tree for rapid issue localization.
Product dimension: Analyze key product metrics and construct shared dashboards.
At this stage we invested a lot of manpower in monitoring but achieved little, mainly because:
Bottom‑up dashboard construction required constant manual effort to fill metric gaps.
We focused more on service‑quality metrics than on product‑quality metrics, lacking sufficient understanding of the latter.
The limited capabilities of Open-Falcon + Grafana made dashboard and alert maintenance extremely labor-intensive.
Everyone had to learn the basics of monitoring, which raised the entry barrier.
Alert rules were not tightly coupled with business context, leading to frequent false alarms and missed critical alerts.
Consequently, we spent a lot of time maintaining dashboards and handling alerts without achieving the desired efficiency.
A Glimpse of the Future
After the first version stabilized, there were no major new requirements for a long period, prompting us to rethink our approach. Internally we started developing our own RPC framework, inspired by experience at WeChat, and turned our attention to data‑driven monitoring platforms such as Prometheus.
We adopted an SDK for data reporting, Prometheus for data collection, and Grafana for visualization, creating more flexible and convenient dashboards.
Service dimension: While building the RPC framework we embedded reporting directly into it and provided an SDK for other teams to integrate existing services. We then maintained two sets of service dashboards: a global overview for daily operations and detailed views for troubleshooting.
Business & Product dimension: The SDK offers a simple, unified reporting interface, making it easy to build dashboards for business and product metrics.
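A minimal sketch of what such a unified reporting interface can look like, built on the official `prometheus_client` Python library. The names (`report_rpc`, the metric names and labels) are illustrative assumptions, not the article's actual internal SDK:

```python
# Hypothetical sketch of a unified reporting interface like the one the
# article's internal SDK exposes. Metric and function names are illustrative.
from prometheus_client import Counter, Histogram, start_http_server

# One counter and one histogram cover the status/quality and performance views.
RPC_REQUESTS = Counter(
    "rpc_requests_total",
    "RPC requests by service, method, and result code",
    ["service", "method", "code"],
)
RPC_LATENCY = Histogram(
    "rpc_latency_seconds",
    "RPC latency by service and method",
    ["service", "method"],
)

def report_rpc(service: str, method: str, code: str, seconds: float) -> None:
    """Single entry point the RPC framework calls after each request."""
    RPC_REQUESTS.labels(service=service, method=method, code=code).inc()
    RPC_LATENCY.labels(service=service, method=method).observe(seconds)

def expose_metrics(port: int = 8000) -> None:
    """Expose /metrics on the given port for Prometheus to scrape."""
    start_http_server(port)
```

Because every service reports through one interface with consistent labels, the same Grafana dashboard templates can serve both the global overview and the per-service detail views.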
Over time the service-level dashboards converged into a unified view that we could iterate on, and as our understanding of monitoring deepened they became more intuitive to use.
For alerting we chained Prometheus (rule evaluation), Promgen (rule management), Alertmanager (alert routing and deduplication), webhook calls, and enterprise WeChat groups into a complete alert pipeline.
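A sketch of one link in that chain: a Prometheus alerting rule of the kind Promgen manages. The metric name, threshold, and durations are illustrative assumptions, not the article's actual configuration:

```yaml
# rules.yml - evaluated by Prometheus; Promgen generates files like this.
groups:
  - name: rpc-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(rpc_requests_total{code!="0"}[5m])) by (service)
            / sum(rate(rpc_requests_total[5m])) by (service) > 0.05
        for: 5m  # condition must hold for 5m, filtering short-lived spikes
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }} error rate above 5% for 5 minutes"
```

Alertmanager then groups and deduplicates firing alerts and forwards them via `webhook_configs` to a bridge service that posts into the enterprise WeChat group.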
Achieving Freedom
We frequently encountered the following problems in monitoring and alerting:
1. Threshold setting: Different business scenarios and metrics require careful calibration of alert thresholds.
2. Traffic fluctuation: Ideally the system should detect traffic patterns and automatically adjust thresholds.
3. Transient alerts: Short-lived issues appear intermittently, making it hard to decide whether to ignore them.
4. Information overload: Over-alerting floods inboxes, defeating the purpose of alerts.
5. Fault localization: Complex environments require alerts to contain rich context (time, location, error code, region, data center, service, interface) to aid root-cause analysis.
Our current solutions are:
For problems 1 and 2 we introduced anomaly‑detection algorithms into the monitoring platform.
Problem 3 is addressed with Prometheus' native capabilities, chiefly the `for` duration in alerting rules, which holds an alert back until its condition has persisted long enough to matter.
For problem 4 we tiered alert metrics, configuring only critical alerts at the top of the call chain and linking them to detailed dashboards for precise investigation.
Problem 5 remains unsolved, but we have a roadmap for future optimization.
Future Road
With Prometheus' data platform we plan to build a service call tree for all business services and automatically analyze error trends for root‑cause detection.
(Global dashboard view)
(Detailed dashboard view)
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and hope to grow with you throughout your operations career.