From Zero to Scalable Monitoring: Lessons from Building a 200‑Service Platform
Over two years, we grew a monitoring system covering 200+ services and 700+ instances, evolving from ad-hoc Nginx logs into a Prometheus-based observability platform with unified dashboards and automated alerts, and learned hard lessons about metric selection, alert fatigue, and fault isolation along the way.
Background
In the past two years we built a monitoring system from scratch for the entire business group. The system now monitors over 200 services and 700 instances, collecting tens of thousands of metrics. This article summarizes the journey, the pitfalls we encountered, and the lessons we learned.
Starting from Nothing
When the business first launched there was no monitoring at all. We relied on Nginx metrics and user feedback to discover failures, and used the logging system for troubleshooting, tracing problems upstream layer by layer through the call chain.
Exhausted
After the first major version went live we began integrating monitoring. The initial choice was the openfalcon platform built by a sibling team, using Grafana for dashboards, and we started creating multi‑dimensional views.
Service dimension: provide client- and server-side views, analyzing status, performance, quality, and capacity to decide which metrics belong on dashboards.
Business dimension: focus on key business paths and build a business monitoring tree for rapid issue localization.
Product dimension: analyze key product metrics and construct shared dashboards.
At this stage we invested a lot of manpower in monitoring but achieved little, mainly due to the following reasons:
Building dashboards from the bottom up required continuous effort to fill metric gaps.
We focused more on service‑quality metrics than on product‑quality metrics, lacking sufficient understanding of product indicators.
The combination of openfalcon and Grafana imposed high maintenance costs and limited capabilities.
Everyone had to learn the basics of monitoring, which presented a high entry barrier.
Alert relevance to business was low; minor fluctuations triggered false alarms, while critical issues often lacked proper alerts.
Consequently, despite heavy effort in building and maintaining dashboards and handling alerts, the results were unsatisfactory.
The Road Ahead
After the first version stabilized, we had a long period without major new requirements, prompting us to rethink our approach. Internally we began developing our own RPC framework, and drawing on experience from WeChat, we turned our attention to data‑driven monitoring platforms such as Prometheus.
In monitoring we adopted an SDK for data reporting, Prometheus for data collection, and Grafana for dashboards, creating a more flexible and convenient observability solution.
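To make the pipeline concrete, here is what a Prometheus scrape configuration for such a setup might look like. This is an illustrative fragment, not our actual config; the job name, port, and targets are placeholders:

```yaml
# prometheus.yml (fragment) -- job name, port, and targets are illustrative
scrape_configs:
  - job_name: "rpc-services"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["service-a:9100", "service-b:9100"]
```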
Service dimension: While developing the RPC framework we embedded service‑level reporting directly into the framework and provided an SDK for other teams to integrate existing services. We then maintained two sets of dashboards: a global view for daily operations and detailed views for troubleshooting.
Business & Product dimension: The SDK offers a simple, unified reporting interface, facilitating the construction of business‑ and product‑related dashboards.
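As a rough illustration of what "a simple, unified reporting interface" means, the toy sketch below keeps one counter per metric name plus label set. The class and method names are hypothetical, not the real SDK:

```python
from collections import defaultdict
from threading import Lock


class MetricReporter:
    """Toy stand-in for a unified reporting SDK: one interface for
    service, business, and product metrics, keyed by name + labels."""

    def __init__(self):
        self._counters = defaultdict(float)
        self._lock = Lock()

    def report(self, name, value=1.0, **labels):
        # Sort labels into a stable key so (name, labels) is one series.
        key = (name, tuple(sorted(labels.items())))
        with self._lock:
            self._counters[key] += value

    def value(self, name, **labels):
        key = (name, tuple(sorted(labels.items())))
        return self._counters[key]


reporter = MetricReporter()
reporter.report("order_created", channel="app")
reporter.report("order_created", channel="app")
reporter.report("order_created", channel="web")
```

Because every team reports through the same narrow interface, business and product dashboards can be assembled from consistent series names rather than per-team ad-hoc formats.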
The service-level dashboards thus became a unified view we could iterate on, and as our understanding of monitoring deepened, the dashboards became increasingly effective.
For alerting we combined Prometheus (data calculation), Promgen (rule management), AlertManager (alert handling), Webhook (alert invocation), and enterprise WeChat groups to build a complete alert chain.
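The webhook step of this chain can be sketched as a small function that renders an Alertmanager webhook payload into a text message for a chat group. The payload shape below follows Alertmanager's webhook format; the message layout and field choices are our own illustration, not the exact format we ship to enterprise WeChat:

```python
def format_alerts(payload):
    """Render an Alertmanager webhook payload into a plain-text
    message suitable for posting to a chat group."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # One headline per alert: status, alert name, severity.
        lines.append("[{status}] {name} ({severity})".format(
            status=alert.get("status", "unknown").upper(),
            name=labels.get("alertname", "unnamed"),
            severity=labels.get("severity", "none"),
        ))
        if "summary" in annotations:
            lines.append("  " + annotations["summary"])
    return "\n".join(lines)


payload = {
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighErrorRate", "severity": "critical"},
        "annotations": {"summary": "5xx ratio above 5% for service-a"},
    }]
}
message = format_alerts(payload)
```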
Freedom Achieved
In monitoring and alerting we frequently encounter the following problems:
1. Threshold setting: determining appropriate thresholds for different business scenarios and metrics.
2. Traffic fluctuation: ideally the system should recognize traffic patterns and adjust alert thresholds automatically.
3. Transient alerts: brief, recurring issues that appear sporadically and are often ignored.
4. Information overload: excessive alerts flood inboxes, reducing their usefulness.
5. Fault localization: complex scenarios require detailed context (time, location, error code, service, interface, etc.) to pinpoint the root cause.
How have we addressed these issues?
For problems 1 and 2 we introduced anomaly‑detection algorithms into the monitoring platform, achieving good results.
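Our platform's actual algorithm is not shown here, but a common baseline for this kind of anomaly detection is a rolling z-score: learn the recent distribution of a metric and flag points that deviate too far from it. The sketch below is a minimal, self-contained illustration, not our production code:

```python
from collections import deque
from math import sqrt


def make_detector(window=60, z_threshold=3.0):
    """Rolling z-score detector: flag a point as anomalous when it is
    more than z_threshold standard deviations from the window mean."""
    history = deque(maxlen=window)

    def observe(value):
        anomalous = False
        if len(history) >= 10:  # wait for a minimal baseline
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / len(history)
            std = sqrt(var)
            if std > 0 and abs(value - mean) / std > z_threshold:
                anomalous = True
        if not anomalous:
            history.append(value)  # only learn from normal points
        return anomalous

    return observe


detect = make_detector(window=30, z_threshold=3.0)
baseline = [100 + (i % 5) for i in range(30)]  # steady traffic around 100-104
flags = [detect(v) for v in baseline]
spike_flag = detect(500)  # sudden surge well outside the learned band
```

Because the threshold is relative to recent behavior rather than a fixed number, the same detector adapts across services with very different traffic levels, which is what makes this family of approaches attractive for problems 1 and 2.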
Problem 3 was solved using Prometheus' native capabilities.
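The Prometheus-native mechanism most relevant to transient alerts is the `for` clause in alerting rules: an alert fires only after its condition has held continuously for the stated duration, which filters out brief blips. The rule below is illustrative; the metric names and thresholds are placeholders:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Illustrative expression: 5xx ratio over the last 5 minutes.
        expr: sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m   # condition must hold for 10 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "5xx ratio above 5% for 10 minutes"
```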
For problem 4 we tiered alert metrics, configuring only critical alerts at the top of the call chain and linking them to detailed dashboards, resulting in fewer, more precise alerts that are easier to maintain.
Problem 5 remains unsolved, but we have a plan to address it in future work.
What's Next
Leveraging Prometheus' capabilities as a data platform, our next step is to build a tree-shaped call graph of all services and automatically analyze error propagation along it for root-cause detection.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and hope to accompany you throughout your operations career, growing together.