From Zero to Scalable Monitoring: Lessons Learned Building a Prometheus‑Based SRE Platform
This article recounts a two‑year journey of building a company‑wide monitoring system, detailing early challenges with OpenFalcon, the shift to a Prometheus‑Grafana stack, architectural decisions across service, business, and product dimensions, and practical solutions to alert fatigue, threshold setting, and fault isolation.
Background
Over two years a monitoring platform was built from scratch for an entire business group, eventually covering more than 200 services, 700 instances, and tens of thousands of metrics.
Initial State – No Monitoring
When the business launched, there was no internal monitoring. Only Nginx ingress metrics were available, and incidents were discovered solely after user complaints. Troubleshooting required manual log inspection across multiple services, making each deployment stressful.
First Attempt with OpenFalcon + Grafana
After the first major release the team adopted the open‑source OpenFalcon system (maintained by a sibling team) and used Grafana for dashboards. Three monitoring dimensions were defined:
Service dimension: client and server views covering status, performance, quality, and capacity.
Business dimension: key‑path business monitoring trees for rapid fault localisation.
Product dimension: product‑level key metrics and shared dashboards.
Despite a heavy investment of engineering effort, the approach delivered limited value because:
Bottom‑up dashboard construction required continuous manual addition of missing metrics.
Metrics focused on service quality, neglecting product‑level insights.
OpenFalcon + Grafana imposed high operational overhead for metric collection and alert maintenance.
Engineers faced a steep learning curve to understand basic monitoring concepts.
Alert rules were loosely tied to business logic, causing frequent false alarms and missed critical alerts.
Transition to Prometheus
When the first version stabilised, the team explored a data‑centric stack built around an internally developed RPC framework (inspired by experience at WeChat). The stack consists of:
SDK for metric emission (embedded in the RPC framework and provided for legacy services).
Prometheus for scraping and storing metrics.
Grafana for visualisation.
Monitoring dimensions were re‑implemented as follows:
Service dimension: metrics emitted directly from the RPC framework; two sets of dashboards were maintained – a global view for daily operations and detailed views for deep troubleshooting.
Business & Product dimensions: a unified, simple SDK interface enabled rapid construction of business‑ and product‑level dashboards.
This unified view allowed iterative optimisation as operational experience grew.
Alerting Pipeline
The alerting chain was rebuilt with:
Prometheus for rule evaluation.
Promgen for managing alert rule definitions.
Alertmanager for routing alerts.
Webhook integrations to enterprise WeChat groups for notification.
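The last hop of this chain can be sketched as a small adapter that turns an Alertmanager webhook payload into a chat message. The article does not describe the team's adapter, so everything here is an assumption: the `service` and `code` labels, the `summary` and `dashboard` annotations, and the `msgtype`/`text` body shape assumed for an enterprise WeChat group robot. Only the top‑level payload fields (`alerts`, `status`, `labels`, `annotations`, `startsAt`) follow Alertmanager's webhook format.

```python
def format_alert(alert):
    """Render one alert from an Alertmanager webhook payload as plain text."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "unknown").upper()
    lines = [
        f"[{status}] {labels.get('alertname', 'unnamed alert')}",
        f"Time: {alert.get('startsAt', 'n/a')}",
        # The label and annotation names below are hypothetical conventions.
        f"Service: {labels.get('service', 'n/a')}",
        f"Instance: {labels.get('instance', 'n/a')}",
        f"Error code: {labels.get('code', 'n/a')}",
        f"Summary: {annotations.get('summary', 'n/a')}",
    ]
    dashboard = annotations.get("dashboard")
    if dashboard:
        lines.append(f"Dashboard: {dashboard}")
    return "\n".join(lines)


def build_wechat_message(payload):
    """Build a text message body (assumed group-robot format) for all alerts."""
    alerts = payload.get("alerts", [])
    content = "\n\n".join(format_alert(a) for a in alerts)
    return {"msgtype": "text", "text": {"content": content}}


# Example payload trimmed to the fields used above; values are made up.
example = {
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighErrorRate",
                "service": "order",
                "instance": "10.0.0.1:8080",
                "code": "500",
            },
            "annotations": {"summary": "error rate above threshold"},
            "startsAt": "2024-01-01T00:00:00Z",
        }
    ]
}
print(build_wechat_message(example)["text"]["content"])
```

Keeping the formatting in a pure function, separate from the HTTP handling, makes it easy to check that every message carries the time, location, and error context an on‑call engineer needs.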
Recurring Alert Challenges
Threshold definition: determining appropriate limits for diverse business scenarios.
Traffic volatility: needing dynamic thresholds that adapt to natural traffic patterns.
Transient alerts: short‑lived spikes that are hard to investigate promptly.
Information overload: excessive alerts flooding notification channels when every possible condition is monitored.
Fault localisation: complex alerts must convey time, location, error codes, and related service/component context.
Implemented Solutions
Integrated anomaly‑detection algorithms to generate adaptive thresholds, addressing static‑threshold problems.
Leveraged Prometheus’ built‑in query functions (e.g., rate, increase) together with recording rules to create traffic‑aware thresholds.
Adopted a hierarchical alert strategy: only top‑level critical alerts are defined; each alert includes a link to a detailed Grafana dashboard for deeper analysis.
Remaining issues such as more sophisticated transient‑alert handling are slated for future work.
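The article does not name the anomaly‑detection algorithms that were integrated. As a stand‑in, the sketch below shows one of the simplest adaptive schemes: an upper bound set at the rolling mean plus k standard deviations of recent samples. The class name, window size, and multiplier are illustrative assumptions, not the team's settings.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Rolling mean + k*sigma upper bound (illustrative stand-in algorithm)."""

    def __init__(self, window=60, k=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # most recent normal samples
        self.k = k
        self.min_samples = min_samples

    def upper_bound(self):
        """Return the current threshold, or None until enough data exists."""
        if len(self.samples) < self.min_samples:
            return None
        return mean(self.samples) + self.k * pstdev(self.samples)

    def observe(self, value):
        """Return True if value exceeds the bound built from earlier samples."""
        bound = self.upper_bound()
        is_anomaly = bound is not None and value > bound
        if not is_anomaly:
            # Anomalous points are kept out so they do not inflate the baseline.
            self.samples.append(value)
        return is_anomaly


# Hypothetical latency samples in milliseconds.
detector = AdaptiveThreshold(window=60, k=3.0, min_samples=10)
for latency in [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]:
    detector.observe(latency)
print(detector.observe(101))  # within the learned band
print(detector.observe(500))  # far above the learned band
```

Because the bound follows the recent window, the same rule tolerates gradual traffic growth while still flagging sudden jumps. One trade‑off of excluding anomalous points is that a genuine, lasting level shift keeps alerting until the baseline is reset.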
Future Direction
With the Prometheus‑driven data platform, the team plans to generate service call‑tree visualisations and automated root‑cause analysis, extending the monitoring system’s intelligence.
Reference
Original article: https://www.jianshu.com/p/06c7dd803d4a
