How to Build a Semi‑Automated Prometheus Monitoring System for <500 Nodes
This article details a practical approach to constructing a semi‑automated monitoring solution for small‑scale services using Prometheus, covering active monitoring concepts, metric types, service‑framework integration, Grafana dashboards, Alertmanager routing, and deployment on Mesos.
Active Monitoring
Monitoring is the foundation of operations. It falls into three types: active, passive, and side-channel.
Active monitoring: instrumentation is embedded in the service before deployment and exposed through logs, agents, REST APIs, and similar channels.
Passive monitoring: black-box checks from outside the service, such as ping.
Side-channel monitoring: signals from external sources, such as user feedback.
We focus on active monitoring at the business level.
Prometheus
Prometheus is an open-source monitoring system inspired by Google's Borgmon; it is not an official Google implementation. We chose it over other TSDBs mainly for its powerful query language, PromQL.
Example metric http_requests_total records request counts. Sample data:
http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/a"} 100
http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/b"} 110
http_requests_total{instance="1.1.1.2:80",job="cluster2",location="/b"} 100
http_requests_total{instance="1.1.1.3:80",job="cluster3",location="/c"} 110
PromQL can aggregate on any label, e.g.:
sum(rate(http_requests_total[1m])) by (instance) – per‑instance QPS.
sum(rate(http_requests_total[1m])) by (job, location) – QPS per cluster and path.
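The grouping step in those queries can be sketched in Python. This is only an illustration of what "sum ... by (labels)" does: it groups the raw sample values shown above and ignores rate(), which needs samples over time. The names SAMPLES and sum_by are hypothetical and not part of Prometheus.

```python
from collections import defaultdict

# Sample values copied from the example series above.
SAMPLES = [
    ({"instance": "1.1.1.1:80", "job": "cluster1", "location": "/a"}, 100.0),
    ({"instance": "1.1.1.1:80", "job": "cluster1", "location": "/b"}, 110.0),
    ({"instance": "1.1.1.2:80", "job": "cluster2", "location": "/b"}, 100.0),
    ({"instance": "1.1.1.3:80", "job": "cluster3", "location": "/c"}, 110.0),
]


def sum_by(samples, by):
    """Group samples by the given label names and sum the values per group."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The group key keeps only the requested labels; all others are dropped.
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += value
    return dict(totals)


per_instance = sum_by(SAMPLES, ("instance",))
per_job_and_path = sum_by(SAMPLES, ("job", "location"))
```

Grouping by instance merges the two series of 1.1.1.1:80 into one value, while grouping by job and location keeps all four series separate.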
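Metric types include Counter (a cumulative value that only increases), Gauge (a value that can go up and down), and Histogram (observations counted into buckets), each suited to different data. Prometheus also defines a Summary type.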
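To make the difference between the types concrete, here is a minimal pure-Python sketch. The classes are hypothetical stand-ins for illustration, not the API of any Prometheus client library; the Histogram follows the Prometheus convention that bucket bounds are inclusive upper limits ("le").

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Value that can go up or down, e.g. current queue length."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations into buckets, e.g. request latency in seconds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return cumulative bucket counts, as Prometheus exposes them."""
        out, running = [], 0
        for count in self.counts:
            running += count
            out.append(running)
        return out
```

With bounds of 0.1, 0.5, and 1.0 seconds, an observation of exactly 0.5 is counted in the 0.5 bucket, and anything above 1.0 falls only into the +Inf bucket.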
Service Framework Refactoring
The team uses a unified service framework with a master/worker model, multi‑protocol support, module loading, and asynchronous downstream API.
To expose internal metrics we added:
A registry for metric objects (Counter, Histogram) that works with both multi-threaded and multi-process workers.
Instrumentation points that use flexible labels (e.g., a single http_requests_total metric with a location label).
Coverage of common business metrics such as QPS, latency, and error ratio.
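The registry and label ideas above can be sketched as follows. This is a simplified, assumed design and not the team's actual framework code: the class names Registry and LabeledCounter are hypothetical, only thread safety is covered (multi-process aggregation is out of scope), and label values are not escaped as the text exposition format requires for backslashes, quotes, and newlines.

```python
import threading


class LabeledCounter:
    """Thread-safe counter family keyed by label values (hypothetical API)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        key = tuple(labels[n] for n in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self):
        """Render this family in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            items = sorted(self._values.items())
        for key, value in items:
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


class Registry:
    """Holds one metric object per name so that modules share instances."""

    def __init__(self):
        self._metrics = {}
        self._lock = threading.Lock()

    def counter(self, name, help_text, label_names):
        with self._lock:
            if name not in self._metrics:
                self._metrics[name] = LabeledCounter(name, help_text, label_names)
            return self._metrics[name]

    def render(self):
        with self._lock:
            metrics = [self._metrics[n] for n in sorted(self._metrics)]
        return "".join(m.render() for m in metrics)


registry = Registry()
requests = registry.counter("http_requests_total", "Total HTTP requests.", ["location"])
requests.inc(location="/a")
requests.inc(location="/b")
```

A single http_requests_total family with a location label replaces one metric per path, which is what lets PromQL aggregate or filter by path later.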
Data Collection and Visualization
After exposing metrics, Prometheus scrapes them. Important considerations:
Metric and label names must conform to the Prometheus naming rules; if a target returns output that cannot be parsed, the entire scrape of that target fails.
Limit exported data to avoid unnecessary load.
Prometheus is CPU‑ and I/O‑intensive; use ample CPU, memory, and SSD storage.
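A small pre-export check can catch naming problems before Prometheus ever scrapes the service. The sketch below assumes the classic ASCII naming rules from the Prometheus data model; newer Prometheus releases can be configured to accept a wider character set, so treat the patterns as a conservative baseline. The function names are hypothetical.

```python
import re

# Classic Prometheus naming rules (ASCII only).
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """Return True if the whole name matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


def is_valid_label_name(name):
    """Return True for a usable label name; names starting with "__" are reserved."""
    return LABEL_NAME_RE.fullmatch(name) is not None and not name.startswith("__")
```

Running such a check when a metric is registered, rather than at scrape time, turns a silent monitoring gap into an immediate error during development.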
Grafana is preferred for dashboards; a unified dashboard template can be packaged as a Grafana plugin.
Dashboard rows show real‑time QPS, latency, queue time, core dump count, downstream failure rates, etc.
Example PromQL query for top‑5 downstream error rates:
topk(5, 100 * sum(rate(downstream_responses{error_code!="0"}[5m])) by (job, server) / sum(rate(downstream_responses[5m])) by (job, server))
The range selector [5m] balances responsiveness and stability; alerts typically use a 1-minute window.
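The arithmetic behind the top-5 error-rate query can be mirrored in Python. This sketch assumes the per-second rates have already been computed and grouped by (job, server); the function name and input shape are hypothetical. Unlike PromQL, which would yield NaN or +Inf for a zero denominator, this version simply skips such pairs.

```python
import heapq


def top_error_rates(error_rates, total_rates, k=5):
    """Return the k (job, server) pairs with the highest error percentage."""
    ratios = {}
    for key, errors in error_rates.items():
        total = total_rates.get(key, 0.0)
        if total > 0:  # simplification: skip pairs with no traffic
            ratios[key] = 100.0 * errors / total
    # nlargest returns the entries sorted from highest to lowest percentage.
    return heapq.nlargest(k, ratios.items(), key=lambda item: item[1])


errors = {("a", "s1"): 1.0, ("a", "s2"): 5.0, ("b", "s1"): 0.5}
totals = {("a", "s1"): 100.0, ("a", "s2"): 50.0, ("b", "s1"): 10.0}
worst = top_error_rates(errors, totals, k=2)
```

As in the query, only pairs present on both sides of the division produce a result, so a server with traffic but no errors does not appear in the ranking.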
Alertmanager
Prometheus evaluates alert rules periodically, at its configured evaluation interval, and forwards firing alerts to Alertmanager, which routes them to receivers such as webhooks, email, Slack, or third-party services.
Alertmanager currently handles only simple routing for us; more advanced hierarchical alerting is a topic for the future.
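For the webhook route, the receiving side gets a JSON document containing an "alerts" list, where each alert carries "status", "labels", and "annotations". A minimal sketch of turning that payload into readable lines is shown below; the function name, the output format, and the assumption that alerts carry a "summary" annotation and an "instance" label are illustrative choices, not something the article specifies.

```python
import json


def summarize_webhook(body):
    """Turn an Alertmanager webhook JSON body into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines
```

A real receiver would wrap this in an HTTP handler and forward the lines to chat or a paging tool; keeping the parsing in a plain function makes it easy to test without a running Alertmanager.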
Deployment on Mesos
Prometheus, Grafana, and the service framework are packaged together and deployed on a Mesos cluster. Prometheus runs on a dedicated node with persistent storage; Grafana is accessed through an HAProxy/Consul‑managed endpoint.
Conclusion
Prometheus offers a powerful, rapidly evolving monitoring solution suitable for small‑to‑medium teams. Combined with a custom service framework, unified Prometheus/Grafana templates, and Mesos deployment, it enables a semi‑automated, real‑time business monitoring system.
MaGe Linux Operations
