How to Build a Semi‑Automated Prometheus Monitoring System for <500 Nodes

This article details a practical approach to constructing a semi‑automated monitoring solution for small‑scale services using Prometheus, covering active monitoring concepts, metric types, service‑framework integration, Grafana dashboards, Alertmanager routing, and deployment on Mesos.

MaGe Linux Operations

Jul 10, 2022

Active Monitoring

Monitoring is the foundation of operations. It falls into three types: active, passive, and side‑channel.

Active monitoring: instrumentation is built into the service before deployment and exposed through logs, agents, REST APIs, and similar channels.

Passive monitoring: black‑box checks from outside the service, such as ping.

Side‑channel monitoring: signals from external sources, such as user feedback.

We focus on active monitoring at the business level.

Prometheus

Prometheus is an open‑source monitoring system and time‑series database. It is inspired by Google’s Borgmon rather than being an official port of it. We chose it over other TSDBs mainly for its powerful query language, PromQL.

For example, the metric http_requests_total records request counts. Sample data:

http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/a"} 100
http_requests_total{instance="1.1.1.1:80",job="cluster1",location="/b"} 110
http_requests_total{instance="1.1.1.2:80",job="cluster2",location="/b"} 100
http_requests_total{instance="1.1.1.3:80",job="cluster3",location="/c"} 110

PromQL can aggregate on any label, e.g.:

sum(rate(http_requests_total[1m])) by (instance) – per‑instance QPS.

sum(rate(http_requests_total[1m])) by (job, location) – QPS per cluster and path.
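To make the aggregation concrete, here is a minimal Python sketch of what sum(rate(...)) by (...) computes, assuming two scrapes taken 60 seconds apart. The second scrape's values and the helper names rate and sum_by are invented for illustration. Real PromQL also handles counter resets and extrapolates at the window edges, which this sketch ignores.

```python
from collections import defaultdict

# Two scrapes of http_requests_total, 60 seconds apart.
# Keys are (instance, job, location); the second scrape is made-up data.
T0 = {
    ("1.1.1.1:80", "cluster1", "/a"): 100,
    ("1.1.1.1:80", "cluster1", "/b"): 110,
    ("1.1.1.2:80", "cluster2", "/b"): 100,
    ("1.1.1.3:80", "cluster3", "/c"): 110,
}
T1 = {
    ("1.1.1.1:80", "cluster1", "/a"): 160,
    ("1.1.1.1:80", "cluster1", "/b"): 140,
    ("1.1.1.2:80", "cluster2", "/b"): 220,
    ("1.1.1.3:80", "cluster3", "/c"): 110,
}
LABELS = ("instance", "job", "location")


def rate(first, last, seconds):
    """Per-second increase of each series (no counter-reset handling)."""
    return {k: (last[k] - first[k]) / seconds for k in first if k in last}


def sum_by(series, *by):
    """Sum series values, keeping only the labels named in `by`."""
    idx = [LABELS.index(name) for name in by]
    out = defaultdict(float)
    for key, value in series.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


per_second = rate(T0, T1, 60)
qps_by_instance = sum_by(per_second, "instance")
qps_by_job_location = sum_by(per_second, "job", "location")
print(qps_by_instance)
print(qps_by_job_location)
```

Because every series carries all of its labels, the same raw data answers both questions; only the list of labels kept in the grouping changes.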

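The metric types used here are Counter (a value that only increases, such as a request count), Gauge (a value that can rise and fall, such as queue length), and Histogram (observations counted into buckets, such as latency). Prometheus also offers a Summary type.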
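The difference between the types is easiest to see in code. The classes below are a simplified illustration written for this article, not the API of any Prometheus client library; the bucket bounds and sample latencies are arbitrary.

```python
import bisect


class Counter:
    """Only goes up: request counts, error counts."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Can go up or down: queue length, memory in use."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations into buckets: latency, payload sizes."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                 # bucket upper bounds
        self.counts = [0] * (len(self.bounds) + 1)   # last slot is +Inf
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self):
        """Return (upper_bound, number of observations <= bound) pairs."""
        total, out = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.counts):
            total += n
            out.append((bound, total))
        return out


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.2, 0.5, 3.0):
    latency.observe(seconds)
print(latency.cumulative())
```

Prometheus histograms expose their buckets cumulatively in the same way, which is what lets quantiles be estimated at query time from counters alone.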

Service Framework Refactoring

The team uses a unified service framework with a master/worker model, multi‑protocol support, module loading, and an asynchronous downstream API.

To expose internal metrics we added:

A registry for metric objects (Counter, Histogram) that works across multiple threads and multiple processes.

Instrumentation points that use flexible labels (e.g., a single http_requests_total metric with a location label).

Coverage of common business metrics such as QPS, latency, and error ratio.
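The registry and label ideas above can be sketched in a few lines of Python. This is not the team's framework code, which the article does not show. The class names LabeledCounter and Registry are hypothetical, only the multi‑thread case is covered, and label values are not escaped.

```python
import threading


class LabeledCounter:
    """A counter family: one metric name, one value per label combination."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()  # protects _values across threads

    def inc(self, amount=1, **labels):
        key = tuple(labels[name] for name in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Render the family in the Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            items = sorted(self._values.items())
        for key, value in items:
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines)


class Registry:
    """Holds every metric so a single handler can expose them all."""

    def __init__(self):
        self._metrics = []
        self._lock = threading.Lock()

    def register(self, metric):
        with self._lock:
            self._metrics.append(metric)
        return metric

    def render(self):
        with self._lock:
            metrics = list(self._metrics)
        return "\n".join(m.render() for m in metrics) + "\n"


registry = Registry()
requests_total = registry.register(
    LabeledCounter("http_requests_total", "Total HTTP requests.", ["location"]))


def handle(location):
    requests_total.inc(location=location)  # the single instrumentation point


threads = [threading.Thread(target=handle, args=(path,))
           for path in ["/a", "/b", "/a"]]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(registry.render())
```

One metric with a location label replaces a separate metric per path, so new paths need no new instrumentation and remain aggregatable in PromQL.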

Data Collection and Visualization

After exposing metrics, Prometheus scrapes them. Important considerations:

Metric and label names must conform to Prometheus naming rules; if a target exposes malformed data, the scrape of that target fails.

Limit exported data to avoid unnecessary load.

Prometheus is CPU‑ and I/O‑intensive; use ample CPU, memory, and SSD storage.
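Naming problems are cheap to catch before a scrape ever happens. The sketch below checks names against the classic patterns from the Prometheus data model documentation; newer Prometheus releases may accept a wider character set. The function name check_series is hypothetical.

```python
import re

# Classic patterns from the Prometheus data model documentation.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_series(name, labels):
    """Return a list of naming problems; an empty list means no problems."""
    problems = []
    if not METRIC_NAME.fullmatch(name):
        problems.append(f"invalid metric name: {name!r}")
    for label in labels:
        if not LABEL_NAME.fullmatch(label):
            problems.append(f"invalid label name: {label!r}")
        elif label.startswith("__"):
            # Names starting with a double underscore are reserved.
            problems.append(f"label name reserved for internal use: {label!r}")
    return problems


print(check_series("http_requests_total", {"location": "/a"}))
print(check_series("http-requests.total", {"bad-label": "x", "__internal": "y"}))
```

A check like this could run in the framework's unit tests, so that a bad name is rejected at build time rather than discovered as a failed scrape.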

Grafana is preferred for dashboards; a unified dashboard template can be packaged as a Grafana plugin.

Dashboard rows show real‑time QPS, latency, queue time, core dump count, downstream failure rates, etc.

Example PromQL query for top‑5 downstream error rates:

topk(5, 100*sum(rate(downstream_responses{error_code!="0"}[5m])) by (job, server)/sum(rate(downstream_responses[5m])) by (job, server))

The range selector [5m] balances responsiveness and stability: a longer window smooths out short spikes but reacts more slowly. Alerts typically use a 1‑minute window.
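The query is dense, so here is the same calculation unrolled in Python. The input rates are invented, standing in for rate(downstream_responses[5m]) per (job, server, error_code), and the server names are hypothetical.

```python
import heapq

# Hypothetical per-second rates over the last 5 minutes,
# keyed by (job, server, error_code). error_code "0" means success.
rates = {
    ("cluster1", "db", "0"): 95.0,
    ("cluster1", "db", "500"): 5.0,
    ("cluster1", "cache", "0"): 200.0,
    ("cluster2", "auth", "0"): 30.0,
    ("cluster2", "auth", "401"): 7.5,
    ("cluster2", "auth", "503"): 2.5,
}


def error_percent(rates):
    """Percentage of failed responses per (job, server)."""
    total, errors = {}, {}
    for (job, server, code), value in rates.items():
        key = (job, server)
        total[key] = total.get(key, 0.0) + value
        if code != "0":
            errors[key] = errors.get(key, 0.0) + value
    # Like PromQL's one-to-one matching, only keys present on both
    # sides of the division survive.
    return {k: 100 * errors[k] / total[k] for k in errors if total[k] > 0}


def topk(k, values):
    """The k entries with the largest values, highest first."""
    return heapq.nlargest(k, values.items(), key=lambda item: item[1])


top = topk(5, error_percent(rates))
print(top)
```

Note that a (job, server) pair with no failed responses has no series on the numerator side, so it drops out of the result entirely instead of showing as 0%.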

Alertmanager

Prometheus evaluates alert rules at a regular interval (the configured evaluation interval, independent of individual scrapes) and forwards firing alerts to Alertmanager, which can route them via webhook, email, Slack, or third‑party services.

We currently use Alertmanager only for simple routing; more advanced hierarchical alerting is a future topic.
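When the webhook route is used, the receiving side gets a JSON document with an alerts list, each entry carrying status, labels, and annotations. The sketch below shows what a small receiver might do with it. The routing table, channel names, and alert names are hypothetical, and Alertmanager's own routing is configured in its route tree, not in code like this.

```python
import json

# Hypothetical routing table: the first rule whose labels all match wins.
ROUTES = [
    ({"severity": "page"}, "oncall-phone"),
    ({"job": "cluster1"}, "cluster1-email"),
]
DEFAULT_CHANNEL = "ops-chat"


def pick_channel(labels):
    """Choose a notification channel from an alert's labels."""
    for matchers, channel in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return channel
    return DEFAULT_CHANNEL


def handle_webhook(body):
    """Turn a webhook body into (channel, message) pairs."""
    payload = json.loads(body)
    out = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        status = alert.get("status", "unknown")
        name = labels.get("alertname", "?")
        out.append((pick_channel(labels), f"[{status}] {name}: {summary}"))
    return out


# A trimmed-down sample payload with made-up alerts.
sample = json.dumps({
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighErrorRatio", "job": "cluster2",
                    "severity": "page"},
         "annotations": {"summary": "error ratio above 5%"}},
        {"status": "resolved",
         "labels": {"alertname": "HighLatency", "job": "cluster1"},
         "annotations": {"summary": "p99 latency back to normal"}},
    ],
})
messages = handle_webhook(sample)
print(messages)
```

In a real deployment handle_webhook would sit behind an HTTP endpoint that Alertmanager posts to; it is kept as a plain function here so it can be exercised without a server.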

Deployment on Mesos

Prometheus, Grafana, and the service framework are packaged together and deployed on a Mesos cluster. Prometheus runs on a dedicated node with persistent storage; Grafana is accessed through an HAProxy/Consul‑managed endpoint.

Conclusion

Prometheus offers a powerful, rapidly evolving monitoring solution suitable for small‑to‑medium teams. Combined with a custom service framework, unified Prometheus/Grafana templates, and Mesos deployment, it enables a semi‑automated, real‑time business monitoring system.
