Mastering Prometheus: From Metrics Collection to Alerting and Visualization
This comprehensive guide explains Prometheus' architecture, metric collection models, storage format, query language (PromQL), alerting workflow, configuration reload methods, metric types, custom exporters, and how to visualise data with Grafana, providing a complete end‑to‑end monitoring solution.
Introduction
Prometheus, named after the Greek titan who foresaw the future, is an open‑source monitoring system that collects, stores and visualises metrics to give insight into system health.
Overall Ecosystem
Prometheus covers the full pipeline: metric exposition, scraping, storage, visualisation and alerting. Each monitored service is a Job with one or more targets. Official client libraries let you expose custom metrics, and exporters exist for common components such as MySQL or Consul.
Short‑lived scripts or services that cannot be scraped directly can push metrics to a Pushgateway, which Prometheus then scrapes.
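To make "metric exposition" concrete: a scrape is an HTTP GET that returns plain text. The sketch below hand-renders one labelled counter in that text format using only the Go standard library. The type `labelledCounter` and the metric name `http_requests_total` are hypothetical, chosen for illustration; a real service would normally rely on the official client library rather than format the text itself.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelledCounter is a hypothetical, minimal counter with a single label.
type labelledCounter struct {
	name, help, label string
	values            map[string]float64 // label value -> running total
}

// Inc adds one to the series identified by labelValue.
func (c *labelledCounter) Inc(labelValue string) {
	if c.values == nil {
		c.values = make(map[string]float64)
	}
	c.values[labelValue]++
}

// Render returns the counter in the Prometheus text exposition format.
// Series are sorted so the output is deterministic.
func (c *labelledCounter) Render() string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", c.name, c.help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		// %q is a simplification that is adequate for plain ASCII label values.
		fmt.Fprintf(&b, "%s{%s=%q} %g\n", c.name, c.label, k, c.values[k])
	}
	return b.String()
}

func main() {
	c := &labelledCounter{name: "http_requests_total", help: "Requests served.", label: "code"}
	c.Inc("200")
	c.Inc("200")
	c.Inc("500")
	fmt.Print(c.Render())
}
```

Each output line after the `# HELP` and `# TYPE` comments is one time series: the metric name, its labels in braces, and the current value.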
Metric Scraping Models
Pull model: Prometheus actively pulls metrics from the exposed endpoint at regular intervals (default 1 minute, configurable via scrape_interval).
Push model: Monitored services push metrics to a gateway; Prometheus pulls from the gateway.
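As a rough illustration of the push model, a short-lived job can send one sample to a Pushgateway with a plain HTTP POST to `/metrics/job/<job name>`. The sketch below assumes a Pushgateway listening on `localhost:9091`; the job name `nightly_backup` and the metric name are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// buildPushRequest prepares a POST carrying one sample for the group
// identified by job. baseURL is the address of the Pushgateway.
func buildPushRequest(baseURL, job, metric string, value float64) (*http.Request, error) {
	endpoint := strings.TrimRight(baseURL, "/") + "/metrics/job/" + url.PathEscape(job)
	// The body uses the text exposition format: one "name value" pair
	// per line, terminated by a newline.
	body := fmt.Sprintf("%s %g\n", metric, value)
	return http.NewRequest(http.MethodPost, endpoint, strings.NewReader(body))
}

func main() {
	// Assumed address and hypothetical names, for illustration only.
	req, err := buildPushRequest("http://localhost:9091", "nightly_backup",
		"backup_last_success_timestamp_seconds", 1.7e9)
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("push failed (is a Pushgateway running?):", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("Pushgateway answered:", resp.Status)
}
```

Prometheus then scrapes the Pushgateway like any other target, so the pushed value stays visible after the job itself has exited.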
Metric Storage and Query
Scraped metrics are stored in Prometheus' built‑in time‑series database. Queries are performed with PromQL, either via the built‑in Web UI or third‑party tools such as Grafana.
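Tools such as Grafana run their PromQL through Prometheus' HTTP API. A minimal sketch of an instant query, assuming a server at `localhost:9090`: build the `/api/v1/query` URL, then decode the JSON reply. The helper names are hypothetical, and the sample body in `main` is a canned response, so nothing here contacts a real server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
)

// instantQueryURL builds the URL of an instant query for expr.
func instantQueryURL(base, expr string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(expr)
}

// queryResponse mirrors the parts of an instant-vector reply used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]interface{}    `json:"value"` // [unix seconds, "value"]
		} `json:"result"`
	} `json:"data"`
}

// parseInstantVector returns the sample value keyed by the instance label.
func parseInstantVector(body []byte) (map[string]float64, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" || resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected response: status=%q type=%q",
			resp.Status, resp.Data.ResultType)
	}
	out := make(map[string]float64)
	for _, s := range resp.Data.Result {
		raw, ok := s.Value[1].(string) // sample values arrive as strings
		if !ok {
			return nil, fmt.Errorf("sample value is not a string")
		}
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		out[s.Metric["instance"]] = v
	}
	return out, nil
}

func main() {
	fmt.Println(instantQueryURL("http://localhost:9090", `up{job="prometheus"}`))
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[` +
		`{"metric":{"__name__":"up","instance":"localhost:9090","job":"prometheus"},` +
		`"value":[1435781451.781,"1"]}]}}`)
	values, err := parseInstantVector(sample)
	fmt.Println(values, err)
}
```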
Alerting
Prometheus evaluates alerting rules, each a PromQL expression that typically compares a metric against a threshold, and sends the resulting alerts to Alertmanager. Alertmanager then routes them to email, WeChat and other receivers.
Working Principle
Service Registration
Each monitored service registers as a Job with a list of targets. Registration can be static (IP and port listed in scrape_configs) or dynamic using service‑discovery mechanisms (Consul, DNS, Kubernetes, etc.). Example static config:
<code>scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
</code>
Dynamic Consul example:
<code>- job_name: "node_export_consul"
  metrics_path: "/node_metrics"
  scheme: http
  consul_sd_configs:
    - server: localhost:8500
      services:
        - node_exporter
</code>
Configuration Reload
After editing prometheus.yml, reload the configuration without restarting by starting Prometheus with --web.enable-lifecycle and sending a POST request to /-/reload:
<code>prometheus --config.file=/usr/local/etc/prometheus.yml --web.enable-lifecycle</code>
<code>curl -v -X POST http://localhost:9090/-/reload</code>
Sending SIGHUP to the Prometheus process triggers the same reload. The reload handler is implemented in the web module and signals the main loop to reload the config.
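The hand-off between the web module and the main loop can be pictured as a channel of reply channels. What follows is a simplified sketch of that pattern, not Prometheus' actual source: all names are hypothetical, it only accepts POST, and `simulateReload` uses an in-memory recorder instead of a real listener.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reloadHandler forwards each POST to the main loop and waits for the outcome.
func reloadHandler(reloadCh chan<- chan error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "only POST is allowed", http.StatusMethodNotAllowed)
			return
		}
		done := make(chan error)
		reloadCh <- done // signal the main loop
		if err := <-done; err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// runMainLoop applies reload requests one at a time until reloadCh is closed.
func runMainLoop(reloadCh <-chan chan error, reload func() error) {
	for done := range reloadCh {
		done <- reload()
	}
}

// simulateReload wires both halves together, sends one request, and returns
// the HTTP status code plus the number of reloads that were performed.
func simulateReload(method string) (status, reloads int) {
	reloadCh := make(chan chan error)
	go runMainLoop(reloadCh, func() error { reloads++; return nil })
	defer close(reloadCh)

	rec := httptest.NewRecorder()
	reloadHandler(reloadCh)(rec, httptest.NewRequest(method, "/-/reload", nil))
	return rec.Code, reloads
}

func main() {
	fmt.Println(simulateReload("POST")) // 200 1
	fmt.Println(simulateReload("GET"))  // 405 0
}
```

Because the handler blocks until the main loop replies, the HTTP response reflects whether the new configuration was actually accepted.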
Metric Types
Prometheus stores all metrics as time series but defines four logical types to aid interpretation:
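Counter: monotonically increasing, reset to zero only when the process restarts (e.g., request count).
Gauge: can go up or down (e.g., memory usage).
Histogram: observations counted in cumulative buckets, used for latency or size distributions.
Summary: quantiles pre‑computed on the client side.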
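Histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts them all. A minimal sketch of that bookkeeping, with a hypothetical helper `cumulativeBuckets` and made-up observations:

```go
package main

import "fmt"

// cumulativeBuckets returns one count per upper bound, followed by the
// implicit +Inf bucket as the last element.
func cumulativeBuckets(bounds, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++ // every bucket at or above v is incremented
			}
		}
		counts[len(bounds)]++ // +Inf always matches
	}
	return counts
}

func main() {
	bounds := []float64{0.1, 0.2, 0.3, 0.4, 0.5}
	counts := cumulativeBuckets(bounds, []float64{0.05, 0.15, 0.15, 0.45, 0.7})
	for i, le := range bounds {
		fmt.Printf("le=%g: %d\n", le, counts[i])
	}
	fmt.Printf("le=+Inf: %d\n", counts[len(bounds)])
}
```

With these inputs the counts are 1, 3, 3, 3, 4 and 5: the observation 0.7 exceeds every finite bound and appears only in `+Inf`.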
Exporters and Custom Exporters
Use community exporters for components like MySQL or Kafka, or write a custom exporter with the Go client library:
<code>package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve everything in the default registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so log the error and exit.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
</code>
Register custom metrics (counter, gauge, histogram, summary) and optionally add labels using NewCounterVec, NewGaugeVec, etc.
<code>// Assumes the "github.com/prometheus/client_golang/prometheus" import.
myCounter := prometheus.NewCounter(prometheus.CounterOpts{Name: "my_counter_total", Help: "custom counter"})
myGauge := prometheus.NewGauge(prometheus.GaugeOpts{Name: "my_gauge", Help: "custom gauge"})
myHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{Name: "my_histogram", Help: "custom histogram", Buckets: []float64{0.1, 0.2, 0.3, 0.4, 0.5}})
mySummary := prometheus.NewSummary(prometheus.SummaryOpts{Name: "my_summary", Help: "custom summary", Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001}})
// MustRegister panics if a collector is invalid or already registered.
prometheus.MustRegister(myCounter, myGauge, myHistogram, mySummary)
</code>
PromQL Basics
PromQL expressions are of four kinds: string literals, scalars, instant vectors, and range vectors. Examples:
Instant query: go_gc_duration_seconds_count
Label filter: go_gc_duration_seconds_count{instance="127.0.0.1:9600"}
Regex filter: go_gc_duration_seconds_count{instance=~"localhost.*"}
Range query (last 5 minutes): go_gc_duration_seconds_count[5m]
Common functions include rate() (average per‑second increase), irate() (instantaneous rate), and aggregation functions such as sum() by() or sum() without(). Quantile calculation for histograms uses histogram_quantile().
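To give a feel for what rate() computes over a range vector, here is a deliberately simplified sketch: it sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. The real function also extrapolates to the edges of the window, which is omitted here; the type and function names are hypothetical.

```go
package main

import "fmt"

// sample is one point of a counter series: unix seconds and value.
type sample struct{ t, v float64 }

// simpleRate returns the average per-second increase across samples.
// The boolean is false when fewer than two usable samples are available.
func simpleRate(samples []sample) (float64, bool) {
	if len(samples) < 2 {
		return 0, false
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].v - samples[i-1].v
		if delta < 0 {
			// A drop means the counter restarted from zero, so the new
			// value is itself the increase since the reset.
			delta = samples[i].v
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].t - samples[0].t
	if elapsed <= 0 {
		return 0, false
	}
	return increase / elapsed, true
}

func main() {
	// The counter resets between t=20 and t=40.
	series := []sample{{0, 100}, {20, 130}, {40, 10}, {60, 30}}
	fmt.Println(simpleRate(series)) // prints: 1 true
}
```

The increases are 30, 10 (after the reset) and 20, so 60 over 60 seconds gives a rate of 1 per second.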
Grafana Visualization
Connect Grafana to Prometheus as a data source, create dashboards, and write PromQL queries in panels to visualise metrics. Dashboards can be exported as JSON for reuse.
Alertmanager Configuration
Define alert rules in a separate file (e.g., alert_rules.yml) and reference it from prometheus.yml under rule_files. The example rule fires when every target of the job http_srv has been down for one minute:
<code>groups:
  - name: simulator-alert-rule
    rules:
      - alert: HttpSimulatorDown
        expr: sum(up{job="http_srv"}) == 0
        for: 1m
        labels:
          severity: critical
</code>
Configure Alertmanager to route alerts to email, Slack, etc., and optionally silence alerts via its Web UI.
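The `for: 1m` clause in the rule above means an alert is not sent the moment the expression becomes true: it first sits in a pending state and only fires once the expression has held for the full duration. A small sketch of that state machine, assuming evaluation times are given in seconds; the type `ruleTracker` and its fields are hypothetical.

```go
package main

import "fmt"

type alertState int

const (
	inactive alertState = iota
	pending
	firing
)

func (s alertState) String() string {
	return [...]string{"inactive", "pending", "firing"}[s]
}

// ruleTracker follows one alert through the states implied by "for".
type ruleTracker struct {
	forSeconds  int64 // how long the expression must hold before firing
	activeSince int64 // evaluation time at which it first became true
	state       alertState
}

// evaluate records one rule evaluation at time now (in seconds).
func (r *ruleTracker) evaluate(now int64, exprTrue bool) alertState {
	if !exprTrue {
		r.state = inactive // any false evaluation resets the alert
		return r.state
	}
	if r.state == inactive {
		r.activeSince = now
		r.state = pending
	}
	if now-r.activeSince >= r.forSeconds {
		r.state = firing
	}
	return r.state
}

func main() {
	r := &ruleTracker{forSeconds: 60}
	fmt.Println(r.evaluate(0, true))   // pending
	fmt.Println(r.evaluate(30, true))  // pending
	fmt.Println(r.evaluate(60, true))  // firing
	fmt.Println(r.evaluate(75, false)) // inactive
}
```

A brief outage that recovers before the minute is up therefore never reaches Alertmanager.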