Master Prometheus: From Metrics Collection to Alerting and Visualization
Prometheus is an open‑source monitoring solution that covers metric exposition, scraping, storage, querying, visualization, and alerting. This guide walks through its architecture, configuration, custom exporters, PromQL queries, Grafana integration, and alert management, giving developers and ops engineers a comprehensive introduction.
Introduction
Prometheus is an open‑source, full‑stack monitoring solution. It provides metric exposition, scraping, storage, querying, visualization, and alerting, allowing you to gain insight into system health and quickly locate problems.
Ecosystem Overview
Prometheus consists of components for exposing metrics, scraping them, storing them in a built‑in time‑series database, querying with PromQL, visualizing via its own Web UI or Grafana, and sending alerts through Alertmanager.
Metric Exposure
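Each monitored service is configured as a job. Services can expose metrics directly through a client library (SDK) or through exporters (e.g., for MySQL or Consul). Short‑lived batch jobs, which may exit before Prometheus gets a chance to scrape them, can push their metrics to the Pushgateway instead.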
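For a short‑lived job, pushing means sending metrics in Prometheus's text exposition format to the Pushgateway over HTTP. The Go sketch below uses only the standard library rather than client_golang's push package, and assumes a Pushgateway at its default address localhost:9091; the job name nightly_backup and the metric batch_last_success_timestamp_seconds are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"strings"
)

// formatGauge renders one gauge sample in the Prometheus text exposition
// format: a HELP line, a TYPE line, and the sample itself.
func formatGauge(name, help string, value float64) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	fmt.Fprintf(&b, "%s %g\n", name, value)
	return b.String()
}

// pushToGateway PUTs the payload to <gateway>/metrics/job/<job>.
// PUT replaces every metric previously pushed under the same job.
func pushToGateway(gatewayURL, job, payload string) error {
	target := strings.TrimRight(gatewayURL, "/") + "/metrics/job/" + job
	req, err := http.NewRequest(http.MethodPut, target, bytes.NewBufferString(payload))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "text/plain; version=0.0.4")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("pushgateway returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical batch job reporting when it last finished successfully.
	payload := formatGauge("batch_last_success_timestamp_seconds",
		"Last successful run of the batch job.", 1.7e9)
	if err := pushToGateway("http://localhost:9091", "nightly_backup", payload); err != nil {
		fmt.Println("push failed:", err)
	}
}
```

In a real program the client_golang push package wraps this request. A PUT to /metrics/job/<job> replaces all metrics in that group, while a POST replaces only metrics with the same names.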
Metric Scraping
Prometheus primarily uses the pull model: it periodically scrapes the /metrics endpoint of each target. The default scrape interval is one minute and can be changed via scrape_interval in prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
Storage and Query
Scraped metrics are stored as time series in an internal TSDB. PromQL queries these series either at a single point in time (instant queries) or over a time range (range queries).
Alerting
Prometheus evaluates alerting rules written as PromQL expressions and sends firing alerts to Alertmanager, which groups, deduplicates, and silences them and routes notifications to email, Slack, and other receivers.
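Grouping means Alertmanager bundles alerts that share the labels listed in a route's group_by into one notification. The Go sketch below shows only that partitioning step; Alertmanager's timing settings (group_wait, group_interval, repeat_interval), deduplication, and routing tree are left out, and the HighLatency alert is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Alert is a firing alert reduced to its label set.
type Alert struct {
	Labels map[string]string
}

// groupKey joins the values of the group_by labels, so alerts that agree
// on those labels land in the same group (one notification).
func groupKey(a Alert, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, l := range groupBy {
		parts = append(parts, l+"="+a.Labels[l])
	}
	return strings.Join(parts, ",")
}

// groupAlerts partitions alerts by their group key.
func groupAlerts(alerts []Alert, groupBy []string) map[string][]Alert {
	groups := make(map[string][]Alert)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	alerts := []Alert{
		{Labels: map[string]string{"alertname": "HttpSimulatorDown", "instance": "a:8080"}},
		{Labels: map[string]string{"alertname": "HttpSimulatorDown", "instance": "b:8080"}},
		{Labels: map[string]string{"alertname": "HighLatency", "instance": "a:8080"}},
	}
	groups := groupAlerts(alerts, []string{"alertname"})
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output order
	for _, k := range keys {
		fmt.Printf("%s: %d alert(s)\n", k, len(groups[k]))
	}
}
```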
Configuration Details
Service Registration
Jobs can be registered statically in scrape_configs or dynamically via service discovery (Consul, DNS, Kubernetes, etc.). Example of dynamic registration through Consul:
scrape_configs:
  - job_name: "node_export_consul"
    metrics_path: /node_metrics
    scheme: http
    consul_sd_configs:
      - server: localhost:8500
        services:
          - node_exporter
Dynamic Registration Note
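When using dynamic registration, make sure metrics_path matches the path on which the discovered targets actually serve metrics. Otherwise the scrape returns something that is not in the Prometheus exposition format (an HTML page, for example), and Prometheus reports a parse error about an "INVALID" start token.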
Configuration Reload
Start Prometheus with --web.enable-lifecycle to allow configuration reloads at runtime via POST /-/reload:
prometheus --config.file=/usr/local/etc/prometheus.yml --web.enable-lifecycle
curl -v --request POST 'http://localhost:9090/-/reload'
Metric Types
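Prometheus defines four metric types:
Counter: a cumulative value that only increases (or resets to zero when the process restarts), e.g., request counts.
Gauge: a value that can go up and down, e.g., memory usage.
Histogram: counts observations into configurable buckets and also exposes their sum and count; quantiles are computed at query time with histogram_quantile().
Summary: computes quantiles on the client side over a sliding time window and exposes them pre‑computed, along with the sum and count.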
Exporting Metrics
Use the official client_golang library to expose metrics. A minimal exporter:
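package main
import (
	"log"
	"net/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
	// Serve every metric in the default registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so surface the error instead of dropping it.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Custom counters, gauges, histograms, and summaries can be defined and registered: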
// Requires import "github.com/prometheus/client_golang/prometheus".
myCounter := prometheus.NewCounter(prometheus.CounterOpts{Name: "my_counter_total", Help: "custom counter"})
myGauge := prometheus.NewGauge(prometheus.GaugeOpts{Name: "my_gauge_num", Help: "custom gauge"})
// Don't end histogram names in _bucket: the client appends _bucket, _sum, and _count itself.
myHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{Name: "my_histogram_seconds", Help: "custom histogram", Buckets: []float64{0.1, 0.2, 0.3, 0.4, 0.5}})
// Objectives maps each target quantile to its allowed absolute error.
mySummary := prometheus.NewSummary(prometheus.SummaryOpts{Name: "my_summary_seconds", Help: "custom summary", Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001}})
prometheus.MustRegister(myCounter, myGauge, myHistogram, mySummary)
Metrics with labels use NewCounterVec (or NewGaugeVec, NewHistogramVec, NewSummaryVec) and With(prometheus.Labels{...}) to select the labeled series to update:
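// Use a name distinct from myCounter: registering two collectors with the same name fails.
myCounterVec := prometheus.NewCounterVec(prometheus.CounterOpts{Name: "my_labeled_counter_total", Help: "custom counter with labels"}, []string{"label1", "label2"})
prometheus.MustRegister(myCounterVec)
myCounterVec.With(prometheus.Labels{"label1": "1", "label2": "2"}).Inc()
PromQL Basics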
A PromQL expression evaluates to one of four types: an instant vector, a range vector, a scalar, or a string. Common functions include rate() and irate() for per‑second rates of counters, sum() with by or without for aggregation, and histogram_quantile() for percentile calculations.
# Example: QPS per path
sum(rate(demo_api_request_duration_seconds_count{job="demo",method="GET",status="200"}[5m])) by (path)
Grafana Visualization
In Grafana, add Prometheus as a data source (pointing at its URL, e.g., http://localhost:9090), then build dashboards from PromQL queries.
Alertmanager
Define alert rules in a separate file, list that file under rule_files in prometheus.yml, and point Prometheus at Alertmanager in its alerting section. The example rule fires when no target of the http_srv job has been up for one minute:
groups:
  - name: simulator-alert-rule
    rules:
      - alert: HttpSimulatorDown
        expr: sum(up{job="http_srv"}) == 0
        for: 1m
        labels:
          severity: critical
Configure Alertmanager with SMTP settings to send email notifications, and use its web UI to silence alerts.
Summary
This guide provides a complete overview of Prometheus, covering its architecture, metric collection models, configuration, custom exporters, query language, visualization with Grafana, and alerting with Alertmanager, equipping developers and operations engineers to monitor and observe their systems effectively.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.