
Mastering Prometheus: From Metrics Collection to Alerting and Visualization

This comprehensive guide explains Prometheus' architecture, metric collection models, storage format, query language (PromQL), alerting workflow, configuration reload methods, metric types, custom exporters, and how to visualise data with Grafana, providing a complete end‑to‑end monitoring solution.

Efficient Ops

Introduction

Prometheus, named after the Greek titan who foresaw the future, is an open‑source monitoring system that collects, stores and visualises metrics to give insight into system health.

Overall Ecosystem

Prometheus covers the full path from metric exposition and scraping through storage and visualisation to alerting. Each monitored service is a Job with one or more targets. Official client libraries let you expose custom metrics, and exporters exist for common components such as MySQL or Consul.

Short-lived scripts or services that cannot be scraped directly can push metrics to a Pushgateway, which Prometheus then scrapes.
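As a rough illustration of the push path, the sketch below sends one gauge sample to a Pushgateway with a plain HTTP PUT to `/metrics/job/<job>`, using only the Go standard library. The job name, metric name, and the `fakeGateway` stand-in server are hypothetical and exist only so the example runs offline. In real code the `push` package of the Go client library would be the more usual choice.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pushURL builds the push endpoint for a job: <base>/metrics/job/<job>.
// Assumes a job name without slashes or other characters that need escaping.
func pushURL(base, job string) string {
	return strings.TrimRight(base, "/") + "/metrics/job/" + job
}

// push sends a single gauge sample in the text exposition format.
func push(base, job, metric string, value float64) error {
	body := fmt.Sprintf("# TYPE %s gauge\n%s %g\n", metric, metric, value)
	req, err := http.NewRequest(http.MethodPut, pushURL(base, job), strings.NewReader(body))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("push failed: %s", resp.Status)
	}
	return nil
}

// fakeGateway is a stand-in for a Pushgateway so the sketch runs offline.
// It records the path and body of the last request it received.
func fakeGateway(lastPath, lastBody *string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		*lastPath, *lastBody = r.URL.Path, string(b)
	}))
}

func main() {
	var path, body string
	gw := fakeGateway(&path, &body)
	defer gw.Close()

	// Hypothetical batch job reporting how many rows it processed.
	if err := push(gw.URL, "nightly_batch", "batch_rows_processed", 1234); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Printf("%s\n%s", path, body)
}
```

Against a real deployment, `gw.URL` would be replaced by the gateway address (port 9091 by default), and Prometheus would scrape the gateway like any other target.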

Metric Scraping Models

Pull model: Prometheus actively pulls metrics from the exposed endpoint at regular intervals (default 1 minute, configurable via scrape_interval).

Push model: Monitored services push metrics to a gateway; Prometheus pulls from the gateway.
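To make the pull model concrete, here is a minimal sketch of what a single scrape amounts to: an HTTP GET against the target, followed by parsing the text exposition lines. It is not how Prometheus itself is implemented. The parser assumes lines of the form `name{labels} value` with no trailing timestamp, and the stand-in target and its metric are made up for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// parseSample parses one exposition line of the form "name{labels} value".
// Comments and blank lines are skipped; timestamps are not handled (assumption).
func parseSample(line string) (series string, value float64, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, false
	}
	i := strings.LastIndex(line, " ")
	if i < 0 {
		return "", 0, false
	}
	v, err := strconv.ParseFloat(line[i+1:], 64)
	if err != nil {
		return "", 0, false
	}
	return strings.TrimSpace(line[:i]), v, true
}

// scrape performs one pull: GET the endpoint and collect every sample.
func scrape(url string) (map[string]float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	samples := make(map[string]float64)
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		if s, v, ok := parseSample(sc.Text()); ok {
			samples[s] = v
		}
	}
	return samples, sc.Err()
}

func main() {
	// Stand-in target so the sketch runs without a real exporter.
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "# TYPE http_requests_total counter\nhttp_requests_total{code=\"200\"} 42\n")
	}))
	defer target.Close()

	samples, err := scrape(target.URL + "/metrics")
	fmt.Println(samples, err)
}
```

Prometheus repeats this request for every target once per `scrape_interval` and timestamps the samples itself.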

Metric Storage and Query

Scraped metrics are stored in Prometheus' built‑in time‑series database. Queries are performed with PromQL, either via the built‑in Web UI or third‑party tools such as Grafana.
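Tools such as Grafana talk to Prometheus through its HTTP API. The sketch below builds an instant-query URL for `/api/v1/query` and decodes a reply in the documented response shape. The reply body is canned so the example runs offline, and keying the result by the `instance` label is a choice made for this illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// queryURL builds an instant-query request against the HTTP API.
func queryURL(base, expr string) string {
	return base + "/api/v1/query?query=" + url.QueryEscape(expr)
}

// queryResult mirrors the fields of an instant-vector reply used in this sketch.
type queryResult struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [unix timestamp, "sample value"]
		} `json:"result"`
	} `json:"data"`
}

// parseInstantVector decodes a reply body and returns instance -> value string.
func parseInstantVector(body []byte) (map[string]string, error) {
	var r queryResult
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	if r.Status != "success" {
		return nil, fmt.Errorf("query status %q", r.Status)
	}
	out := make(map[string]string)
	for _, s := range r.Data.Result {
		if len(s.Value) != 2 {
			continue
		}
		if v, ok := s.Value[1].(string); ok {
			out[s.Metric["instance"]] = v
		}
	}
	return out, nil
}

func main() {
	fmt.Println(queryURL("http://localhost:9090", `up{job="prometheus"}`))

	// Canned reply so the sketch runs without a Prometheus server.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[` +
		`{"metric":{"__name__":"up","instance":"localhost:9090","job":"prometheus"},"value":[1700000000,"1"]}]}}`)
	fmt.Println(parseInstantVector(body))
}
```

Sample values arrive as strings in the JSON reply, so a caller that needs numbers would convert them with `strconv.ParseFloat`.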

Alerting

Prometheus evaluates alerting rules written in PromQL and sends an alert to Alertmanager when a rule's condition holds, for example when a value exceeds a defined threshold. Alertmanager then routes alerts to receivers such as email or WeChat.

Working Principle

Service Registration

Each monitored service registers as a Job with a list of targets. Registration can be static (IP and port listed in scrape_configs) or dynamic using service-discovery mechanisms (Consul, DNS, Kubernetes, etc.). Example static config:

<code>scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
</code>

Dynamic Consul example:

<code>- job_name: "node_export_consul"
  metrics_path: "/node_metrics"
  scheme: http
  consul_sd_configs:
    - server: localhost:8500
      services:
        - node_exporter
</code>
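Whether a target comes from a static list or from service discovery, the URL that gets scraped is assembled from the job's `scheme`, the target address, and `metrics_path`. The following sketch shows that combination with the defaults `http` and `/metrics`. The function name and the Consul-provided address are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// scrapeURL combines a job's scheme and metrics_path with one target address.
// Empty values fall back to the defaults "http" and "/metrics".
func scrapeURL(scheme, target, metricsPath string) string {
	if scheme == "" {
		scheme = "http"
	}
	if metricsPath == "" {
		metricsPath = "/metrics"
	}
	u := url.URL{Scheme: scheme, Host: target, Path: metricsPath}
	return u.String()
}

func main() {
	// Static example: only the target is configured.
	fmt.Println(scrapeURL("", "localhost:9090", ""))
	// Consul example: the address is a made-up value standing in for a discovered node_exporter.
	fmt.Println(scrapeURL("http", "10.0.0.5:9100", "/node_metrics"))
}
```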

Configuration Reload

After editing prometheus.yml, you can reload the configuration without restarting: start Prometheus with --web.enable-lifecycle and send a POST request to /-/reload:

<code>prometheus --config.file=/usr/local/etc/prometheus.yml --web.enable-lifecycle</code>
<code>curl -v -X POST http://localhost:9090/-/reload</code>

The reload handler is implemented in the web module and signals the main loop to reload the config.
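The same reload request can be issued from a program, for instance from a deployment tool that has just rewritten the config file. This is a minimal sketch assuming the server was started with `--web.enable-lifecycle`. `fakePrometheus` is a stand-in server invented for the example so it runs without a real instance.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reload asks a Prometheus server to re-read its configuration file.
func reload(base string) error {
	resp, err := http.Post(base+"/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}

// fakePrometheus is a stand-in for this sketch; it accepts only POST /-/reload.
func fakePrometheus() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/-/reload" {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
}

func main() {
	srv := fakePrometheus()
	defer srv.Close()
	fmt.Println("reload error:", reload(srv.URL))
}
```

Checking the status code matters because a rejected reload request returns a non-200 response rather than a transport error.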

Metric Types

Prometheus stores all metrics as time series but defines four logical types to aid interpretation:

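Counter: a monotonically increasing value that only resets when the process restarts (e.g., request count).

Gauge: a value that can go up or down (e.g., memory usage).

Histogram: observations counted into configurable buckets, typically for latency or size.

Summary: quantiles pre-computed on the client side.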
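The histogram type is the least obvious of the four, because its buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit `+Inf` bucket counts everything. The sketch below is a simplified stand-in, not the client library's implementation, and the bucket bounds and observations are made up.

```go
package main

import "fmt"

// histogram mimics how a Prometheus histogram is exposed: cumulative bucket
// counters keyed by upper bound (le), plus a running sum and count.
type histogram struct {
	bounds []float64 // ascending upper bounds; +Inf is implicit
	counts []uint64  // counts[i] = observations <= bounds[i]; last slot is +Inf
	sum    float64
	count  uint64
}

func newHistogram(bounds ...float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records one value in every bucket whose bound is >= the value.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // every observation falls in le="+Inf"
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram(0.1, 0.5, 1)
	for _, v := range []float64{0.05, 0.3, 0.7, 2} {
		h.observe(v)
	}
	fmt.Println(h.counts, h.sum, h.count)
}
```

Because the buckets are plain counters, they can be summed across instances before a quantile is estimated, which client-side summary quantiles do not allow.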

Exporters and Custom Exporters

Use community exporters for components like MySQL or Kafka, or write a custom exporter with the Go client library:

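<code>package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Expose the default registry in the text exposition format.
    http.Handle("/metrics", promhttp.Handler())
    // ListenAndServe only returns on failure, so surface the error.
    log.Fatal(http.ListenAndServe(":8080", nil))
}
</code>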

Register custom metrics (counter, gauge, histogram, summary) and optionally add labels using NewCounterVec, NewGaugeVec, etc.

<code>// Each metric gets a Help string so its purpose is visible on /metrics.
myCounter := prometheus.NewCounter(prometheus.CounterOpts{Name: "my_counter_total", Help: "custom counter"})
myGauge := prometheus.NewGauge(prometheus.GaugeOpts{Name: "my_gauge", Help: "custom gauge"})
myHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{Name: "my_histogram", Help: "custom histogram", Buckets: []float64{0.1, 0.2, 0.3, 0.4, 0.5}})
// Objectives map each quantile to its allowed absolute error.
mySummary := prometheus.NewSummary(prometheus.SummaryOpts{Name: "my_summary", Help: "custom summary", Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001}})
prometheus.MustRegister(myCounter, myGauge, myHistogram, mySummary)
</code>

PromQL Basics

A PromQL expression evaluates to one of four types: string, scalar, instant vector, or range vector. Examples:

Instant query:

go_gc_duration_seconds_count

Label filter:

go_gc_duration_seconds_count{instance="127.0.0.1:9600"}

Regex filter:

go_gc_duration_seconds_count{instance=~"localhost.*"}

Range query (last 5 minutes):

go_gc_duration_seconds_count[5m]

Common functions include rate() (average per-second increase), irate() (instantaneous rate), and aggregation operators such as sum() by() or sum() without(). Quantile calculation for histograms uses histogram_quantile().
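The difference between rate() and irate() is easiest to see in code. The sketch below is a simplified approximation: it sums the counter increase across a window, treats any drop as a counter reset, and divides by the elapsed time. The real rate() also extrapolates to the window boundaries, which is omitted here. The type, the function names, and the sample data are hypothetical.

```go
package main

import "fmt"

// sample is one scraped counter value.
type sample struct {
	t float64 // unix seconds
	v float64 // counter value
}

// simpleRate approximates rate(): the total counter increase across the window,
// corrected for resets, divided by the time between first and last sample.
func simpleRate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(window); i++ {
		delta := window[i].v - window[i-1].v
		if delta < 0 { // counter reset: it restarted from zero
			delta = window[i].v
		}
		increase += delta
	}
	elapsed := window[len(window)-1].t - window[0].t
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

// simpleIrate uses only the last two samples, like irate().
func simpleIrate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	return simpleRate(window[len(window)-2:])
}

func main() {
	// A one-minute window scraped every 15 seconds, with a reset at t=30.
	w := []sample{{0, 100}, {15, 130}, {30, 10}, {45, 40}, {60, 100}}
	fmt.Println(simpleRate(w), simpleIrate(w))
}
```

Averaging over the whole window makes rate() the smoother choice for alerting, while irate() reacts faster and suits graphs of volatile counters.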

Grafana Visualization

Connect Grafana to Prometheus as a data source, create dashboards, and write PromQL queries in panels to visualise metrics. Dashboards can be exported as JSON for reuse.

Alertmanager Configuration

Define alert rules in a separate file (e.g., alert_rules.yml) and reference it under rule_files in prometheus.yml. The example rule fires when every target of a job named http_srv has been down for one minute:

<code>groups:
- name: simulator-alert-rule
  rules:
  - alert: HttpSimulatorDown
    expr: sum(up{job="http_srv"}) == 0
    for: 1m
    labels:
      severity: critical
</code>
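The `for: 1m` clause means the expression has to hold on every evaluation for a full minute before the alert fires. Until then the alert is pending, and a single evaluation where the expression does not hold resets it. A simplified model of that state machine, with hypothetical names and times in plain seconds, might look like this:

```go
package main

import "fmt"

// alert models one alerting rule with a hold duration, like `for: 1m`.
type alert struct {
	holdSeconds int64  // the rule's `for` duration
	activeSince int64  // evaluation time at which the expression first held
	state       string // "inactive", "pending" or "firing"
}

// evaluate is called once per rule evaluation with the outcome of the expression.
func (a *alert) evaluate(now int64, exprHolds bool) string {
	if !exprHolds {
		a.state = "inactive" // any failed evaluation resets the timer
		return a.state
	}
	if a.state != "pending" && a.state != "firing" {
		a.state, a.activeSince = "pending", now
	}
	if now-a.activeSince >= a.holdSeconds {
		a.state = "firing"
	}
	return a.state
}

func main() {
	a := &alert{holdSeconds: 60}
	// One evaluation every 30 seconds; the job recovers at t=90 and fails again at t=120.
	steps := []struct {
		now  int64
		down bool
	}{{0, true}, {30, true}, {60, true}, {90, false}, {120, true}}
	for _, s := range steps {
		fmt.Println(s.now, a.evaluate(s.now, s.down))
	}
}
```

Only alerts in the firing state are sent on to Alertmanager, so the hold duration trades detection speed for fewer notifications about brief blips.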

Configure Alertmanager to route alerts to email, Slack, etc., and optionally silence alerts via its Web UI.
