
Build a Production‑Ready Prometheus + Grafana Monitoring Stack in Minutes

Learn how to quickly set up a complete, production‑grade monitoring system using Prometheus 3.x and Grafana 11, covering installation, service discovery, PromQL queries, recording rules, Alertmanager routing, Grafana dashboards, best‑practice configurations, and troubleshooting for environments of any size.


Overview

This guide shows how to build a production‑ready monitoring stack with Prometheus 3.x, Alertmanager 0.28, and Grafana 11.5. It covers installation, service discovery, metric collection, alerting, and visualization.

Architecture Overview

+----------------+  alerts   +----------------+  notify   +----------------------+
| Prometheus A/B | --------> |  Alertmanager  | --------> | Email / Webhook / IM |
+----------------+           +----------------+           +----------------------+
        |
        | queries
        v
+----------------+
|   Grafana UI   |
+----------------+

Prometheus scrapes metrics, stores them in a TSDB, and serves queries to Grafana. Alertmanager receives alerts, applies grouping, inhibition, and routing, then forwards them to external channels.

Environment Requirements

OS: Ubuntu 24.04 LTS or Rocky Linux 9.x (Ubuntu recommended)

Prometheus 3.2.x (GA in late 2025) – 15 s scrape interval used throughout this guide; TSDB compression yields roughly 30 % space savings

Grafana 11.5.x – native Prometheus data source

Alertmanager 0.28.x – routing, inhibition, silencing

node_exporter 1.9.x – host metrics

Installation Steps

1. Install Prometheus

# Create system user
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus

# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

# Download and extract binary (v3.2.1 as of 2026‑03)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-amd64.tar.gz
tar xvfz prometheus-3.2.1.linux-amd64.tar.gz

# Install binaries (the tarball extracts into a versioned directory)
sudo cp prometheus-3.2.1.linux-amd64/prometheus prometheus-3.2.1.linux-amd64/promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Note: Prometheus 3.x release tarballs no longer ship the consoles/ and
# console_libraries/ directories, so the copy step found in older 2.x
# guides can be skipped.

# Verify
prometheus --version
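
The steps above install only the binaries; a unit file for Prometheus itself is still needed. A minimal sketch, modeled on the node_exporter and Alertmanager units below (--web.enable-lifecycle is included because the /-/reload endpoint is used later in this guide):

# Systemd service for Prometheus
sudo tee /etc/systemd/system/prometheus.service >/dev/null <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus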

2. Install node_exporter on all targets

# Create node_exporter user
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter

# Download and extract
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.9.0.linux-amd64.tar.gz

# Install binary
sudo cp node_exporter-1.9.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Systemd service
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Verify
curl -s http://localhost:9100/metrics | head -20

3. Install Alertmanager

# Download and extract
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
tar xvfz alertmanager-0.28.1.linux-amd64.tar.gz

# Install binaries
sudo cp alertmanager-0.28.1.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.28.1.linux-amd64/amtool /usr/local/bin/

# Create user and directories
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager

# Systemd service
sudo tee /etc/systemd/system/alertmanager.service >/dev/null <<'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=:9093
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
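
The unit above expects /etc/alertmanager/alertmanager.yml to exist before the service starts. A minimal placeholder config, later replaced by the full routing config in the Alertmanager Configuration section below:

# Minimal starter config so the service can come up
sudo tee /etc/alertmanager/alertmanager.yml >/dev/null <<'EOF'
route:
  receiver: "default"
receivers:
- name: "default"
EOF

sudo systemctl restart alertmanager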

Service Discovery

Prometheus can discover targets via static configs, file‑based SD, or Consul. Choose the method that matches the size of your environment.

Static Config (≤10 nodes)

scrape_configs:
- job_name: "node"
  static_configs:
  - targets: ["192.168.1.10:9100", "192.168.1.11:9100"]
    labels:
      env: "production"
      dc: "bj-01"

File‑SD (10‑200 nodes)

scrape_configs:
- job_name: "node"
  file_sd_configs:
  - files: ["/etc/prometheus/file_sd/nodes.yml"]
    refresh_interval: 30s

Example /etc/prometheus/file_sd/nodes.yml:

- targets:
  - "10.0.1.10:9100"
  - "10.0.1.11:9100"
  labels:
    env: "production"
    group: "web"
    dc: "bj-01"

Consul‑SD (≥200 nodes)

scrape_configs:
- job_name: "node"
  consul_sd_configs:
  - server: "consul.example.com:8500"
    services: []
    tags: ["monitoring"]
  relabel_configs:
  - source_labels: ["__meta_consul_tags"]
    regex: ".*,env=([^,]+),.*"
    target_label: "env"
    replacement: "$1"
  - source_labels: ["__meta_consul_service"]
    target_label: "job"
  - source_labels: ["__meta_consul_dc"]
    target_label: "dc"

PromQL Core Queries

Instant vector:

node_cpu_seconds_total{mode!="idle"}

Range vector (5 min):

rate(node_network_receive_bytes_total[5m])

Aggregation example (CPU utilization %):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
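
The same queries can be run outside the UI through the HTTP API, which is handy for scripting checks (localhost:9090 assumes a local Prometheus):

# Evaluate an instant query via the HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'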

Recording Rules (Pre‑computed Metrics)

# /etc/prometheus/rules/recording_rules.yml
groups:
- name: node_recording_rules
  interval: 15s
  rules:
  - record: instance:node_cpu_utilization:ratio
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  - record: instance:node_memory_utilization:ratio
    expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
  - record: instance:node_filesystem_utilization:ratio
    expr: 1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}
  - record: instance:node_network_receive_bytes:rate5m
    expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))
  - record: instance:node_network_transmit_bytes:rate5m
    expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m]))
  - record: instance:node_disk_io_utilization:ratio
    expr: max by (instance) (rate(node_disk_io_time_seconds_total[5m]))

Validate with

promtool check rules /etc/prometheus/rules/recording_rules.yml

and reload via curl -X POST http://localhost:9090/-/reload (the endpoint is only served when Prometheus runs with --web.enable-lifecycle, as in the unit file above).
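
promtool can also unit-test rules. A minimal sketch for the CPU recording rule: the synthetic series advances the idle counter 9 s per 15 s scrape (60 % idle), so the rule should record 0.4.

# /etc/prometheus/rules/recording_rules_test.yml
# Run with: promtool test rules recording_rules_test.yml
rule_files:
  - recording_rules.yml

evaluation_interval: 15s

tests:
- interval: 15s
  input_series:
  - series: 'node_cpu_seconds_total{instance="node-a", cpu="0", mode="idle"}'
    values: '0+9x60'
  promql_expr_test:
  - expr: instance:node_cpu_utilization:ratio
    eval_time: 10m
    exp_samples:
    - labels: 'instance:node_cpu_utilization:ratio{instance="node-a"}'
      value: 0.4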

Alertmanager Configuration

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.example.com:465"
  smtp_from: "[email protected]"
  smtp_auth_username: "[email protected]"
  smtp_auth_password: "your-smtp-password"
  smtp_require_tls: false

inhibit_rules:
- source_matchers:
  - "alertname=\"HostDown\""
  target_matchers:
  - "severity=~\"warning|info\""
  equal: ["instance"]
- source_matchers:
  - "severity=\"critical\""
  target_matchers:
  - "severity=\"warning\""
  equal: ["alertname", "instance"]

route:
  receiver: "default-email"
  group_by: ["alertname", "instance", "category"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
  - matchers:
    - "severity=\"critical\""
    receiver: "critical-all-channels"
    group_wait: 10s
    repeat_interval: 1h
  - matchers:
    - "category=\"storage\""
    - "severity=\"warning\""
    receiver: "storage-warning"
    repeat_interval: 2h
  - matchers:
    - "severity=\"warning\""
    receiver: "warning-wecom"
    repeat_interval: 4h
  - matchers:
    - "severity=\"info\""
    receiver: "default-email"
    repeat_interval: 12h

receivers:
- name: "default-email"
  email_configs:
  - to: "[email protected]"
    send_resolved: true
    headers:
      subject: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} - {{ .GroupLabels.instance }}"
- name: "critical-all-channels"
  email_configs:
  - to: "[email protected], [email protected]"
    send_resolved: true
  webhook_configs:
  - url: "http://localhost:8060/dingtalk/ops_critical/send"
    send_resolved: true
  - url: "http://localhost:8061/wecom/send"
    send_resolved: true
  - url: "http://localhost:8062/feishu/send"
    send_resolved: true
- name: "storage-warning"
  email_configs:
  - to: "[email protected]"
    send_resolved: true
  webhook_configs:
  - url: "http://localhost:8061/wecom/send"
    send_resolved: true
- name: "warning-wecom"
  webhook_configs:
  - url: "http://localhost:8061/wecom/send"
    send_resolved: true
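
Before restarting the service, validate the file; amtool was installed alongside alertmanager earlier:

amtool check-config /etc/alertmanager/alertmanager.yml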

Grafana Installation & Data Source

Ubuntu 24.04 (APT)

# Install dependencies
sudo apt install -y apt-transport-https software-properties-common wget

# Add GPG key and repo
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install Grafana 11.5
sudo apt update
sudo apt install -y grafana
sudo systemctl daemon-reload
sudo systemctl enable --now grafana-server

Rocky Linux 9 (DNF)

# Add repo file
sudo tee /etc/yum.repos.d/grafana.repo >/dev/null <<'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF

sudo dnf install -y grafana
sudo systemctl enable --now grafana-server

Prometheus Data Source

apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://localhost:9090
  isDefault: true
  editable: true
  jsonData:
    timeInterval: "15s"
    httpMethod: POST
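
Grafana loads this file from its provisioning directory at startup; assuming the default package layout (the file name itself is arbitrary):

# Save the YAML above to the provisioning directory, then restart Grafana
sudo mv prometheus-datasource.yml /etc/grafana/provisioning/datasources/
sudo systemctl restart grafana-server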

Dashboard Design & Import

Import community dashboards by ID (e.g., 1860 for Node Exporter Full, 11074 for a lightweight view). Use the following IDs for common services:

Node Exporter Full – 1860

Node Exporter – 11074

Docker Container Monitoring – 893

MySQL Overview – 7362

Redis Dashboard – 763

Nginx Dashboard – 12708

A custom host‑overview dashboard typically includes panels for CPU, memory, disk usage, and network traffic. Use the template variables $instance, $env, and $dc to filter across environments, as sketched below.
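
The variables can be populated from label values on the node metrics; a sketch of the variable queries, assuming the env/dc labels attached during service discovery:

# Grafana dashboard variable queries (Prometheus data source)
env      -> label_values(node_uname_info, env)
dc       -> label_values(node_uname_info{env="$env"}, dc)
instance -> label_values(node_uname_info{env="$env", dc="$dc"}, instance)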

Full Example Configurations

prometheus.yml (≈50 nodes)

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "prod-bj"
    environment: "production"
    region: "cn-north-1"

rule_files:
  - "/etc/prometheus/rules/recording_rules.yml"
  - "/etc/prometheus/rules/alerts.yml"

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
    - targets: ["localhost:9090"]
      labels:
        component: "monitoring"

  # Alertmanager
  - job_name: "alertmanager"
    static_configs:
    - targets: ["localhost:9093"]
      labels:
        component: "monitoring"

  # Host groups via file_sd (example for web servers)
  - job_name: "node-web"
    file_sd_configs:
    - files: ["/etc/prometheus/file_sd/web_servers.yml"]
      refresh_interval: 30s
    relabel_configs:
    - source_labels: ["__address__"]
      regex: "(.+):\\d+"
      target_label: "hostname"
      replacement: "$1"

  # Additional groups (app, db, cache) follow the same pattern

  # MySQL exporter
  - job_name: "mysql"
    file_sd_configs:
    - files: ["/etc/prometheus/file_sd/mysql_exporters.yml"]
      refresh_interval: 30s
    relabel_configs:
    - source_labels: ["__address__"]
      regex: "(.+):\\d+"
      target_label: "db_host"
      replacement: "$1"

  # Redis exporter
  - job_name: "redis"
    file_sd_configs:
    - files: ["/etc/prometheus/file_sd/redis_exporters.yml"]
      refresh_interval: 30s

  # Nginx exporter
  - job_name: "nginx"
    file_sd_configs:
    - files: ["/etc/prometheus/file_sd/nginx_exporters.yml"]
      refresh_interval: 30s

  # Custom application metrics
  - job_name: "app-metrics"
    metrics_path: "/metrics"
    file_sd_configs:
    - files: ["/etc/prometheus/file_sd/app_metrics.yml"]
      refresh_interval: 30s

  # Blackbox HTTP probing
  - job_name: "blackbox-http"
    metrics_path: "/probe"
    params:
      module: ["http_2xx"]
    static_configs:
    - targets:
      - "https://www.example.com"
      - "https://api.example.com/health"
      - "https://admin.example.com"
    relabel_configs:
    - source_labels: ["__address__"]
      target_label: "__param_target"
    - source_labels: ["__param_target"]
      target_label: "instance"
    - target_label: "__address__"
      replacement: "localhost:9115"
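
Check the whole file, including the referenced rule files, before restarting:

promtool check config /etc/prometheus/prometheus.yml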

alerts.yml (selected rules)

# /etc/prometheus/rules/alerts.yml
groups:
- name: host_basic_alerts
  rules:
  - alert: HostDown
    expr: up{job=~"node.*"} == 0
    for: 1m
    labels:
      severity: critical
      category: availability
    annotations:
      summary: "Host {{ $labels.instance }} is unreachable"
      description: "Job {{ $labels.job }} / instance {{ $labels.instance }} has been down for >1 min."

  - alert: HostHighCpuUsage
    expr: instance:node_cpu_utilization:ratio > 0.85
    for: 5m
    labels:
      severity: warning
      category: performance
    annotations:
      summary: "CPU usage {{ $value | humanizePercentage }}"
      description: "Instance {{ $labels.instance }} CPU >85 % for 5 min."

  - alert: HostCriticalCpuUsage
    expr: instance:node_cpu_utilization:ratio > 0.95
    for: 3m
    labels:
      severity: critical
      category: performance
    annotations:
      summary: "CPU usage critical {{ $value | humanizePercentage }}"
      description: "Instance {{ $labels.instance }} CPU >95 % for 3 min."

  - alert: HostHighMemoryUsage
    expr: instance:node_memory_utilization:ratio > 0.85
    for: 5m
    labels:
      severity: warning
      category: performance
    annotations:
      summary: "Memory usage {{ $value | humanizePercentage }}"
      description: "Instance {{ $labels.instance }} memory >85 % for 5 min."

  - alert: HostCriticalMemoryUsage
    expr: instance:node_memory_utilization:ratio > 0.95
    for: 3m
    labels:
      severity: critical
      category: performance
    annotations:
      summary: "Memory critical"
      description: "Instance {{ $labels.instance }} memory >95 % for 3 min."

- name: host_disk_alerts
  rules:
  - alert: HostDiskWarning
    expr: instance:node_filesystem_utilization:ratio > 0.80
    for: 5m
    labels:
      severity: warning
      category: storage
    annotations:
      summary: "Disk usage {{ $value | humanizePercentage }}"
      description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} >80 %"

  - alert: HostDiskCritical
    expr: instance:node_filesystem_utilization:ratio > 0.90
    for: 3m
    labels:
      severity: critical
      category: storage
    annotations:
      summary: "Disk critical"
      description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} >90 %"

  - alert: HostDiskWillFillIn24Hours
    expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[6h], 24*3600) < 0
    for: 30m
    labels:
      severity: warning
      category: storage
    annotations:
      summary: "Disk expected to fill in 24 h"
      description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} will run out of space within 24 h."

- name: host_network_alerts
  rules:
  - alert: HostHighNetworkIn
    expr: instance:node_network_receive_bytes:rate5m > 100*1024*1024
    for: 5m
    labels:
      severity: warning
      category: network
    annotations:
      summary: "Inbound traffic high {{ $value | humanize }} B/s"
      description: "Instance {{ $labels.instance }} inbound >100 MiB/s for 5 min."

  - alert: HostHighNetworkOut
    expr: instance:node_network_transmit_bytes:rate5m > 100*1024*1024
    for: 5m
    labels:
      severity: warning
      category: network
    annotations:
      summary: "Outbound traffic high {{ $value | humanize }} B/s"
      description: "Instance {{ $labels.instance }} outbound >100 MiB/s for 5 min."

  - alert: HostNetworkInterfaceDown
    expr: node_network_up{device!~"lo|veth.*|docker.*|br.*"} == 0
    for: 2m
    labels:
      severity: critical
      category: network
    annotations:
      summary: "Network interface {{ $labels.device }} down"
      description: "Instance {{ $labels.instance }} interface {{ $labels.device }} down >2 min."

Best Practices & Pitfalls

Storage Optimization

Default retention is 15 days. For production set --storage.tsdb.retention.time=30d and optionally --storage.tsdb.retention.size=50GB.

Estimated space:

disk_bytes ≈ (retention_days × 86400 / scrape_interval_seconds) × samples_per_scrape × 2 bytes

Real usage is typically ~60‑70 % of the estimate thanks to compression.

For long‑term storage use remote_write to VictoriaMetrics or Thanos.

# Example remote_write to VictoriaMetrics
remote_write:
- url: "http://victoriametrics:8428/api/v1/write"
  queue_config:
    max_samples_per_send: 10000
    batch_send_deadline: 5s
    max_shards: 30

High Availability

Run two identical Prometheus instances that scrape the same targets, each with a distinct external_labels.replica value, and put a load balancer or DNS round‑robin in front of their query endpoints for Grafana.
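
A sketch of the per‑replica labels; everything else in prometheus.yml stays identical on both nodes:

# prometheus-01
global:
  external_labels:
    cluster: "prod-bj"
    replica: "A"

# prometheus-02
global:
  external_labels:
    cluster: "prod-bj"
    replica: "B"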

Alertmanager HA via --cluster.listen-address and --cluster.peer to deduplicate alerts.

# Start the first Alertmanager node; run the mirror command on
# alertmanager-02 with --cluster.peer pointing back at alertmanager-01
alertmanager --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-02:9094

Security

Enable BasicAuth in /etc/prometheus/web.yml (hash generated with htpasswd -nBC 10 admin) and start Prometheus with --web.config.file=/etc/prometheus/web.yml so the file is actually read.

Enable TLS in the same file, or terminate TLS with an Nginx reverse proxy instead.

# /etc/prometheus/web.yml
basic_auth_users:
  admin: "$2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

tls_server_config:
  cert_file: /etc/prometheus/tls/server.crt
  key_file: /etc/prometheus/tls/server.key

Common Pitfalls

Do not use high‑cardinality labels (e.g., user IDs, full URLs). Keep the number of distinct values per label in the low hundreds.

Scrape interval of 15 s is a good balance; shorter intervals increase storage and query load.

Monitor Prometheus self‑metrics (e.g., scrape_duration_seconds, prometheus_tsdb_head_series) to detect OOM or performance issues.
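
A sketch of an alert on series growth; the 2,000,000 threshold is an assumption and should be sized to the instance's memory:

- alert: PrometheusHighSeriesCount
  expr: prometheus_tsdb_head_series > 2000000  # threshold is an assumption
  for: 15m
  labels:
    severity: warning
    category: monitoring
  annotations:
    summary: "Prometheus head series count high"
    description: "{{ $labels.instance }} holds {{ $value }} in-memory series; check for a cardinality explosion."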

Troubleshooting

Target Down

Check network connectivity, firewall rules, SELinux, and exporter status with curl http://host:9100/metrics.

Prometheus OOM

Inspect prometheus_tsdb_head_series and scrape_samples_scraped. Use metric_relabel_configs to drop offending high‑cardinality series or delete them via the admin API.
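
A sketch of the drop rule, with a placeholder metric name:

# metric_relabel_configs inside the offending scrape job in prometheus.yml
metric_relabel_configs:
- source_labels: ["__name__"]
  regex: "problematic_metric_.*"   # placeholder; match your noisy metric
  action: drop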

# Delete a problematic metric (requires Prometheus to be started with
# --web.enable-admin-api)
curl -X POST "http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=\"problematic_metric\"}"
curl -X POST "http://localhost:9090/api/v1/admin/tsdb/clean_tombstones"

Alert Not Firing

Validate rule loading (promtool check rules), test the expression in the Prometheus UI, and verify routing with amtool config routes test.

# Test routing for a critical alert
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical alertname=HostDown

Appendix

PromQL Cheat Sheet

# Instant vector
metric_name{label="value"}

# Range vector (5 min)
metric_name[5m]

# Rate (per second)
rate(counter[5m])

# Increase over 1 h
increase(counter[1h])

# Aggregation
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Top‑k
topk(5, instance:node_cpu_utilization:ratio)

# Histogram quantile (P99)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Predict linear growth (disk fill)
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)

# Absent (detect missing series)
absent(up{job="node", instance="10.0.1.10:9100"})

Common Exporter List

node_exporter – 9100 – Linux host metrics

windows_exporter – 9182 – Windows host metrics

mysqld_exporter – 9104 – MySQL

redis_exporter – 9121 – Redis

postgres_exporter – 9187 – PostgreSQL

mongodb_exporter – 9216 – MongoDB

kafka_exporter – 9308 – Kafka

nginx-prometheus-exporter – 9113 – Nginx

blackbox_exporter – 9115 – HTTP probing

process_exporter – 9256 – Process metrics

snmp_exporter – 9116 – SNMP devices

Glossary

Time series : a metric name plus a unique set of labels.

Instant vector : latest sample of each series at a single timestamp.

Range vector : all samples of a series over a time window.

Label : key‑value pair that differentiates dimensions (e.g., instance, env).

Cardinality : number of unique label values; high cardinality hurts performance.

Scrape : Prometheus pulling metrics from a target.

Exporter : component that exposes third‑party metrics in Prometheus format.

Recording rule : pre‑computed query stored as a new metric.

Inhibition : suppress lower‑severity alerts when a higher‑severity alert is active.

Silence : temporary mute of matching alerts.

Federation : hierarchical aggregation of multiple Prometheus instances.

Tags: cloud-native, alerting, Grafana
Written by Ops Community, a leading IT operations community where professionals share and grow together.