Build a Production‑Ready Prometheus + Grafana Monitoring Stack in Minutes
Learn how to set up a complete, production‑grade monitoring system with Prometheus 3.x and Grafana 11. The guide covers installation, service discovery, PromQL queries, recording rules, Alertmanager routing, Grafana dashboards, best‑practice configuration, and troubleshooting for environments of any size.
Overview
This guide shows how to build a production‑ready monitoring stack with Prometheus 3.x, Alertmanager 0.28, and Grafana 11.5. It covers installation, service discovery, metric collection, alerting, and visualization.
Architecture Overview
+--------------+        +--------------+        +---------------+
| Prometheus 1 |---+--->| Alertmanager |------->| Notification  |
+--------------+   |    +--------------+        |   channels    |
                   |                            +---------------+
+--------------+   |    +--------------+
| Prometheus 2 |---+--->|  Grafana UI  |
+--------------+        +--------------+
Prometheus scrapes metrics, stores them in its TSDB, and serves queries to Grafana. Alertmanager receives alerts, applies grouping, inhibition, and routing, then forwards them to external notification channels.
Environment Requirements
OS: Ubuntu 24.04 LTS or Rocky Linux 9.x (Ubuntu recommended)
Prometheus 3.2.x (GA as of late 2025) – 15 s scrape interval; TSDB compression saves roughly 30 % on disk
Grafana 11.5.x – native Prometheus data source
Alertmanager 0.28.x – routing, inhibition, silencing
node_exporter 1.9.x – host metrics
Installation Steps
1. Install Prometheus
# Create system user
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
# Download and extract binary (v3.2.1 as of 2026‑03)
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v3.2.1/prometheus-3.2.1.linux-amd64.tar.gz
tar xvfz prometheus-3.2.1.linux-amd64.tar.gz
# Install binaries (they live inside the extracted directory)
cd prometheus-3.2.1.linux-amd64
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# Copy console files (if present; recent 3.x tarballs may no longer ship them)
sudo cp -r consoles console_libraries /etc/prometheus/
# Verify
prometheus --version
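The tarball does not install a service, so create one. A minimal sketch mirroring the exporter units later in this guide; --web.enable-lifecycle is what enables the /-/reload endpoint used for config reloads below:
sudo tee /etc/systemd/system/prometheus.service >/dev/null <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --web.enable-lifecycle
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
# Enable once /etc/prometheus/prometheus.yml is in place (see Full Example Configurations)
sudo systemctl enable --now prometheus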
2. Install node_exporter on all targets
# Create node_exporter user
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter
# Download and extract
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.9.0/node_exporter-1.9.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.9.0.linux-amd64.tar.gz
# Install binary
sudo cp node_exporter-1.9.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Systemd service
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Verify
curl -s http://localhost:9100/metrics | head -20
3. Install Alertmanager
# Download and extract
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz
tar xvfz alertmanager-0.28.1.linux-amd64.tar.gz
# Install binaries
sudo cp alertmanager-0.28.1.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.28.1.linux-amd64/amtool /usr/local/bin/
# Create user and directories
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
# Systemd service
sudo tee /etc/systemd/system/alertmanager.service >/dev/null <<'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=:9093
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
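A quick sanity check before moving on (the /-/ready endpoint is part of Alertmanager's built‑in HTTP API):
# Verify
curl -s http://localhost:9093/-/ready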
Service Discovery
Prometheus can discover targets via static configs, file‑based SD, or Consul. Choose the method that matches the size of your environment.
Static Config (≤10 nodes)
scrape_configs:
- job_name: "node"
static_configs:
- targets: ["192.168.1.10:9100", "192.168.1.11:9100"]
labels:
env: "production"
dc: "bj-01"File‑SD (10‑200 nodes)
scrape_configs:
- job_name: "node"
file_sd_configs:
- files: ["/etc/prometheus/file_sd/nodes.yml"]
        refresh_interval: 30s
Example /etc/prometheus/file_sd/nodes.yml:
- targets:
- "10.0.1.10:9100"
- "10.0.1.11:9100"
labels:
env: "production"
group: "web"
dc: "bj-01"Consul‑SD (≥200 nodes)
scrape_configs:
- job_name: "node"
consul_sd_configs:
- server: "consul.example.com:8500"
services: []
tags: ["monitoring"]
relabel_configs:
- source_labels: ["__meta_consul_tags"]
regex: ".*,env=([^,]+),.*"
target_label: "env"
replacement: "$1"
- source_labels: ["__meta_consul_service"]
target_label: "job"
- source_labels: ["__meta_consul_dc"]
target_label: "dc"PromQL Core Queries
Instant vector: node_cpu_seconds_total{mode!="idle"}
Range vector (5 min): rate(node_network_receive_bytes_total[5m])
Aggregation example (CPU usage % per instance; * binds tighter than -, so the average is scaled before the subtraction):
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
Recording Rules (Pre‑computed Metrics)
# /etc/prometheus/rules/recording_rules.yml
groups:
- name: node_recording_rules
interval: 15s
rules:
- record: instance:node_cpu_utilization:ratio
expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
- record: instance:node_memory_utilization:ratio
expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
- record: instance:node_filesystem_utilization:ratio
expr: 1 - node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"}
- record: instance:node_network_receive_bytes:rate5m
expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))
- record: instance:node_network_transmit_bytes:rate5m
expr: sum by (instance) (rate(node_network_transmit_bytes_total{device!="lo"}[5m]))
- record: instance:node_disk_io_utilization:ratio
        expr: max by (instance) (rate(node_disk_io_time_seconds_total[5m]))
Validate with promtool check rules /etc/prometheus/rules/recording_rules.yml and reload with curl -X POST http://localhost:9090/-/reload (requires Prometheus to run with --web.enable-lifecycle).
Alertmanager Configuration
# /etc/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
smtp_smarthost: "smtp.example.com:465"
smtp_from: "[email protected]"
smtp_auth_username: "[email protected]"
smtp_auth_password: "your-smtp-password"
smtp_require_tls: false
inhibit_rules:
- source_matchers:
- "alertname=\"HostDown\""
target_matchers:
- "severity=~\"warning|info\""
equal: ["instance"]
- source_matchers:
- "severity=\"critical\""
target_matchers:
- "severity=\"warning\""
equal: ["alertname", "instance"]
route:
receiver: "default-email"
group_by: ["alertname", "instance", "category"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- matchers:
- "severity=\"critical\""
receiver: "critical-all-channels"
group_wait: 10s
repeat_interval: 1h
- matchers:
- "category=\"storage\""
- "severity=\"warning\""
receiver: "storage-warning"
repeat_interval: 2h
- matchers:
- "severity=\"warning\""
receiver: "warning-wecom"
repeat_interval: 4h
- matchers:
- "severity=\"info\""
receiver: "default-email"
repeat_interval: 12h
receivers:
- name: "default-email"
email_configs:
- to: "[email protected]"
send_resolved: true
headers:
subject: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} - {{ .GroupLabels.instance }}"
- name: "critical-all-channels"
email_configs:
- to: "[email protected], [email protected]"
send_resolved: true
webhook_configs:
- url: "http://localhost:8060/dingtalk/ops_critical/send"
send_resolved: true
- url: "http://localhost:8061/wecom/send"
send_resolved: true
- url: "http://localhost:8062/feishu/send"
send_resolved: true
- name: "storage-warning"
email_configs:
- to: "[email protected]"
send_resolved: true
webhook_configs:
- url: "http://localhost:8061/wecom/send"
send_resolved: true
- name: "warning-wecom"
webhook_configs:
- url: "http://localhost:8061/wecom/send"
        send_resolved: true
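Validate the file and apply it without downtime (amtool ships alongside Alertmanager, and Alertmanager exposes its own /-/reload endpoint):
amtool check-config /etc/alertmanager/alertmanager.yml
curl -X POST http://localhost:9093/-/reload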
Grafana Installation & Data Source
Ubuntu 24.04 (APT)
# Install dependencies
sudo apt install -y apt-transport-https software-properties-common wget
# Add GPG key and repo
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Install Grafana 11.5
sudo apt update
sudo apt install -y grafana
sudo systemctl daemon-reload
sudo systemctl enable --now grafana-server
Rocky Linux 9 (DNF)
# Add repo file
sudo tee /etc/yum.repos.d/grafana.repo >/dev/null <<'EOF'
[grafana]
name=grafana
baseurl=https://rpm.grafana.com
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://rpm.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
sudo dnf install -y grafana
sudo systemctl enable --now grafana-server
Prometheus Data Source
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://localhost:9090
isDefault: true
editable: true
jsonData:
timeInterval: "15s"
      httpMethod: POST
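With the APT/DNF packages, provisioned data sources are read from /etc/grafana/provisioning/datasources/. A sketch, assuming you saved the YAML above as datasource.yml (the target filename is arbitrary):
sudo cp datasource.yml /etc/grafana/provisioning/datasources/prometheus.yml
sudo systemctl restart grafana-server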
Dashboard Design & Import
Import community dashboards by ID (e.g., 1860 for Node Exporter Full, 11074 for a lightweight view). Use the following IDs for common services:
Node Exporter Full – 1860
Node Exporter – 11074
Docker Container Monitoring – 893
MySQL Overview – 7362
Redis Dashboard – 763
Nginx Dashboard – 12708
A custom host‑overview dashboard typically includes panels for CPU, memory, disk usage, and network traffic. Use variables $instance, $env, and $dc to filter across environments, as sketched below.
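A sketch of the dashboard variables, defined under Dashboard settings → Variables with Grafana's label_values() helper (node_uname_info is just a convenient low‑cardinality metric to key off; the chained filters are an assumption about your label scheme):
env      : label_values(node_uname_info, env)
dc       : label_values(node_uname_info{env="$env"}, dc)
instance : label_values(node_uname_info{env="$env", dc="$dc"}, instance)
# Example CPU panel query using the variables
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])))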
Full Example Configurations
prometheus.yml (≈50 nodes)
# /etc/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: "prod-bj"
environment: "production"
region: "cn-north-1"
rule_files:
- "/etc/prometheus/rules/recording_rules.yml"
- "/etc/prometheus/rules/alerts.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ["localhost:9093"]
scrape_configs:
# Prometheus itself
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
labels:
component: "monitoring"
# Alertmanager
- job_name: "alertmanager"
static_configs:
- targets: ["localhost:9093"]
labels:
component: "monitoring"
# Host groups via file_sd (example for web servers)
- job_name: "node-web"
file_sd_configs:
- files: ["/etc/prometheus/file_sd/web_servers.yml"]
refresh_interval: 30s
relabel_configs:
- source_labels: ["__address__"]
regex: "(.+):\\d+"
target_label: "hostname"
replacement: "$1"
# Additional groups (app, db, cache) follow the same pattern
# MySQL exporter
- job_name: "mysql"
file_sd_configs:
- files: ["/etc/prometheus/file_sd/mysql_exporters.yml"]
refresh_interval: 30s
relabel_configs:
- source_labels: ["__address__"]
regex: "(.+):\\d+"
target_label: "db_host"
replacement: "$1"
# Redis exporter
- job_name: "redis"
file_sd_configs:
- files: ["/etc/prometheus/file_sd/redis_exporters.yml"]
refresh_interval: 30s
# Nginx exporter
- job_name: "nginx"
file_sd_configs:
- files: ["/etc/prometheus/file_sd/nginx_exporters.yml"]
refresh_interval: 30s
# Custom application metrics
- job_name: "app-metrics"
metrics_path: "/metrics"
file_sd_configs:
- files: ["/etc/prometheus/file_sd/app_metrics.yml"]
refresh_interval: 30s
# Blackbox HTTP probing
- job_name: "blackbox-http"
metrics_path: "/probe"
params:
module: ["http_2xx"]
static_configs:
- targets:
- "https://www.example.com"
- "https://api.example.com/health"
- "https://admin.example.com"
relabel_configs:
- source_labels: ["__address__"]
target_label: "__param_target"
- source_labels: ["__param_target"]
target_label: "instance"
- target_label: "__address__"
replacement: "localhost:9115"alerts.yml (selected rules)
alerts.yml (selected rules)
# /etc/prometheus/rules/alerts.yml
groups:
- name: host_basic_alerts
rules:
- alert: HostDown
expr: up{job=~"node.*"} == 0
for: 1m
labels:
severity: critical
category: availability
annotations:
summary: "Host {{ $labels.instance }} is unreachable"
description: "Job {{ $labels.job }} / instance {{ $labels.instance }} has been down for >1 min."
- alert: HostHighCpuUsage
expr: instance:node_cpu_utilization:ratio > 0.85
for: 5m
labels:
severity: warning
category: performance
annotations:
summary: "CPU usage {{ $value | humanizePercentage }}"
description: "Instance {{ $labels.instance }} CPU >85 % for 5 min."
- alert: HostCriticalCpuUsage
expr: instance:node_cpu_utilization:ratio > 0.95
for: 3m
labels:
severity: critical
category: performance
annotations:
summary: "CPU usage critical {{ $value | humanizePercentage }}"
description: "Instance {{ $labels.instance }} CPU >95 % for 3 min."
- alert: HostHighMemoryUsage
expr: instance:node_memory_utilization:ratio > 0.85
for: 5m
labels:
severity: warning
category: performance
annotations:
summary: "Memory usage {{ $value | humanizePercentage }}"
description: "Instance {{ $labels.instance }} memory >85 % for 5 min."
- alert: HostCriticalMemoryUsage
expr: instance:node_memory_utilization:ratio > 0.95
for: 3m
labels:
severity: critical
category: performance
annotations:
summary: "Memory critical"
description: "Instance {{ $labels.instance }} memory >95 % for 3 min."
- name: host_disk_alerts
rules:
- alert: HostDiskWarning
expr: instance:node_filesystem_utilization:ratio > 0.80
for: 5m
labels:
severity: warning
category: storage
annotations:
summary: "Disk usage {{ $value | humanizePercentage }}"
description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} >80 %"
- alert: HostDiskCritical
expr: instance:node_filesystem_utilization:ratio > 0.90
for: 3m
labels:
severity: critical
category: storage
annotations:
summary: "Disk critical"
description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} >90 %"
- alert: HostDiskWillFillIn24Hours
expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[6h], 24*3600) < 0
for: 30m
labels:
severity: warning
category: storage
annotations:
summary: "Disk expected to fill in 24 h"
description: "Instance {{ $labels.instance }} mount {{ $labels.mountpoint }} will run out of space within 24 h."
- name: host_network_alerts
rules:
- alert: HostHighNetworkIn
expr: instance:node_network_receive_bytes:rate5m > 100*1024*1024
for: 5m
labels:
severity: warning
category: network
annotations:
summary: "Inbound traffic high {{ $value | humanize }} B/s"
description: "Instance {{ $labels.instance }} inbound >100 MiB/s for 5 min."
- alert: HostHighNetworkOut
expr: instance:node_network_transmit_bytes:rate5m > 100*1024*1024
for: 5m
labels:
severity: warning
category: network
annotations:
summary: "Outbound traffic high {{ $value | humanize }} B/s"
description: "Instance {{ $labels.instance }} outbound >100 MiB/s for 5 min."
- alert: HostNetworkInterfaceDown
expr: node_network_up{device!~"lo|veth.*|docker.*|br.*"} == 0
for: 2m
labels:
severity: critical
category: network
annotations:
summary: "Network interface {{ $labels.device }} down"
description: "Instance {{ $labels.instance }} interface {{ $labels.device }} down >2 min."Best Practices & Pitfalls
Storage Optimization
Default retention is 15 days. For production set --storage.tsdb.retention.time=30d and optionally --storage.tsdb.retention.size=50GB.
Estimated space:
needed_bytes ≈ (samples_per_scrape / scrape_interval_seconds) × 2 bytes × retention_days × 86,400
Real usage is typically 60‑70 % of the estimate thanks to compression.
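A worked example with assumed figures — 50 nodes exposing ~1,000 series each, 15 s scrape interval, 30 days retention:
samples per second ≈ 50 × 1,000 / 15            ≈ 3,333
raw estimate       ≈ 3,333 × 2 B × 30 × 86,400  ≈ 17.3 GB
expected on disk   ≈ 60‑70 % of that            ≈ 10‑12 GB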
For long‑term storage use remote_write to VictoriaMetrics or Thanos.
# Example remote_write to VictoriaMetrics
remote_write:
- url: "http://victoriametrics:8428/api/v1/write"
queue_config:
max_samples_per_send: 10000
batch_send_deadline: 5s
      max_shards: 30
High Availability
Run two identical Prometheus instances that differ only in an external_labels replica value, as sketched below. Put a load balancer or DNS round‑robin in front of them for query traffic.
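A sketch of the only difference between the two replica configs (the label name replica is conventional, not mandated):
# prometheus-a.yml — use replica: "B" on the second instance
global:
  external_labels:
    cluster: "prod-bj"
    replica: "A"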
For Alertmanager HA, cluster the nodes with --cluster.listen-address and --cluster.peer so notifications are deduplicated across replicas.
# Start two Alertmanager nodes
alertmanager --config.file=/etc/alertmanager/alertmanager.yml \
--cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-02:9094
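Each Prometheus replica should list every Alertmanager node so the cluster can deduplicate notifications (hostnames below are assumptions):
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager-01:9093", "alertmanager-02:9093"]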
Security
Enable basic auth in /etc/prometheus/web.yml (generate the password hash with htpasswd -nBC 10 admin).
Enable TLS in the same file, or terminate TLS with an Nginx reverse proxy.
# /etc/prometheus/web.yml
basic_auth_users:
admin: "$2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
tls_server_config:
cert_file: /etc/prometheus/tls/server.crt
  key_file: /etc/prometheus/tls/server.key
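The web config only takes effect when Prometheus is started with --web.config.file. A sketch of the amended ExecStart, assuming the unit from step 1:
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.enable-lifecycle \
  --web.config.file=/etc/prometheus/web.yml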
Common Pitfalls
Do not use high‑cardinality labels (e.g., user IDs, full URLs); keep per‑metric cardinality in the low hundreds.
A 15 s scrape interval is a good balance; shorter intervals increase storage and query load.
Monitor Prometheus's own metrics (e.g., scrape_duration_seconds, prometheus_tsdb_head_series) to catch OOM and performance issues early.
Troubleshooting
Target Down
Check network connectivity, firewall rules, SELinux, and exporter status with curl http://host:9100/metrics.
Prometheus OOM
Inspect prometheus_tsdb_head_series and scrape_samples_scraped. Use metric_relabel_configs to drop offending high‑cardinality series or delete them via the admin API.
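A sketch of the relabel approach, placed under the affected scrape job in prometheus.yml (the job name and metric name problematic_metric are hypothetical):
- job_name: "app-metrics"
  # ...existing file_sd settings...
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: "problematic_metric"
      action: drop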
# Delete a problematic metric (requires Prometheus to run with --web.enable-admin-api)
curl -X POST "http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=\"problematic_metric\"}"
curl -X POST "http://localhost:9090/api/v1/admin/tsdb/clean_tombstones"Alert Not Firing
Validate rule loading (promtool check rules), test the expression in the Prometheus UI, and verify routing with amtool config routes test.
# Test routing for a critical alert
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical alertname=HostDown
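To exercise delivery end to end, inject a synthetic alert with amtool alert add (the label values here are arbitrary test data):
amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=TestAlert severity=warning instance=test-host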
Appendix
PromQL Cheat Sheet
# Instant vector
metric_name{label="value"}
# Range vector (5 min)
metric_name[5m]
# Rate (per second)
rate(counter[5m])
# Increase over 1 h
increase(counter[1h])
# Aggregation
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
# Top‑k
topk(5, instance:node_cpu_utilization:ratio)
# Histogram quantile (P99)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Predict linear growth (disk fill)
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)
# Absent (detect missing series)
absent(up{job="node", instance="10.0.1.10:9100"})
Common Exporter List
node_exporter – 9100 – Linux host metrics
windows_exporter – 9182 – Windows host metrics
mysqld_exporter – 9104 – MySQL
redis_exporter – 9121 – Redis
postgres_exporter – 9187 – PostgreSQL
mongodb_exporter – 9216 – MongoDB
kafka_exporter – 9308 – Kafka
nginx‑prometheus‑exporter – 9113 – Nginx
blackbox_exporter – 9115 – HTTP probing
process_exporter – 9256 – Process metrics
snmp_exporter – 9116 – SNMP devices
Glossary
Time series : a metric name plus a unique set of labels.
Instant vector : latest sample of each series at a single timestamp.
Range vector : all samples of a series over a time window.
Label : key‑value pair that differentiates dimensions (e.g., instance, env).
Cardinality : number of unique label values; high cardinality hurts performance.
Scrape : Prometheus pulling metrics from a target.
Exporter : component that exposes third‑party metrics in Prometheus format.
Recording rule : pre‑computed query stored as a new metric.
Inhibition : suppress lower‑severity alerts when a higher‑severity alert is active.
Silence : temporary mute of matching alerts.
Federation : hierarchical aggregation of multiple Prometheus instances.