Master Prometheus Alerting: Write Rules and Configure Alertmanager for Reliable Notifications
This guide walks through the fundamentals of Prometheus alerting: writing PromQL-driven alert rules, configuring Alertmanager routing, grouping, inhibition and silencing, wiring up DingTalk and Enterprise WeChat webhooks, and building a tiered alert strategy, along with performance tuning, security hardening, high-availability deployment, troubleshooting, and backup/restore procedures.
Overview
Prometheus alone only collects metrics; without a notification layer the monitoring loop is incomplete. Alertmanager provides de‑duplication, grouping, routing, inhibition and silencing, turning raw metric thresholds into actionable alerts that reach the right people.
Supported Scenarios
Infrastructure: host CPU, memory, disk, network, node exporter health.
Application services: HTTP latency, error rate, JVM heap/GC, service availability.
Business indicators: order volume, payment success rate, user registration trends.
Environment Requirements
Prometheus >= 2.45
Alertmanager >= 0.27 (cluster mode fixes)
Linux (CentOS 7+ or Ubuntu 20.04+), 1 CPU + 1 GB RAM is sufficient for the Alertmanager process.
Network ports: 9093 (Alertmanager API/Web UI) and 9094 (cluster gossip, TCP + UDP).
At least one outbound notification channel (e‑mail, DingTalk, Enterprise WeChat, generic webhook).
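If a host firewall is enabled, the two ports above must be reachable before installation. A minimal sketch assuming firewalld (adjust for ufw, iptables, or cloud security groups):
# Open the Alertmanager web/API port and the cluster gossip port (TCP + UDP)
sudo firewall-cmd --permanent --add-port=9093/tcp
sudo firewall-cmd --permanent --add-port=9094/tcp
sudo firewall-cmd --permanent --add-port=9094/udp
sudo firewall-cmd --reload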
Installation
Alertmanager binary
# Create a non‑login user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Download the official release (example v0.27.0)
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64
# Install binaries
sudo cp alertmanager /usr/local/bin/
sudo cp amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
# Create configuration and data directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# Verify version
alertmanager --version
Systemd service
sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager \
--web.listen-address=0.0.0.0:9093 \
--web.external-url=http://alertmanager.example.com:9093 \
--cluster.listen-address=0.0.0.0:9094 \
--log.level=info \
--data.retention=120h
Restart=always
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
Parameter notes: --data.retention=120h keeps Alertmanager state (silences, notification history) for five days. --cluster.listen-address enables gossip clustering; omit it for a single-node deployment. --web.external-url must point to a reachable address, because the links embedded in notifications are built from this URL.
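Alertmanager only dispatches what Prometheus sends it, so Prometheus must also be told where Alertmanager listens and which rule files to load. A minimal prometheus.yml excerpt under the assumptions used in this guide (rules under /etc/prometheus/rules/, Alertmanager at the example hostname):
# /etc/prometheus/prometheus.yml (excerpt)
global:
  evaluation_interval: 15s          # how often alerting/recording rules are evaluated
rule_files:
  - /etc/prometheus/rules/*.yml     # node_alerts.yml, app_alerts.yml, middleware_alerts.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.example.com:9093   # list every Alertmanager instance when running a cluster
Reload Prometheus afterwards (SIGHUP, or POST /-/reload if --web.enable-lifecycle is enabled).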
Core Configuration
alertmanager.yml
# /etc/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
smtp_from: '[email protected]'
smtp_smarthost: 'smtp.example.com:465'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'YOUR_SMTP_PASSWORD'
smtp_require_tls: false
templates:
- '/etc/alertmanager/templates/*.tmpl'
route:
group_by: ['alertname', 'cluster']
group_wait: 30s # wait for a short burst before sending
group_interval: 5m # minimum interval between notifications for the same group
repeat_interval: 4h # repeat non‑resolved alerts every 4 h
receiver: 'default-webhook'
routes:
- match:
severity: critical
receiver: 'critical-pager'
group_wait: 10s
repeat_interval: 1h
continue: false
- match:
severity: warning
receiver: 'warning-dingtalk'
group_wait: 30s
repeat_interval: 4h
continue: false
- match_re:
job: '(mysql|redis|mongodb).*'
receiver: 'dba-dingtalk'
group_wait: 30s
repeat_interval: 2h
continue: false
- match:
team: order
receiver: 'order-team-webhook'
continue: false
- match:
team: payment
receiver: 'payment-team-webhook'
continue: false
inhibit_rules:
- source_match:
alertname: 'NodeDown'
target_match_re:
alertname: '.+'
equal: ['instance']
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
receivers:
- name: 'default-webhook'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/ops/send'
send_resolved: true
- name: 'critical-pager'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/critical/send'
send_resolved: true
- url: 'http://oncall-api.internal:8080/api/v1/alert'
send_resolved: true
email_configs:
- to: '[email protected]'
send_resolved: true
headers:
Subject: '[P0-CRITICAL] {{ .GroupLabels.alertname }}'
- name: 'warning-dingtalk'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/warning/send'
send_resolved: true
- name: 'dba-dingtalk'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/dba/send'   # the DBA robot's access token and signing secret belong in the DingTalk webhook adapter, not in Alertmanager
send_resolved: true
- name: 'order-team-webhook'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/order/send'
send_resolved: true
- name: 'payment-team-webhook'
webhook_configs:
- url: 'http://localhost:8060/dingtalk/payment/send'
send_resolved: true
The group_by list deliberately excludes instance, so a network outage on a single host does not generate dozens of separate notifications. Every alert carries a severity label; the first two routes match on it, while the remaining routes match job and team labels.
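After every change to alertmanager.yml it is worth validating and reloading rather than restarting; for example:
# Validate configuration and referenced templates
amtool check-config /etc/alertmanager/alertmanager.yml
# Apply without a restart – Alertmanager reloads on SIGHUP or POST /-/reload
sudo systemctl kill -s HUP alertmanager
# alternatively: curl -X POST http://localhost:9093/-/reload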
Alert rule files
Node‑level rules (node_alerts.yml)
# /etc/prometheus/rules/node_alerts.yml
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node-exporter"} == 0
for: 2m
labels:
severity: critical
team: ops
annotations:
summary: "Node {{ $labels.instance }} down"
description: "No scrape for 2 minutes – investigate network or host failure."
- alert: NodeCPUHigh
expr: |
1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "{{ $labels.instance }} CPU usage {{ $value | humanizePercentage }}"
description: "CPU > 85 % for 5 minutes. Check for runaway processes."
- alert: NodeMemoryHigh
expr: |
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.90
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "{{ $labels.instance }} memory {{ $value | humanizePercentage }}"
- alert: NodeDiskAlmostFull
expr: |
1 - node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} /
node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"} > 0.85
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "{{ $labels.instance }} disk {{ $value | humanizePercentage }}"
- alert: NodeDiskWillFull
expr: |
predict_linear(node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}[6h], 24*3600) < 0
for: 10m
labels:
severity: warning
team: ops
annotations:
summary: "{{ $labels.instance }} disk predicted to fill in 24 h"
description: "Current write rate extrapolates to zero free space within a day."
- alert: NodeNetworkErrors
expr: |
rate(node_network_receive_errs_total[5m]) > 10 or
rate(node_network_transmit_errs_total[5m]) > 10
for: 5m
labels:
severity: warning
team: ops
annotations:
summary: "{{ $labels.instance }} network interface {{ $labels.device }} error packets"Application‑level rules (app_alerts.yml)
Application‑level rules (app_alerts.yml)
# /etc/prometheus/rules/app_alerts.yml
groups:
- name: app_alerts
rules:
- alert: ServiceDown
expr: up{job=~"app-.*"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.job }} instance {{ $labels.instance }} unreachable"
- alert: HighErrorRate
expr: |
sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) /
sum by(job) (rate(http_requests_total[5m])) > 0.05
for: 3m
labels:
severity: critical
annotations:
summary: "{{ $labels.job }} 5xx error rate {{ $value | humanizePercentage }}"
description: "Error rate > 5 % for 3 minutes – check logs."
- alert: HighLatencyP99
expr: |
histogram_quantile(0.99, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.job }} P99 latency {{ $value | humanizeDuration }}"
- alert: QPSDropSudden
expr: |
sum by(job) (rate(http_requests_total[5m])) <
sum by(job) (rate(http_requests_total[1h] offset 1d)) * 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "{{ $labels.job }} QPS dropped > 50 % vs yesterday"
description: "Current QPS {{ $value }} – possible traffic anomaly."
- alert: JVMHeapHigh
expr: |
jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.instance }} JVM heap {{ $value | humanizePercentage }}"
- alert: JVMGCTimeHigh
expr: |
rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m]) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.instance }} GC avg pause > 500 ms"Middleware rules (middleware_alerts.yml)
Middleware rules (middleware_alerts.yml)
# /etc/prometheus/rules/middleware_alerts.yml
groups:
- name: mysql_alerts
rules:
- alert: MySQLDown
expr: mysql_up == 0
for: 1m
labels:
severity: critical
team: dba
annotations:
summary: "MySQL {{ $labels.instance }} down"
- alert: MySQLSlowQueries
expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
for: 5m
labels:
severity: warning
team: dba
annotations:
summary: "MySQL {{ $labels.instance }} slow query rate increased"
- alert: MySQLConnectionsHigh
expr: |
mysql_global_status_threads_connected /
mysql_global_variables_max_connections > 0.8
for: 5m
labels:
severity: warning
team: dba
annotations:
summary: "MySQL {{ $labels.instance }} connection usage {{ $value | humanizePercentage }}"
- name: redis_alerts
rules:
- alert: RedisDown
expr: redis_up == 0
for: 1m
labels:
severity: critical
team: dba
annotations:
summary: "Redis {{ $labels.instance }} down"
- alert: RedisMemoryHigh
expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
for: 5m
labels:
severity: warning
team: dba
annotations:
summary: "Redis {{ $labels.instance }} memory {{ $value | humanizePercentage }}"
- alert: RedisRejectedConnections
expr: increase(redis_rejected_connections_total[5m]) > 0
for: 1m
labels:
severity: warning
team: dba
annotations:
summary: "Redis {{ $labels.instance }} connection rejections"Notification template (DingTalk)
{{/* /etc/alertmanager/templates/dingtalk.tmpl */}}
{{ define "ding.link.title" }}
{{ if eq (index .Alerts 0).Labels.severity "critical" }}[P0-严重]{{ else }}[P1-警告]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }}条)
{{ end }}
{{ define "ding.link.content" }}
{{ if eq .Status "firing" }}**🔴 告警触发**{{ else }}**🟢 告警恢复**{{ end }}
**告警名称**: {{ .GroupLabels.alertname }}
**告警级别**: {{ (index .Alerts 0).Labels.severity }}
**告警数量**: {{ .Alerts | len }}条
**触发时间**: {{ (.Alerts.Firing | first).StartsAt.Local.Format "2006-01-02 15:04:05" }}
{{ range .Alerts }}
---
**实例**: {{ .Labels.instance }}
**摘要**: {{ .Annotations.summary }}
**详情**: {{ .Annotations.description }}
{{ end }}
[查看详情]({{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }})
{{ end }}
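The localhost:8060 URLs in alertmanager.yml and the ding.link.* template names above assume a DingTalk webhook adapter such as timonwong/prometheus-webhook-dingtalk sitting between Alertmanager and the DingTalk robot API. A sketch of that adapter's config.yml under that assumption (tokens and secrets are placeholders; each key under targets maps to a /dingtalk/<name>/send path):
# /etc/prometheus-webhook-dingtalk/config.yml (assumes the prometheus-webhook-dingtalk adapter)
templates:
  - /etc/prometheus-webhook-dingtalk/templates/dingtalk.tmpl
targets:
  ops:                                # served at http://localhost:8060/dingtalk/ops/send
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_OPS_TOKEN
    secret: SEC_YOUR_OPS_SECRET       # robot signing secret lives here, not in Alertmanager
  critical:
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_CRITICAL_TOKEN
  dba:
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_DBA_TOKEN
    secret: SEC_YOUR_DBA_SECRET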
Backup & Recovery
Backup script
#!/bin/bash
# /opt/scripts/alertmanager_backup.sh
BACKUP_DIR="/data/backup/alertmanager"
DATE=$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}"
# Config and rule files
tar czf "${BACKUP_DIR}/alertmanager_config_${DATE}.tar.gz" /etc/alertmanager/ /etc/prometheus/rules/
# Runtime state (silences, notification history)
tar czf "${BACKUP_DIR}/alertmanager_data_${DATE}.tar.gz" /var/lib/alertmanager/
# Keep only the last 30 days
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +30 -deleteRecovery steps
Recovery steps
Stop Alertmanager: sudo systemctl stop alertmanager
Restore configuration: tar xzf /data/backup/alertmanager/alertmanager_config_YYYYMMDD.tar.gz -C /
Restore state data: tar xzf /data/backup/alertmanager/alertmanager_data_YYYYMMDD.tar.gz -C /
Start Alertmanager: sudo systemctl start alertmanager
Best Practices
Rule grouping: Split large rule sets into logical groups (e.g., node, app, middleware). Prometheus evaluates groups in parallel; in one case, splitting 600 rules into 12 groups reduced evaluation time from ~3 s to <1 s.
Avoid heavy PromQL: Expressions that touch many series or rely on joins (group_left), label_replace, or high-cardinality aggregations are expensive to re-evaluate every cycle. Use recording rules to pre-aggregate data whenever possible (see the recording-rule sketch after the node rules above).
Evaluation interval tuning: A 15 s evaluation interval suits fast-changing metrics. For low-frequency checks (disk space, certificate expiry) set interval: 60s or higher on the rule group.
Alertmanager grouping: Keep group_by to ['alertname','cluster']. Adding instance creates a separate notification per host, which often leads to alert storms.
Basic authentication: Protect the Alertmanager UI and API with basic_auth_users in /etc/alertmanager/web.yml (loaded via --web.config.file) to prevent unauthorized silences; see the sketch after this list.
Internal webhook endpoints: Run DingTalk/WeChat webhook adapters on private network addresses; never expose the raw access token to the internet.
Sensitive data masking: Templates should omit passwords, connection strings, and internal IPs. Expose only the identifiers needed for troubleshooting.
Silence audit: Periodically review active silences with amtool silence query and remove or renew entries older than about a week; forgotten silences hide real problems.
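A minimal web config sketch for the basic-authentication point above (the bcrypt hash is a placeholder; generate one with htpasswd -nB or a similar tool), enabled by adding --web.config.file=/etc/alertmanager/web.yml to ExecStart:
# /etc/alertmanager/web.yml
basic_auth_users:
  admin: $2y$10$REPLACE_WITH_A_BCRYPT_HASH   # bcrypt hash, never the plain password
Prometheus must then present the same credentials via basic_auth in its alerting.alertmanagers section.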
High Availability
Deploy at least three Alertmanager instances in cluster mode (gossip on port 9094, both TCP and UDP). The cluster automatically replicates silences and notification state.
Example systemd ExecStart for a node (replace 10.0.1.50 etc. with your IPs):
alertmanager \
--cluster.listen-address=0.0.0.0:9094 \
--cluster.peer=10.0.1.51:9094 \
--cluster.peer=10.0.1.52:9094
Point every Prometheus server at all Alertmanager instances (list each one under alerting.alertmanagers); Prometheus sends alerts to every instance and the cluster de-duplicates the notifications. Critical alerts should also have at least two independent receivers (e.g., DingTalk + phone/SMS) to survive a single channel outage.
Back up --storage.path regularly (see backup script) and store backups off‑site.
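To confirm that the peers actually found each other, the status API reports cluster membership (jq is used here only for readability):
# Each instance should list all peers and report a "ready" cluster status
curl -s http://localhost:9093/api/v2/status | jq '.cluster'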
Troubleshooting
View Alertmanager logs: sudo journalctl -u alertmanager -f
Validate configuration syntax: amtool check-config /etc/alertmanager/alertmanager.yml
Validate rule files: promtool check rules /etc/prometheus/rules/*.yml
Test routing for a synthetic alert:
amtool --alertmanager.url=http://localhost:9093 config routes test \
severity=warning alertname=TestAlert instance=test-node:9100
Common errors:
Rule never fires – check the PromQL expression in the Prometheus UI and verify the for duration.
Duplicate notifications – group_by is too granular (e.g., includes instance).
Missing resolve notifications – ensure send_resolved: true is set in the receiver.
Cluster state diverges – verify that port 9094 is open for both TCP and UDP on all nodes.
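Routing tests only exercise the configuration; to test the full pipeline end to end, a synthetic alert can be pushed straight into Alertmanager (the labels below are arbitrary examples; the alert expires after five minutes):
amtool --alertmanager.url=http://localhost:9093 alert add \
  alertname=TestAlert severity=warning instance=test-node:9100 \
  --annotation=summary="End-to-end notification test" \
  --end=$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)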
Monitoring Alertmanager
Expose the built‑in /metrics endpoint and watch the following key series:
alertmanager_notifications_failed_total – should stay at 0.
alertmanager_notification_latency_seconds – typical latency < 5 s; alert if > 30 s.
prometheus_rule_group_duration_seconds (exposed by Prometheus itself) – rule evaluation should take < 1 s; alert if > 5 s.
alertmanager_alerts – active alert count; set a threshold based on cluster size (e.g., > 500 may indicate a problem).
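These series can themselves drive a meta-alert so that a silent notification pipeline does not go unnoticed; a sketch in the same rule-file style as above (the file name is illustrative):
# /etc/prometheus/rules/meta_alerts.yml (hypothetical "monitor the monitoring" group)
groups:
  - name: meta_alerts
    rules:
      - alert: AlertmanagerNotificationsFailing
        expr: increase(alertmanager_notifications_failed_total[10m]) > 0
        for: 5m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Alertmanager {{ $labels.instance }} failed to deliver notifications via {{ $labels.integration }}"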
Next‑Step Topics
Auto‑remediation : Use Alertmanager webhooks to trigger scripts that automatically restart a crashed service or clean up log files.
AIOps‑driven predictive alerts : Extend the simple predict_linear usage with external ML models for anomaly detection.
On‑call rotation integration : Connect Alertmanager to PagerDuty, OpsGenie or a custom on‑call service to automate escalation and rotation.
