
Master Prometheus Alerting: Write Rules and Configure Alertmanager for Reliable Notifications

This guide walks through Prometheus alerting end to end: crafting PromQL‑driven alert rules; setting up Alertmanager with routing, grouping, inhibition and silencing; wiring up DingTalk and Enterprise WeChat webhooks; and applying tiered alert strategies, performance tuning, security hardening, high‑availability deployment, troubleshooting, and backup/restore procedures.

MaGe Linux Operations

Overview

Prometheus alone only collects metrics; without a notification layer the monitoring loop is incomplete. Alertmanager provides de‑duplication, grouping, routing, inhibition and silencing, turning raw metric thresholds into actionable alerts that reach the right people.

Supported Scenarios

Infrastructure: host CPU, memory, disk, network, node exporter health.

Application services: HTTP latency, error rate, JVM heap/GC, service availability.

Business indicators: order volume, payment success rate, user registration trends.

Environment Requirements

Prometheus >= 2.45

Alertmanager >= 0.27 (cluster mode fixes)

Linux (CentOS 7+ or Ubuntu 20.04+), 1 CPU + 1 GB RAM is sufficient for the Alertmanager process.

Network ports: 9093 (Alertmanager API/Web UI) and 9094 (cluster gossip, TCP + UDP).

At least one outbound notification channel (e‑mail, DingTalk, Enterprise WeChat, generic webhook).

Installation

Alertmanager binary

# Create a non‑login user
sudo useradd --no-create-home --shell /bin/false alertmanager

# Download the official release (example v0.27.0)
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64

# Install binaries
sudo cp alertmanager /usr/local/bin/
sudo cp amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool

# Create configuration and data directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

# Verify version
alertmanager --version

Systemd service

sudo tee /etc/systemd/system/alertmanager.service > /dev/null <<'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --web.external-url=http://alertmanager.example.com:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --log.level=info \
  --data.retention=120h
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager

Parameter notes:

--data.retention=120h keeps Alertmanager state (silences, notification history) for five days.

--cluster.listen-address enables gossip clustering; omit it for a single‑node deployment.

--web.external-url must point to a reachable address, because links in notifications use this URL.

Core Configuration

alertmanager.yml

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:587'   # Alertmanager speaks STARTTLS; implicit-TLS port 465 is not supported
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'YOUR_SMTP_PASSWORD'
  smtp_require_tls: true

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s          # wait for a short burst before sending
  group_interval: 5m      # minimum interval between notifications for the same group
  repeat_interval: 4h     # repeat non‑resolved alerts every 4 h
  receiver: 'default-webhook'
  routes:
  - match:
      severity: critical
    receiver: 'critical-pager'
    group_wait: 10s
    repeat_interval: 1h
    continue: false
  - match:
      severity: warning
    receiver: 'warning-dingtalk'
    group_wait: 30s
    repeat_interval: 4h
    continue: false
  - match_re:
      job: '(mysql|redis|mongodb).*'
    receiver: 'dba-dingtalk'
    group_wait: 30s
    repeat_interval: 2h
    continue: false
  - match:
      team: order
    receiver: 'order-team-webhook'
    continue: false
  - match:
      team: payment
    receiver: 'payment-team-webhook'
    continue: false

inhibit_rules:
- source_match:
    alertname: 'NodeDown'
  target_match_re:
    alertname: '.+'
  equal: ['instance']
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']

receivers:
- name: 'default-webhook'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/ops/send'
    send_resolved: true
- name: 'critical-pager'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/critical/send'
    send_resolved: true
  - url: 'http://oncall-api.internal:8080/api/v1/alert'
    send_resolved: true
  email_configs:
  - to: '[email protected]'
    send_resolved: true
    headers:
      Subject: '[P0-CRITICAL] {{ .GroupLabels.alertname }}'
- name: 'warning-dingtalk'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/warning/send'
    send_resolved: true
- name: 'dba-dingtalk'
  webhook_configs:
  # Alertmanager's webhook_configs has no 'secret' field; a signed DingTalk robot
  # must be called through an adapter (e.g. prometheus-webhook-dingtalk) that
  # computes the signature and forwards the message.
  - url: 'http://localhost:8060/dingtalk/dba/send'
    send_resolved: true
- name: 'order-team-webhook'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/order/send'
    send_resolved: true
- name: 'payment-team-webhook'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/payment/send'
    send_resolved: true

The group_by list deliberately excludes instance so that a network outage on a single host does not generate dozens of separate notifications. Every alert carries a severity label, which drives the first two routes; the remaining routes dispatch on the job and team labels.
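For Prometheus to deliver alerts to this Alertmanager and evaluate the rule files below, its own configuration must reference both. A minimal prometheus.yml fragment (paths and target address assumed to match this guide):

```yaml
# /etc/prometheus/prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/rules/*.yml        # the rule files defined in the next section

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']  # Alertmanager web/API port
```

Reload Prometheus (kill -HUP or POST /-/reload) after changing either part.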

Alert rule files

Node‑level rules (node_alerts.yml)

# /etc/prometheus/rules/node_alerts.yml
groups:
- name: node_alerts
  rules:
  - alert: NodeDown
    expr: up{job="node-exporter"} == 0
    for: 2m
    labels:
      severity: critical
      team: ops
    annotations:
      summary: "Node {{ $labels.instance }} down"
      description: "No scrape for 2 minutes – investigate network or host failure."
  - alert: NodeCPUHigh
    expr: |
      1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
    for: 5m
    labels:
      severity: warning
      team: ops
    annotations:
      summary: "{{ $labels.instance }} CPU usage {{ $value | humanizePercentage }}"
      description: "CPU > 85 % for 5 minutes. Check for runaway processes."
  - alert: NodeMemoryHigh
    expr: |
      1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes > 0.90
    for: 5m
    labels:
      severity: warning
      team: ops
    annotations:
      summary: "{{ $labels.instance }} memory {{ $value | humanizePercentage }}"
  - alert: NodeDiskAlmostFull
    expr: |
      1 - node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} /
          node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"} > 0.85
    for: 5m
    labels:
      severity: warning
      team: ops
    annotations:
      summary: "{{ $labels.instance }} disk {{ $value | humanizePercentage }}"
  - alert: NodeDiskWillFull
    expr: |
      predict_linear(node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}[6h], 24*3600) < 0
    for: 10m
    labels:
      severity: warning
      team: ops
    annotations:
      summary: "{{ $labels.instance }} disk predicted to fill in 24 h"
      description: "Current write rate extrapolates to zero free space within a day."
  - alert: NodeNetworkErrors
    expr: |
      rate(node_network_receive_errs_total[5m]) > 10 or
      rate(node_network_transmit_errs_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
      team: ops
    annotations:
      summary: "{{ $labels.instance }} network interface {{ $labels.device }} error packets"

Application‑level rules (app_alerts.yml)

# /etc/prometheus/rules/app_alerts.yml
groups:
- name: app_alerts
  rules:
  - alert: ServiceDown
    expr: up{job=~"app-.*"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.job }} instance {{ $labels.instance }} unreachable"
  - alert: HighErrorRate
    expr: |
      sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) /
      sum by(job) (rate(http_requests_total[5m])) > 0.05
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.job }} 5xx error rate {{ $value | humanizePercentage }}"
      description: "Error rate > 5 % for 3 minutes – check logs."
  - alert: HighLatencyP99
    expr: |
      histogram_quantile(0.99, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.job }} P99 latency {{ $value | humanizeDuration }}"
  - alert: QPSDropSudden
    expr: |
      sum by(job) (rate(http_requests_total[5m])) <
      sum by(job) (rate(http_requests_total[1h] offset 1d)) * 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.job }} QPS dropped > 50 % vs yesterday"
      description: "Current QPS {{ $value }} – possible traffic anomaly."
  - alert: JVMHeapHigh
    expr: |
      jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} JVM heap {{ $value | humanizePercentage }}"
  - alert: JVMGCTimeHigh
    expr: |
      rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m]) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} GC avg pause > 500 ms"

Middleware rules (middleware_alerts.yml)

# /etc/prometheus/rules/middleware_alerts.yml
groups:
- name: mysql_alerts
  rules:
  - alert: MySQLDown
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
      team: dba
    annotations:
      summary: "MySQL {{ $labels.instance }} down"
  - alert: MySQLSlowQueries
    expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
      team: dba
    annotations:
      summary: "MySQL {{ $labels.instance }} slow query rate increased"
  - alert: MySQLConnectionsHigh
    expr: |
      mysql_global_status_threads_connected /
      mysql_global_variables_max_connections > 0.8
    for: 5m
    labels:
      severity: warning
      team: dba
    annotations:
      summary: "MySQL {{ $labels.instance }} connection usage {{ $value | humanizePercentage }}"
- name: redis_alerts
  rules:
  - alert: RedisDown
    expr: redis_up == 0
    for: 1m
    labels:
      severity: critical
      team: dba
    annotations:
      summary: "Redis {{ $labels.instance }} down"
  - alert: RedisMemoryHigh
    expr: |
      # maxmemory=0 means "unlimited" and would make the ratio +Inf, so guard against it
      (redis_memory_used_bytes / redis_memory_max_bytes > 0.85) and (redis_memory_max_bytes > 0)
    for: 5m
    labels:
      severity: warning
      team: dba
    annotations:
      summary: "Redis {{ $labels.instance }} memory {{ $value | humanizePercentage }}"
  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total[5m]) > 0
    for: 1m
    labels:
      severity: warning
      team: dba
    annotations:
      summary: "Redis {{ $labels.instance }} connection rejections"

Notification template (DingTalk)

{{/* /etc/alertmanager/templates/dingtalk.tmpl */}}
{{ define "ding.link.title" }}
{{ if eq (index .Alerts 0).Labels.severity "critical" }}[P0-严重]{{ else }}[P1-警告]{{ end }} {{ .GroupLabels.alertname }} ({{ .Alerts | len }}条)
{{ end }}

{{ define "ding.link.content" }}
{{ if eq .Status "firing" }}**🔴 告警触发**{{ else }}**🟢 告警恢复**{{ end }}

**告警名称**: {{ .GroupLabels.alertname }}
**告警级别**: {{ (index .Alerts 0).Labels.severity }}
**告警数量**: {{ .Alerts | len }}条
**触发时间**: {{ (.Alerts.Firing | first).StartsAt.Local.Format "2006-01-02 15:04:05" }}

{{ range .Alerts }}
---
**实例**: {{ .Labels.instance }}
**摘要**: {{ .Annotations.summary }}
**详情**: {{ .Annotations.description }}
{{ end }}

[查看详情]({{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }})
{{ end }}
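The template above is rendered by the webhook adapter listening on port 8060 (prometheus-webhook-dingtalk), not by Alertmanager itself. A matching adapter configuration might look like this (a sketch; token, secret and the target name are placeholders):

```yaml
# config.yml for prometheus-webhook-dingtalk (assumed install)
templates:
  - /etc/alertmanager/templates/dingtalk.tmpl

targets:
  ops:   # served at http://localhost:8060/dingtalk/ops/send
    url: https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN
    secret: SEC_YOUR_SECRET
    message:
      title: '{{ template "ding.link.title" . }}'
      text: '{{ template "ding.link.content" . }}'
```

Each key under targets becomes one /dingtalk/<name>/send endpoint, so the per-team receivers in alertmanager.yml map to separate robots.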

Backup & Recovery

Backup script

#!/bin/bash
# /opt/scripts/alertmanager_backup.sh
BACKUP_DIR="/data/backup/alertmanager"
DATE=$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}"
# Config and rule files
tar czf "${BACKUP_DIR}/alertmanager_config_${DATE}.tar.gz" /etc/alertmanager/ /etc/prometheus/rules/
# Runtime state (silences, notification history)
tar czf "${BACKUP_DIR}/alertmanager_data_${DATE}.tar.gz" /var/lib/alertmanager/
# Keep only the last 30 days
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +30 -delete
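To run the backup nightly, a cron entry can be dropped into /etc/cron.d (a sketch; path, schedule and log file are assumptions):

```shell
# /etc/cron.d/alertmanager-backup – run the backup at 02:30 every night
30 2 * * * root /opt/scripts/alertmanager_backup.sh >> /var/log/alertmanager_backup.log 2>&1
```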

Recovery steps

Stop Alertmanager:

sudo systemctl stop alertmanager

Restore configuration:

tar xzf /data/backup/alertmanager/alertmanager_config_YYYYMMDD.tar.gz -C /

Restore state data:

tar xzf /data/backup/alertmanager/alertmanager_data_YYYYMMDD.tar.gz -C /

Start Alertmanager:

sudo systemctl start alertmanager

Best Practices

Rule grouping: Split large rule sets into logical groups (e.g., node, app, middleware). Prometheus evaluates groups in parallel; splitting 600 rules into 12 groups reduced evaluation time from ~3 s to <1 s.

Avoid heavy PromQL: Constructs such as count() over high-cardinality series, group_left joins and label_replace can be CPU-intensive. Use recording rules to pre-aggregate data whenever possible.
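For example, the per-job 5xx ratio used by HighErrorRate can be pre-computed with a recording rule, so the alert expression becomes a cheap lookup (a sketch; the record name and file path are assumptions):

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
- name: recording_rules
  interval: 30s
  rules:
  - record: job:http_requests_error_ratio:rate5m
    expr: |
      sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) /
      sum by(job) (rate(http_requests_total[5m]))
# The HighErrorRate alert then simplifies to:
#   expr: job:http_requests_error_ratio:rate5m > 0.05
```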

Evaluation interval tuning: A 15 s interval is fine for fast-changing metrics. For low-frequency checks (disk space, certificate expiry), set interval: 60s on the rule group.

Alertmanager grouping: Keep group_by to ['alertname','cluster']. Adding instance creates a separate notification per host, which often leads to alert storms.

Basic authentication: Protect the Alertmanager UI and API with basic_auth_users in a web config file (e.g. /etc/alertmanager/web.yml, passed via --web.config.file) to prevent unauthorized silences.

Internal webhook endpoints: Run DingTalk/WeChat webhook adapters on private network addresses; never expose the raw access token to the internet.

Sensitive data masking: Templates should omit passwords, connection strings and internal IPs. Only expose identifiers needed for troubleshooting.

Silence audit: Periodically list silences older than 7 days with amtool silence query and alert on stale entries.

High Availability

Deploy at least three Alertmanager instances in cluster mode (gossip on port 9094, both TCP and UDP). The cluster automatically replicates silences and notification state.

Example systemd ExecStart for a node (replace 10.0.1.50 etc. with your IPs):

alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.51:9094 \
  --cluster.peer=10.0.1.52:9094

Critical alerts should have at least two independent receivers (e.g., DingTalk + phone/SMS) to survive a single channel outage.

Back up --storage.path regularly (see backup script) and store backups off‑site.

Troubleshooting

View Alertmanager logs:

sudo journalctl -u alertmanager -f

Validate configuration syntax:

amtool check-config /etc/alertmanager/alertmanager.yml

Validate rule files:

promtool check rules /etc/prometheus/rules/*.yml

Test routing for a synthetic alert:

amtool --alertmanager.url=http://localhost:9093 config routes test \
  severity=warning alertname=TestAlert instance=test-node:9100

Common errors:

Rule never fires – check the PromQL expression in the Prometheus UI and verify the for duration.

Duplicate notifications – group_by is too granular (e.g., includes instance).

Missing resolve notifications – ensure send_resolved: true is set in the receiver.

Cluster state diverges – verify that port 9094 is open for both TCP and UDP on all nodes.

Monitoring Alertmanager

Expose the built‑in /metrics endpoint and watch the following key series:

alertmanager_notifications_failed_total – should stay at 0.

alertmanager_notification_latency_seconds – typical latency < 5 s; alert if > 30 s.

prometheus_rule_group_duration_seconds – rule evaluation should be < 1 s; alert if > 5 s.

alertmanager_alerts – active alert count; set a threshold based on cluster size (e.g., > 500 may indicate a problem).
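These thresholds can themselves be enforced with a small self-monitoring rule group (a sketch; thresholds mirror the guidance above, and the file path is an assumption):

```yaml
# /etc/prometheus/rules/meta_alerts.yml
groups:
- name: meta_alerts
  rules:
  - alert: AlertmanagerNotificationsFailing
    expr: increase(alertmanager_notifications_failed_total[5m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Alertmanager failed to deliver notifications via {{ $labels.integration }}"
  - alert: RuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Rule group {{ $labels.rule_group }} takes > 5 s to evaluate"
```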

Next‑Step Topics

Auto‑remediation: Use Alertmanager webhooks to trigger scripts that automatically restart a crashed service or clean up log files.

AIOps‑driven predictive alerts: Extend the simple predict_linear usage with external ML models for anomaly detection.

On‑call rotation integration: Connect Alertmanager to PagerDuty, OpsGenie or a custom on‑call service to automate escalation and rotation.
