Enterprise Monitoring with Prometheus: Rule Hierarchy and Alertmanager Notification Orchestration
This guide explains how to turn a fully built Prometheus monitoring system into a closed‑loop alerting solution by designing layered PromQL rules, configuring Alertmanager routing, grouping, inhibition and silencing, integrating DingTalk and WeChat webhooks, and applying best‑practice performance, security, high‑availability, and troubleshooting techniques.
Overview
Monitoring without alerts is useless. Prometheus evaluates alert rules via PromQL and forwards triggered alerts to Alertmanager, which handles deduplication, grouping, routing, silencing, and notification through email, DingTalk, Enterprise WeChat, or custom webhooks.
PromQL-driven rules: examples such as "CPU usage > 85% for 5 min", "disk will fill in 24 h", and "HTTP error rate 3× normal".
Routing tree + grouping: alerts are grouped by ['alertname','cluster'] to avoid alert storms; a network fault that generated 80 alerts was reduced to 3 notifications.
Inhibition and silencing: node-down alerts suppress service alerts on the same node; silences can be applied during maintenance windows.
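The grouping and inhibition behaviour in the list above can be illustrated with a small Python sketch. It is a simplified model, not Alertmanager's implementation: alerts are plain label dictionaries, and the alert names, cluster name, and instance addresses are hypothetical sample data chosen to mirror the "80 alerts, 3 notifications" case.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Bucket alerts by the values of the group_by labels (one notification per bucket)."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


def is_inhibited(target, firing, source_match, equal):
    """True if another firing alert matches source_match and agrees with target on every `equal` label."""
    for source in firing:
        if source is target:
            continue  # simplification: an alert never inhibits itself
        matches_source = all(source.get(k) == v for k, v in source_match.items())
        same_scope = all(source.get(k) == target.get(k) for k in equal)
        if matches_source and same_scope:
            return True
    return False


# Hypothetical storm: 80 alerts with 3 distinct alert names in one cluster.
storm = (
    [{"alertname": "InstanceUnreachable", "cluster": "prod-a", "instance": f"10.0.0.{i}"} for i in range(40)]
    + [{"alertname": "HighLatency", "cluster": "prod-a", "instance": f"10.0.0.{i}"} for i in range(30)]
    + [{"alertname": "PacketLoss", "cluster": "prod-a", "instance": f"10.0.0.{i}"} for i in range(10)]
)
groups = group_alerts(storm)
print(len(storm), "alerts ->", len(groups), "notification groups")  # 80 alerts -> 3 notification groups

# Hypothetical inhibition: NodeDown on a node suppresses ServiceDown on that same node only.
node_down = {"alertname": "NodeDown", "instance": "10.0.0.1:9100"}
svc_same = {"alertname": "ServiceDown", "instance": "10.0.0.1:9100"}
svc_other = {"alertname": "ServiceDown", "instance": "10.0.0.2:9100"}
firing = [node_down, svc_same, svc_other]
print(is_inhibited(svc_same, firing, {"alertname": "NodeDown"}, ["instance"]))   # True
print(is_inhibited(svc_other, firing, {"alertname": "NodeDown"}, ["instance"]))  # False
```

Adding instance to group_by in this sketch would produce one group per host, which is why the guide groups only by alertname and cluster.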
Environment Requirements
Prometheus >= 2.45 (alert rule evaluation engine; Prometheus and Alertmanager are versioned independently, so the version numbers do not need to match).
Alertmanager >= 0.27 (fixes key bugs in cluster mode).
OS: CentOS 7+ / Ubuntu 20.04+ (Alertmanager runs on 1 CPU + 1 GB RAM).
Network: Prometheus → Alertmanager on TCP 9093; Alertmanager cluster gossip on TCP/UDP 9094.
Notification channels: at least one of Email, DingTalk, Enterprise WeChat; redundancy recommended.
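Before installing, it can help to confirm that the TCP ports listed above are reachable. The following is a minimal sketch using only the Python standard library; the function name tcp_port_open is an illustrative choice, and the check covers TCP only, so it says nothing about the UDP side of the gossip port.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, tcp_port_open("alertmanager.example.com", 9093) run from the Prometheus host checks the alert delivery path, and port 9094 between Alertmanager nodes checks the TCP half of the cluster link.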
Installation and Configuration
Alertmanager installation
# create user
sudo useradd --no-create-home --shell /bin/false alertmanager
# download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64
# install binaries
sudo cp alertmanager /usr/local/bin/
sudo cp amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
# create config directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# verify version
alertmanager --version
Systemd service
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --web.external-url=http://alertmanager.example.com:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --log.level=info \
  --data.retention=120h
Restart=always
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
Alertmanager main configuration (alertmanager.yml)
global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'            # replace with your sender address
  smtp_smarthost: 'smtp.example.com:465'
  smtp_auth_username: '[email protected]'   # replace with your SMTP account
  smtp_auth_password: 'your_smtp_password'
  smtp_require_tls: false                 # port 465 uses implicit TLS, so STARTTLS is not needed
templates:
  - '/etc/alertmanager/templates/*.tmpl'
route:
  group_by: ['alertname','cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-webhook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-pager'
      group_wait: 10s
      repeat_interval: 1h
      continue: false
    - match:
        severity: warning
      receiver: 'warning-dingtalk'
      group_wait: 30s
      repeat_interval: 4h
      continue: false
    - match_re:
        job: '(mysql|redis|mongodb).*'
      receiver: 'dba-dingtalk'
      group_wait: 30s
      repeat_interval: 2h
      continue: false
    - match:
        team: order
      receiver: 'order-team-webhook'
      continue: false
    - match:
        team: payment
      receiver: 'payment-team-webhook'
      continue: false
inhibit_rules:
  # A NodeDown alert suppresses every other alert on the same instance.
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.+'
    equal: ['instance']
  # A critical alert suppresses the warning with the same name and instance.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname','instance']
receivers:
  - name: 'default-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/ops/send'
        send_resolved: true
  - name: 'critical-pager'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/critical/send'
        send_resolved: true
    email_configs:
      - to: '[email protected]'             # replace with your on-call address
        send_resolved: true
        headers:
          Subject: '[P0-CRITICAL] {{ .GroupLabels.alertname }}'
  - name: 'warning-dingtalk'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/warning/send'
        send_resolved: true
  - name: 'dba-dingtalk'
    webhook_configs:
      # NOTE: the DingTalk robot API does not accept Alertmanager's webhook payload as-is.
      # In practice this should go through an adapter such as prometheus-webhook-dingtalk,
      # like the other receivers, which also keeps the access token out of this file.
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=YOUR_DBA_TOKEN'
        send_resolved: true
  - name: 'order-team-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/order/send'
        send_resolved: true
  - name: 'payment-team-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/payment/send'
        send_resolved: true
Prometheus Alert Rules
Rules are organized into groups: Prometheus evaluates different groups concurrently, while the rules inside one group run sequentially. Example node-monitoring rules:
# /etc/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "The node has been unresponsive for more than 2 minutes; investigate immediately"
          runbook: "https://wiki.internal/runbook/node-down"
      - alert: NodeCPUHigh
        expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
        for: 5m
        labels:
          severity: warning
          team: ops
        annotations:
          summary: "{{ $labels.instance }} CPU usage {{ $value | humanizePercentage }}"
          description: "CPU has stayed above 85% for 5 minutes; check for abnormal processes"
Application-service rules (example):
# /etc/prometheus/rules/app_alerts.yml
groups:
  - name: app_alerts
    rules:
      - alert: ServiceDown
        expr: up{job=~"app-.*"} == 0
        for: 1m
        labels:
          severity: critical
          team: ops
        annotations:
          summary: "Service {{ $labels.job }} instance {{ $labels.instance }} is unreachable"
      - alert: HighErrorRate
        expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} HTTP 5xx error rate {{ $value | humanizePercentage }}"
          description: "Error rate has exceeded 5% for 3 minutes; check the application logs"
Case Studies
Alert Level Strategy (P0‑P3)
Four severity levels with distinct notification channels and response‑time goals:
P0 (core service down): phone, SMS, DingTalk; 5‑minute SLA.
P1 (degraded service): DingTalk + email; 15‑minute SLA.
P2 (resource warning): DingTalk group; 1‑hour SLA.
P3 (informational): email; next‑work‑day handling.
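The four levels above can also be written down as a lookup table, which is handy when a webhook adapter or report script needs to know the expected channels and response time for an alert. This is a minimal sketch: the severity-to-level mapping follows the severity values used in this guide's routing snippet, while the class name, channel identifiers, and the fallback to P3 for unknown severities are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LevelPolicy:
    level: str
    channels: Tuple[str, ...]
    response_minutes: Optional[int]  # None means "next work day"


# Keys are the values of the `severity` label.
POLICIES = {
    "critical": LevelPolicy("P0", ("phone", "sms", "dingtalk"), 5),
    "warning": LevelPolicy("P1", ("dingtalk", "email"), 15),
    "info": LevelPolicy("P2", ("dingtalk-group",), 60),
    "none": LevelPolicy("P3", ("email",), None),
}


def policy_for(labels):
    """Look up the policy for an alert's labels; unknown severities fall back to P3 (assumption)."""
    return POLICIES.get(labels.get("severity", "none"), POLICIES["none"])


print(policy_for({"alertname": "NodeDown", "severity": "critical"}).level)  # P0
print(policy_for({"alertname": "DiskInfo"}).response_minutes)               # None
```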
# Routing snippet implementing the levels
route:
  routes:
    - match:
        severity: critical
      receiver: 'p0-pager'
      group_wait: 10s
    - match:
        severity: warning
      receiver: 'p1-dingtalk'
      group_wait: 30s
    - match:
        severity: info
      receiver: 'p2-dingtalk'
      group_wait: 1m
    - match:
        severity: none
      receiver: 'p3-email'
Enterprise WeChat Webhook Integration
Alertmanager's built-in wechat_configs integration targets Enterprise WeChat application messages rather than group-robot webhooks, so a lightweight Python (Flask) service receives Alertmanager webhooks, formats them as markdown, and forwards them to the WeChat robot API.
#!/usr/bin/env python3
"""Relay Alertmanager webhooks to an Enterprise WeChat (WeCom) group robot."""
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
WECOM_WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_WECOM_KEY"


@app.route('/webhook', methods=['POST'])
def webhook():
    # Tolerate a missing or non-JSON body instead of raising.
    data = request.get_json(silent=True) or {}
    status = data.get('status', 'unknown')
    alerts = data.get('alerts', [])
    if status == 'firing':
        color = "warning"
        title = f"Alerts firing ({len(alerts)})"
    else:
        color = "info"
        title = f"Alerts resolved ({len(alerts)})"
    # WeCom markdown supports <font color="info|comment|warning">.
    content = [f'## <font color="{color}">{title}</font>\n']
    for alert in alerts[:10]:  # show up to 10 alerts
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        content.append(f"**{labels.get('alertname', 'N/A')}**")
        content.append(f"> Instance: {labels.get('instance', 'N/A')}")
        content.append(f"> Severity: {labels.get('severity', 'N/A')}")
        content.append(f"> Summary: {annotations.get('summary', 'N/A')}\n")
    payload = {"msgtype": "markdown", "markdown": {"content": "\n".join(content)}}
    resp = requests.post(WECOM_WEBHOOK_URL, json=payload, timeout=10)
    return jsonify({"status": "ok", "wecom_response": resp.status_code})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8065)
Best Practices & Security
Rule grouping: Split 600+ rules into ~12 groups; evaluation time dropped from 3.2 s to 0.8 s.
Avoid heavy PromQL: Aggregations such as count() over high-cardinality series, many-to-one joins with group_left, and label_replace can be expensive; use recording rules to pre-aggregate.
Evaluation interval tuning: The global evaluation_interval defaults to 1 m, although many deployments set it to 15 s; give rule groups for slow-moving signals such as disk space their own longer interval (e.g., 60 s or more).
Alertmanager deduplication: Use group_by: ['alertname','cluster'] to merge similar alerts; avoid adding instance, which would fragment groups.
Basic Auth: Protect the Alertmanager UI with Basic Auth to prevent unauthorized silencing.
Webhook network isolation: Keep DingTalk/WeChat webhook services on the internal network; never expose access tokens publicly.
Template hygiene: Mask sensitive data in templates; expose only the diagnostics responders need.
Silence audit: Run a periodic script that scans for silences older than 7 days and raises a reminder.
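The silence audit in the last item can be sketched as a pure function over the silence objects returned by Alertmanager's GET /api/v2/silences endpoint. Fetching the list and sending the reminder are left out; the function names, the sample silences, and the choice to measure age from startsAt are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_rfc3339(value):
    """Parse an RFC 3339 timestamp such as '2024-05-01T12:00:00.000Z' (fractional seconds dropped)."""
    value = re.sub(r"\.\d+", "", value)
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def stale_silences(silences, now, max_age=timedelta(days=7)):
    """Return active silences whose start time is more than max_age before `now`."""
    stale = []
    for silence in silences:
        if silence.get("status", {}).get("state") != "active":
            continue  # expired and pending silences need no reminder
        if now - parse_rfc3339(silence["startsAt"]) > max_age:
            stale.append(silence)
    return stale


# Hypothetical sample data shaped like the /api/v2/silences response.
sample = [
    {"id": "a", "status": {"state": "active"}, "startsAt": "2024-05-01T00:00:00.000Z", "createdBy": "ops"},
    {"id": "b", "status": {"state": "active"}, "startsAt": "2024-05-08T00:00:00.000Z", "createdBy": "dba"},
    {"id": "c", "status": {"state": "expired"}, "startsAt": "2024-04-01T00:00:00.000Z", "createdBy": "ops"},
]
now = datetime(2024, 5, 10, tzinfo=timezone.utc)
print([s["id"] for s in stale_silences(sample, now)])  # ['a']
```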
High Availability
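Deploy at least three Alertmanager instances in cluster mode, with gossip on port 9094 (TCP and UDP). Each instance lists the other two as peers, as in the startup flags below. Configure Prometheus to send alerts to every instance directly rather than through a load balancer, so that the cluster can deduplicate notifications. Redundant notification channels (e.g., DingTalk + phone) ensure alerts are delivered even if one channel fails.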
# Instance 1
alertmanager --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.51:9094 --cluster.peer=10.0.1.52:9094
# Instance 2
alertmanager --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.50:9094 --cluster.peer=10.0.1.52:9094
# Instance 3
alertmanager --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.50:9094 --cluster.peer=10.0.1.51:9094
Troubleshooting
Validate rule syntax: promtool check rules /etc/prometheus/rules/*.yml
Validate the Alertmanager config: amtool check-config /etc/alertmanager/alertmanager.yml
Test routing: amtool --alertmanager.url=http://localhost:9093 config routes test severity=warning alertname=NodeCPUHigh
Inspect logs: journalctl -u alertmanager -f, and check for suppressed or inhibited alerts.
Check cluster status: curl -s http://localhost:9093/api/v2/status | jq .cluster
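When a routing test returns an unexpected receiver, it helps to reason about the routing tree's first-match behaviour. The sketch below models this guide's top-level routes in Python under simplifying assumptions: a single level of child routes, exact matching for match, fully anchored regexes for match_re, and continue defaulting to false. It is an illustration, not a substitute for amtool.

```python
import re

# Child routes in the same order as the main configuration in this guide.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "critical-pager"},
    {"match": {"severity": "warning"}, "receiver": "warning-dingtalk"},
    {"match_re": {"job": "(mysql|redis|mongodb).*"}, "receiver": "dba-dingtalk"},
    {"match": {"team": "order"}, "receiver": "order-team-webhook"},
    {"match": {"team": "payment"}, "receiver": "payment-team-webhook"},
]
DEFAULT_RECEIVER = "default-webhook"


def route_matches(route, labels):
    """True if every equality and regex matcher of the route matches the labels."""
    exact = all(labels.get(k) == v for k, v in route.get("match", {}).items())
    regex = all(re.fullmatch(p, labels.get(k, "")) for k, p in route.get("match_re", {}).items())
    return exact and regex


def receivers_for(labels):
    """Walk the child routes in order; stop at the first match unless it sets continue: true."""
    hits = []
    for route in ROUTES:
        if route_matches(route, labels):
            hits.append(route["receiver"])
            if not route.get("continue", False):
                break
    return hits or [DEFAULT_RECEIVER]


print(receivers_for({"severity": "warning", "alertname": "NodeCPUHigh"}))  # ['warning-dingtalk']
print(receivers_for({"severity": "warning", "job": "mysql-exporter"}))     # ['warning-dingtalk']
print(receivers_for({"job": "mysql-exporter"}))                            # ['dba-dingtalk']
print(receivers_for({"alertname": "Unlabelled"}))                          # ['default-webhook']
```

The second call shows a consequence of route order: a database alert that carries severity: warning stops at the severity route and never reaches dba-dingtalk unless that earlier route sets continue: true or the database route is moved above it.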
Performance Monitoring
Key metrics exposed on /metrics include:
alertmanager_notifications_failed_total: should stay at 0; any increase means a notification failed to send.
alertmanager_notification_latency_seconds: typically <5 s; alert if >30 s.
prometheus_rule_group_duration_seconds (exposed by Prometheus): aim for <1 s; alert if >5 s.
Active alert count: monitor for spikes.
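For a quick manual check of the failure counter, the text returned by /metrics can be summed per metric name. This is a minimal sketch that assumes the plain Prometheus text exposition format without trailing timestamps; the function name and the sample lines are illustrative, and in production an alert on the counter's increase is the better tool.

```python
def sum_metric(exposition_text, name):
    """Sum the values of all series of `name` in Prometheus text exposition format."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


# Illustrative sample, not captured from a real instance.
sample_metrics = """\
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="email"} 0
alertmanager_notifications_failed_total{integration="webhook"} 2
alertmanager_notifications_total{integration="webhook"} 40
"""
print(sum_metric(sample_metrics, "alertmanager_notifications_failed_total"))  # 2.0
```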
Backup & Restore
#!/usr/bin/env bash
# Backup script (alertmanager_backup.sh)
set -euo pipefail
BACKUP_DIR="/data/backup/alertmanager"
DATE=$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}"
# Config files
tar czf "${BACKUP_DIR}/alertmanager_config_${DATE}.tar.gz" /etc/alertmanager/ /etc/prometheus/rules/
# State data (silences, notifications)
tar czf "${BACKUP_DIR}/alertmanager_data_${DATE}.tar.gz" /var/lib/alertmanager/
# Cleanup older than 30 days
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +30 -delete
Restore steps: stop Alertmanager, extract the appropriate tarballs to /, then start the service.
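When restoring, the first task is choosing which archives to extract. The following sketch picks the newest config and data archive from the backup directory, relying on the alertmanager_config_YYYYMMDD.tar.gz and alertmanager_data_YYYYMMDD.tar.gz names produced by the backup script; the function name is a hypothetical helper, and extraction itself is left to tar.

```python
import re
from pathlib import Path

BACKUP_NAME = re.compile(r"alertmanager_(config|data)_(\d{8})\.tar\.gz$")


def latest_backups(backup_dir):
    """Return {"config": Path, "data": Path} holding the newest archive of each kind found."""
    newest = {}
    for path in Path(backup_dir).iterdir():
        found = BACKUP_NAME.match(path.name)
        if not found:
            continue  # ignore unrelated files
        kind, date = found.groups()
        # YYYYMMDD strings sort chronologically, so plain string comparison is enough.
        if kind not in newest or date > newest[kind][0]:
            newest[kind] = (date, path)
    return {kind: path for kind, (date, path) in newest.items()}
```

Calling latest_backups("/data/backup/alertmanager") returns the two paths to extract; a kind with no archive is simply absent from the result.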
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
