Enterprise Monitoring with Prometheus: Rule Hierarchy and Alertmanager Notification Orchestration

This guide explains how to turn a fully built Prometheus monitoring system into a closed‑loop alerting solution by designing layered PromQL rules, configuring Alertmanager routing, grouping, inhibition and silencing, integrating DingTalk and WeChat webhooks, and applying best‑practice performance, security, high‑availability, and troubleshooting techniques.

Raymond Ops

Jun 17, 2026

Overview

Monitoring without alerts is useless. Prometheus evaluates alert rules via PromQL and forwards triggered alerts to Alertmanager, which handles deduplication, grouping, routing, silencing, and notification through email, DingTalk, Enterprise WeChat, or custom webhooks.

PromQL‑driven rules : examples such as "CPU usage > 85% for 5 min", "disk will fill in 24 h", and "HTTP error rate 3× normal".

Routing tree + grouping : alerts are grouped by ['alertname','cluster'] to avoid alert storms; a network fault that generated 80 alerts was reduced to 3 notifications.

Inhibition and silencing : node‑down alerts suppress service alerts; silences can be applied during maintenance windows.
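
To make the grouping claim concrete, here is a minimal Python sketch of how a group_by key collapses many alerts into a few notifications. It is not Alertmanager's implementation, and the alert labels, cluster name, and per-alert counts are made-up illustration data chosen to add up to 80.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (a missing label counts as '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical storm: 80 alerts caused by one network fault in cluster "prod-a".
alerts = (
    [{"alertname": "NodeDown", "cluster": "prod-a", "instance": f"10.0.1.{i}:9100"} for i in range(40)]
    + [{"alertname": "ServiceDown", "cluster": "prod-a", "instance": f"10.0.2.{i}:8080"} for i in range(30)]
    + [{"alertname": "HighErrorRate", "cluster": "prod-a", "job": f"app-{i}"} for i in range(10)]
)

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    print(key, len(members))  # one notification per key, listing all members
```

Each distinct (alertname, cluster) pair becomes one notification, which is why adding a per-host label such as instance to group_by would bring the storm back.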

Environment Requirements

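Prometheus >= 2.45 (alert rule evaluation engine; versioned independently of Alertmanager, but it must send alerts over the Alertmanager v2 API because Alertmanager 0.27 removed the v1 API).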

Alertmanager >= 0.27 (fixes key bugs in cluster mode).

OS: CentOS 7+ / Ubuntu 20.04+ (Alertmanager runs on 1 CPU + 1 GB RAM).

Network: Prometheus → Alertmanager on TCP 9093; Alertmanager cluster gossip on TCP/UDP 9094.

Notification channels: at least one of Email, DingTalk, Enterprise WeChat; redundancy recommended.

Installation and Configuration

Alertmanager installation

# create user
sudo useradd --no-create-home --shell /bin/false alertmanager
# download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cd alertmanager-0.27.0.linux-amd64
# install binaries
sudo cp alertmanager /usr/local/bin/
sudo cp amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
# create config directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
# verify version
alertmanager --version

Systemd service

sudo tee /etc/systemd/system/alertmanager.service > /dev/null << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=0.0.0.0:9093 \
  --web.external-url=http://alertmanager.example.com:9093 \
  --cluster.listen-address=0.0.0.0:9094 \
  --log.level=info \
  --data.retention=120h
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
EOF

Alertmanager main configuration (alertmanager.yml)

global:
  resolve_timeout: 5m
  smtp_from: '[email protected]'
  smtp_smarthost: 'smtp.example.com:465'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your_smtp_password'
  smtp_require_tls: false

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname','cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-webhook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-pager'
      group_wait: 10s
      repeat_interval: 1h
      continue: false
    - match:
        severity: warning
      receiver: 'warning-dingtalk'
      group_wait: 30s
      repeat_interval: 4h
      continue: false
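    # Sibling routes are tried in order and the first match wins (continue: false),
    # so alerts labeled severity critical or warning never reach the routes below.
    # Move the job/team routes above the severity routes, or set continue: true there.
    - match_re: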
        job: '(mysql|redis|mongodb).*'
      receiver: 'dba-dingtalk'
      group_wait: 30s
      repeat_interval: 2h
      continue: false
    - match:
        team: order
      receiver: 'order-team-webhook'
      continue: false
    - match:
        team: payment
      receiver: 'payment-team-webhook'
      continue: false

inhibit_rules:
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.+'
    equal: ['instance']
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname','instance']

receivers:
  - name: 'default-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/ops/send'
        send_resolved: true
  - name: 'critical-pager'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/critical/send'
        send_resolved: true
    email_configs:
      - to: '[email protected]'
        send_resolved: true
        headers:
          Subject: '[P0-CRITICAL] {{ .GroupLabels.alertname }}'
  - name: 'warning-dingtalk'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/warning/send'
        send_resolved: true
  - name: 'dba-dingtalk'
    webhook_configs:
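      # DingTalk's robot API does not accept Alertmanager's raw webhook payload, so send
      # through the local prometheus-webhook-dingtalk proxy ("dba" target name is an assumption).
      - url: 'http://localhost:8060/dingtalk/dba/send'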
        send_resolved: true
  - name: 'order-team-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/order/send'
        send_resolved: true
  - name: 'payment-team-webhook'
    webhook_configs:
      - url: 'http://localhost:8060/dingtalk/payment/send'
        send_resolved: true
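
The first inhibit rule above is easier to reason about with a small model. The following Python sketch approximates how one inhibit rule is checked; it is a simplification written for this article, not Alertmanager's code, and the three sample alerts are hypothetical.

```python
import re


def matches(labels, exact, regex):
    """True if labels satisfy every equality matcher and every fully anchored regex matcher."""
    return all(labels.get(k, "") == v for k, v in exact.items()) and all(
        re.fullmatch(p, labels.get(k, "")) is not None for k, p in regex.items()
    )


def is_inhibited(target, firing, rule):
    """Check one rule: a firing source alert must agree with target on every 'equal' label."""
    def is_source(a):
        return matches(a, rule.get("source_match", {}), rule.get("source_match_re", {}))

    def is_target(a):
        return matches(a, rule.get("target_match", {}), rule.get("target_match_re", {}))

    if not is_target(target):
        return False
    target_is_source = is_source(target)
    for source in firing:
        if not is_source(source):
            continue
        # Alerts matching both sides of a rule cannot inhibit each other (or themselves).
        if target_is_source and is_target(source):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {"source_match": {"alertname": "NodeDown"},
        "target_match_re": {"alertname": ".+"},
        "equal": ["instance"]}

# Hypothetical alerts: node exporter on port 9100, application on port 8080.
node_down = {"alertname": "NodeDown", "instance": "10.0.1.5:9100", "severity": "critical"}
cpu_high = {"alertname": "NodeCPUHigh", "instance": "10.0.1.5:9100", "severity": "warning"}
svc_down = {"alertname": "ServiceDown", "instance": "10.0.1.5:8080", "severity": "critical"}
firing = [node_down, cpu_high, svc_down]

print(is_inhibited(cpu_high, firing, rule))   # same instance label as NodeDown
print(is_inhibited(svc_down, firing, rule))   # instance differs by port
print(is_inhibited(node_down, firing, rule))  # the source alert itself
```

In this model the service alert is not suppressed, because its instance value carries a different port than the node exporter's. If that matches your labeling, the equal list needs a label both alerts share (for example a host-only label added through relabeling, which is an assumption about your setup).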

Prometheus Alert Rules

Rules are organized into groups: groups are evaluated concurrently, while the rules inside a group run sequentially at the group's interval. Example node‑monitoring rules:

# /etc/prometheus/rules/node_alerts.yml
groups:
- name: node_alerts
  rules:
  - alert: NodeDown
    expr: up{job="node-exporter"} == 0
    for: 2m
    labels:
      severity: critical
      team: ops
    annotations:
      summary: "Node {{ $labels.instance }} is down"
      description: "The node has not responded for more than 2 minutes; investigate immediately"
      runbook: "https://wiki.internal/runbook/node-down"
  - alert: NodeCPUHigh
    expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
    for: 5m
    labels:
      severity: warning
      team: ops
    annotations:
      summary: "{{ $labels.instance }} CPU usage {{ $value | humanizePercentage }}"
      description: "CPU has stayed above 85% for 5 minutes; check for abnormal processes"

Application‑service rules (example):

# /etc/prometheus/rules/app_alerts.yml
groups:
- name: app_alerts
  rules:
  - alert: ServiceDown
    expr: up{job=~"app-.*"} == 0
    for: 1m
    labels:
      severity: critical
      team: ops
    annotations:
      summary: "Service {{ $labels.job }} instance {{ $labels.instance }} is unreachable"
  - alert: HighErrorRate
    expr: sum by(job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by(job) (rate(http_requests_total[5m])) > 0.05
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.job }} HTTP 5xx error rate {{ $value | humanizePercentage }}"
      description: "Error rate has exceeded 5% for 3 minutes; check the application logs"
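
The for: field in these rules is what keeps short spikes from paging anyone. As a rough illustration (a simplified model, not Prometheus's rule engine), the sketch below walks an alert through inactive, pending, and firing for the HighErrorRate thresholds; the sample values and the 60 s evaluation interval are assumptions.

```python
def alert_states(samples, threshold, for_seconds):
    """samples: list of (timestamp_seconds, value), one per rule evaluation.

    Returns the alert state after each evaluation: the condition must hold
    continuously for at least for_seconds before the alert fires.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # any gap resets the timer
            state = "inactive"
        states.append(state)
    return states


# Hypothetical 5xx ratios, evaluated every 60 s against "> 0.05" with for: 3m.
samples = [(0, 0.02), (60, 0.08), (120, 0.09), (180, 0.03),
           (240, 0.07), (300, 0.07), (360, 0.08), (420, 0.09)]
states = alert_states(samples, threshold=0.05, for_seconds=180)
print(states)
```

The dip at 180 s resets the timer, so the alert only fires once the ratio has stayed above 5% from 240 s through 420 s.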

Case Studies

Alert Level Strategy (P0‑P3)

Four severity levels with distinct notification channels and response‑time goals:

P0 (core service down): phone, SMS, DingTalk; 5‑minute SLA.

P1 (degraded service): DingTalk + email; 15‑minute SLA.

P2 (resource warning): DingTalk group; 1‑hour SLA.

P3 (informational): email; next‑work‑day handling.

# Routing snippet implementing the levels
route:
  routes:
    - match:
        severity: critical
      receiver: 'p0-pager'
      group_wait: 10s
    - match:
        severity: warning
      receiver: 'p1-dingtalk'
      group_wait: 30s
    - match:
        severity: info
      receiver: 'p2-dingtalk'
      group_wait: 1m
    - match:
        severity: none
      receiver: 'p3-email'
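
The snippet above relies on first-match routing among sibling routes. A minimal Python sketch of that behavior follows; it models a single flat level only (no nested routes or regex matchers), and the fallback receiver name default-webhook is borrowed from the main configuration as an assumption, since the snippet itself does not set one.

```python
def route_alert(labels, routes, default_receiver):
    """Return the receivers for an alert.

    Sibling routes are tried in order; a match stops the search unless that
    route sets continue=True. With no match, the parent receiver is used.
    """
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default_receiver]


routes = [
    {"match": {"severity": "critical"}, "receiver": "p0-pager"},
    {"match": {"severity": "warning"}, "receiver": "p1-dingtalk"},
    {"match": {"severity": "info"}, "receiver": "p2-dingtalk"},
    {"match": {"severity": "none"}, "receiver": "p3-email"},
]

print(route_alert({"alertname": "NodeDown", "severity": "critical"}, routes, "default-webhook"))
print(route_alert({"alertname": "Unlabeled"}, routes, "default-webhook"))
```

An alert without a severity label falls through to the parent receiver, so it is worth deciding deliberately where unlabeled alerts should land.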

Enterprise WeChat Webhook Integration

Alertmanager's built‑in wechat_configs receiver targets the Enterprise WeChat (WeCom) application API and cannot post to a group robot webhook, so a lightweight Python script receives Alertmanager webhooks, formats them as markdown, and forwards them to the robot API.

#!/usr/bin/env python3
"""Forward Alertmanager webhook notifications to a WeCom (Enterprise WeChat) group robot."""
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
WECOM_WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_WECOM_KEY"


@app.route('/webhook', methods=['POST'])
def webhook():
    # silent=True yields None instead of an error on a missing or malformed JSON body
    data = request.get_json(silent=True) or {}
    status = data.get('status', 'unknown')
    alerts = data.get('alerts', [])
    if status == 'firing':
        color = "warning"
        title = f"Alerts firing ({len(alerts)})"
    else:
        color = "info"
        title = f"Alerts resolved ({len(alerts)})"
    # WeCom markdown supports <font color="info|comment|warning">
    content = [f'## <font color="{color}">{title}</font>\n']
    for alert in alerts[:10]:  # show up to 10 alerts
        l = alert.get('labels', {})
        a = alert.get('annotations', {})
        content.append(f"**{l.get('alertname', 'N/A')}**")
        content.append(f"> Instance: {l.get('instance', 'N/A')}")
        content.append(f"> Severity: {l.get('severity', 'N/A')}")
        content.append(f"> Summary: {a.get('summary', 'N/A')}\n")
    payload = {"msgtype": "markdown", "markdown": {"content": "\n".join(content)}}
    resp = requests.post(WECOM_WEBHOOK_URL, json=payload, timeout=10)
    return jsonify({"status": "ok", "wecom_response": resp.status_code})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8065)
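
Before pointing Alertmanager at the forwarder, it can help to send it a hand-built payload. The sketch below builds a minimal payload containing only the fields the handler reads (status, and labels/annotations per alert); the label values are made up, and the URL assumes the forwarder runs locally on port 8065 as in the script above.

```python
import json
import urllib.request


def build_test_payload(status="firing"):
    """Build a minimal Alertmanager-style webhook payload with hypothetical alert data."""
    return {
        "status": status,
        "alerts": [
            {
                "labels": {"alertname": "NodeCPUHigh",
                           "instance": "10.0.1.5:9100",
                           "severity": "warning"},
                "annotations": {"summary": "test alert, please ignore"},
            }
        ],
    }


def send_test_payload(url="http://localhost:8065/webhook"):
    """POST the test payload to the forwarder and return the HTTP status code."""
    body = json.dumps(build_test_payload()).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Calling send_test_payload() posts a real message to the WeCom group behind the configured key, so a test group is the safer target.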

Best Practices & Security

Rule grouping : Split 600+ rules into ~12 groups; evaluation time dropped from 3.2 s to 0.8 s.

Avoid heavy PromQL : Expressions that aggregate over high‑cardinality series, join with group_left, or rewrite labels with label_replace are recomputed on every evaluation; use recording rules to pre‑aggregate them.

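Evaluation interval tuning : Prometheus's built‑in default for evaluation_interval is 1m, although many deployments set 15 s globally; give low‑frequency alert groups such as disk space a longer per‑group interval (e.g., 60 s).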

Alertmanager grouping : Use group_by: ['alertname','cluster'] to merge similar alerts into one notification; avoid adding instance, which would split each group into one notification per host.

Access control : Enable Basic Auth for the Alertmanager UI and API (for example via --web.config.file or a reverse proxy) to prevent unauthorized silencing.

Webhook network isolation : Keep DingTalk/WeChat webhook services on internal network; never expose access tokens publicly.

Mask sensitive data in templates; only expose necessary diagnostics.

Silence audit : A periodic script scans for silences older than 7 days and raises a reminder.
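
The silence audit can be sketched as a small filter over the JSON returned by Alertmanager's /api/v2/silences endpoint. The HTTP fetch and the reminder delivery are left out so the logic can be tested offline; the sample silences and the fixed "now" are made up, and the field names used (status.state, startsAt) are the ones that endpoint returns.

```python
from datetime import datetime, timedelta, timezone


def stale_silences(silences, now, max_age=timedelta(days=7)):
    """Return active silences whose startsAt is more than max_age before now."""
    stale = []
    for s in silences:
        if s.get("status", {}).get("state") != "active":
            continue  # expired and pending silences are not audited here
        # Normalize the trailing "Z" so older Python versions can parse the timestamp.
        started = datetime.fromisoformat(s["startsAt"].replace("Z", "+00:00"))
        if now - started > max_age:
            stale.append(s)
    return stale


# Hypothetical data shaped like the /api/v2/silences response.
silences = [
    {"id": "a", "status": {"state": "active"}, "startsAt": "2024-05-01T08:00:00.000Z",
     "createdBy": "ops", "comment": "db migration"},
    {"id": "b", "status": {"state": "active"}, "startsAt": "2024-05-18T08:00:00.000Z",
     "createdBy": "ops", "comment": "kernel patching"},
    {"id": "c", "status": {"state": "expired"}, "startsAt": "2024-04-01T08:00:00.000Z",
     "createdBy": "ops", "comment": "old"},
]
now = datetime(2024, 5, 20, tzinfo=timezone.utc)

for s in stale_silences(silences, now):
    print(s["id"], s["createdBy"], s["comment"])
```

In a real job, the list would come from the API and the output would feed whichever reminder channel the team already uses.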

High Availability

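Deploy at least three Alertmanager instances in cluster mode, gossiping on port 9094 (TCP and UDP). The startup flags below give each instance the other two as peers. Prometheus should be configured to send alerts to every instance directly rather than through a load balancer, so that the cluster can deduplicate notifications. Redundant notification channels (e.g., DingTalk + phone) ensure alerts are delivered even if one channel fails.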

# Instance 1
alertmanager --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.51:9094 --cluster.peer=10.0.1.52:9094
# Instance 2
alertmanager --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.50:9094 --cluster.peer=10.0.1.52:9094
# Instance 3
alertmanager --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=10.0.1.50:9094 --cluster.peer=10.0.1.51:9094
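
When the node list changes often, the peer flags can be generated rather than hand-edited. This is a small convenience sketch, not part of Alertmanager; it assumes the three instances are 10.0.1.50, 10.0.1.51, and 10.0.1.52, as implied by the peer lists above.

```python
def cluster_flags(nodes, port=9094):
    """For each node, build the listen flag plus one --cluster.peer flag per other node."""
    flags = {}
    for node in nodes:
        peers = [f"--cluster.peer={other}:{port}" for other in nodes if other != node]
        flags[node] = [f"--cluster.listen-address=0.0.0.0:{port}"] + peers
    return flags


nodes = ["10.0.1.50", "10.0.1.51", "10.0.1.52"]
for node, args in cluster_flags(nodes).items():
    print(node, " ".join(args))
```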

Troubleshooting

Validate rule syntax with promtool check rules /etc/prometheus/rules/*.yml.

Validate Alertmanager config with amtool check-config /etc/alertmanager/alertmanager.yml.

Test routing:

amtool --alertmanager.url=http://localhost:9093 config routes test severity=warning alertname=NodeCPUHigh

Inspect logs via journalctl -u alertmanager -f and check for suppressed or inhibited alerts.

Check cluster status: curl -s http://localhost:9093/api/v2/status | jq .cluster.

Performance Monitoring

Key metrics exposed on /metrics include:

alertmanager_notifications_failed_total : a counter; its rate of increase should stay at 0.

alertmanager_notification_latency_seconds : typically under 5 s; alert if it exceeds 30 s.

prometheus_rule_group_duration_seconds (exposed by Prometheus) : aim for under 1 s; alert if it exceeds 5 s.

Active alert count (alertmanager_alerts{state="active"}) : monitor for spikes.
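
For a quick manual check outside Prometheus, the /metrics text can be parsed directly. The sketch below is a simplified parser that assumes no trailing timestamps on sample lines; the sample text is made up for illustration, and real output carries more series and labels.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}, skipping comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (assumes no timestamp follows it).
        series, _, raw = line.rpartition(" ")
        values[series] = float(raw)
    return values


def failed_notifications(values):
    """Sum alertmanager_notifications_failed_total across all label sets."""
    name = "alertmanager_notifications_failed_total"
    return sum(v for k, v in values.items() if k == name or k.startswith(name + "{"))


# Hypothetical excerpt of Alertmanager's /metrics output.
sample = """
# HELP alertmanager_notifications_failed_total The total number of failed notifications.
# TYPE alertmanager_notifications_failed_total counter
alertmanager_notifications_failed_total{integration="email"} 0
alertmanager_notifications_failed_total{integration="webhook"} 3
alertmanager_alerts{state="active"} 12
"""

values = parse_metrics(sample)
print(failed_notifications(values))
print(values['alertmanager_alerts{state="active"}'])
```

Because the failure metric is a counter, a one-off reading only shows the total since startup; comparing two readings, or alerting on its rate in Prometheus, shows whether failures are still happening.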

Backup & Restore

#!/usr/bin/env bash
# Backup script (alertmanager_backup.sh)
set -euo pipefail
BACKUP_DIR="/data/backup/alertmanager"
DATE=$(date +%Y%m%d)
mkdir -p "${BACKUP_DIR}"
# Config files
tar czf "${BACKUP_DIR}/alertmanager_config_${DATE}.tar.gz" /etc/alertmanager/ /etc/prometheus/rules/
# State data (silences, notifications)
tar czf "${BACKUP_DIR}/alertmanager_data_${DATE}.tar.gz" /var/lib/alertmanager/
# Cleanup older than 30 days
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +30 -delete

Restore steps: stop Alertmanager, extract the appropriate tarballs relative to / (for example with tar xzf <archive> -C /), then start the service.
