How to Build Production‑Grade Prometheus Alert Rules and Silence Policies in 10 Minutes

This guide walks SRE and operations teams through setting up Prometheus alert rule templates, defining severity/team/service labels, configuring Alertmanager routing and receivers, testing alerts, creating scheduled silences, automating silence management via API, implementing inhibition rules, establishing Git‑based review pipelines, persisting alert history to MySQL, and applying security, performance, and compliance best practices.

MaGe Linux Operations

Prometheus Alert Rule Templates and Silence Policies: Achieve Production‑Grade Zero‑False‑Positive Alerts in 10 Minutes

Applicable Scenarios & Prerequisites

Applicable teams: Medium‑to‑large SRE/operations teams that handle more than 50 alerts per day and need tiered handling and on‑call management.

Prerequisites :

Prometheus ≥ 2.30 (supports rule validation) and Alertmanager ≥ 0.24 (supports silence API v2)

Access ports: Prometheus 9090, Alertmanager 9093

Read/write permission to /etc/prometheus/rules/ and ability to reload configuration (systemctl reload prometheus)

Network: Alertmanager must reach webhook receivers (e.g., WeChat, DingTalk, PagerDuty)

Environment & Version Matrix

Prometheus 2.30+ (recommended 2.45+), OS RHEL 7/8, Ubuntu 20.04/22.04, resources 2C/4G/50G SSD

Alertmanager 0.24+ (recommended 0.26+), same OS, resources 1C/2G/10G SSD

Rule validation tool: promtool (built‑in)

Network: ensure connectivity to target SMTP/Webhook ports

Quick Checklist

Validate existing rule syntax (promtool check rules)

Define severity/team/service labels

Write core alert rules (CPU, memory, disk, process, latency)

Configure Alertmanager routing and receivers (label‑based)

Test alert firing and notification

Create scheduled silences (maintenance windows, test env)

Verify silence effectiveness (Alertmanager logs & API)

Configure alert inhibition rules (node‑level suppresses pod alerts)

Establish alert‑rule review workflow (Git + CI validation)

Persist alert history and audit to database

Step 1: Validate Current Rule Syntax and Performance Impact

Pre‑check : List rule files.

# RHEL/CentOS/Ubuntu common
ls -lh /etc/prometheus/rules/*.yml

Syntax check (must pass) :

promtool check rules /etc/prometheus/rules/*.yml

Expected output shows SUCCESS for each file.

Key parameters :

promtool check rules performs a static syntax check; nothing is loaded into Prometheus

On failure it prints the line number and error (e.g., unknown function or label mismatch)

Step 2: Define Alert Severity Labeling Scheme

Create a label definition file (documentation only):

# Alert severity (required)
severity:
- critical   # P0, immediate phone call
- warning    # P1, respond within 15 min
- info       # P2, routine inspection

# Responsible team (required)
team:
- sre
- backend
- frontend
- dba

# Service identifier (required)
service:
- api-gateway
- user-service
- order-service
- mysql-cluster

Consistency notes :

Every alert must carry the severity, team, and service labels

Label values must match the values used in the Alertmanager routing configuration
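The labeling scheme can be enforced mechanically. The following is a minimal sketch, assuming each rule has already been loaded from YAML into a plain dict; the function name `label_problems` and the `ALLOWED` table are illustrative, not part of any Prometheus tooling. Note that the node rules in Step 3 use service values such as infrastructure and kubernetes, which are not in the list above, so one of the two would need extending.

```python
# Allowed values copied from the label scheme above; extend as teams/services are added.
ALLOWED = {
    "severity": {"critical", "warning", "info"},
    "team": {"sre", "backend", "frontend", "dba"},
    "service": {"api-gateway", "user-service", "order-service", "mysql-cluster"},
}


def label_problems(rule: dict) -> list[str]:
    """Return human-readable problems for one alerting rule (empty list = OK)."""
    labels = rule.get("labels") or {}
    problems = []
    for key, allowed in ALLOWED.items():
        value = labels.get(key)
        if value is None:
            problems.append(f"missing label: {key}")
        elif value not in allowed:
            problems.append(f"unknown {key}: {value}")
    return problems


# A hypothetical rule as it would look after loading the YAML file into a dict.
rule = {"alert": "OrderLatencyHigh", "labels": {"severity": "warning", "team": "backend"}}
print(label_problems(rule))
```

A check like this could run next to promtool in CI, since promtool validates syntax but knows nothing about a team's label conventions.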

Step 3: Write Core Alert Rules (Node Monitoring Example)

Create /etc/prometheus/rules/node-alerts.yml:

groups:
- name: node-critical
  interval: 30s
  rules:
  # CPU high load
  - alert: NodeCPUHighLoad
    expr: (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
    for: 3m
    labels:
      severity: warning
      team: sre
      service: infrastructure
    annotations:
      summary: "Node CPU usage continuously > 80%"
      description: "Instance {{ $labels.instance }} CPU usage {{ $value | humanize }}%, lasting 3 min"
      runbook: "https://wiki.example.com/runbook/cpu-high"

  # Memory low
  - alert: NodeMemoryLow
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
    for: 5m
    labels:
      severity: critical
      team: sre
      service: infrastructure
    annotations:
      summary: "Node available memory < 10%"
      description: "Instance {{ $labels.instance }} available memory {{ $value | humanize }}%, possible OOM"
      runbook: "https://wiki.example.com/runbook/memory-low"

  # Root disk usage > 85%
  - alert: NodeDiskSpaceHigh
    expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes) * 100 < 15
    for: 10m
    labels:
      severity: warning
      team: sre
      service: infrastructure
    annotations:
      summary: "Root partition free space < 15%"
      description: "Instance {{ $labels.instance }} root free {{ $value | humanize }}%"

  # Disk busy time > 80% (device utilization from io_time, not CPU iowait)
  - alert: NodeDiskIOWaitHigh
    expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
    for: 5m
    labels:
      severity: warning
      team: sre
      service: infrastructure
    annotations:
      summary: "Disk I/O utilization > 80%"
      description: "Instance {{ $labels.instance }} device {{ $labels.device }} I/O busy {{ $value | humanize }}%"

  # Critical process missing (e.g., kubelet)
  - alert: NodeProcessDown
    expr: node_systemd_unit_state{name="kubelet.service",state="active"} != 1
    for: 1m
    labels:
      severity: critical
      team: sre
      service: kubernetes
    annotations:
      summary: "Critical process kubelet not running"
      description: "Instance {{ $labels.instance }} kubelet.service abnormal"

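Key parameters :

for: 3m requires the condition to hold for 3 minutes before the alert fires, which filters transient spikes

rate(...[5m]) averages over a 5‑minute window and smooths short‑term fluctuations

humanizePercentage formats a 0–1 ratio as a percentage; values already scaled to 0–100 should use humanize instead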
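The effect of for can be pictured as a small state machine. The sketch below is a simplified model, not Prometheus code: it assumes one boolean expression result per evaluation and ignores details such as staleness or keep_firing_for. The function name `alert_states` is illustrative.

```python
def alert_states(condition_true: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Map per-evaluation expression results to inactive / pending / firing."""
    states = []
    active_since = None  # evaluation time at which the condition last became true
    for i, ok in enumerate(condition_true):
        now = i * interval_s
        if not ok:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Eight consecutive true evaluations at the 30 s group interval with for: 3m.
print(alert_states([True] * 8, interval_s=30, for_s=180))
```

In this model a single false evaluation sends the alert back to inactive, which is why short spikes never page anyone.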

Validate rule syntax:

promtool check rules /etc/prometheus/rules/node-alerts.yml

Hot‑reload without restart:

# RHEL/CentOS/Ubuntu (systemd; the unit must define ExecReload)
systemctl reload prometheus

# Or via HTTP API (Prometheus must be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

Verify load:

# List loaded rule groups
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'

# List active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'
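For scripted checks, the same response can be summarized in Python instead of jq. This is a minimal sketch: the `SAMPLE` body is hand-written in the shape of the /api/v1/alerts response rather than captured from a server, and `alerts_by_state` is a hypothetical helper name.

```python
import json
from collections import Counter


def alerts_by_state(response_body: str) -> Counter:
    """Count alerts per (alertname, state) in a /api/v1/alerts response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')}")
    return Counter(
        (alert["labels"].get("alertname", "?"), alert["state"])
        for alert in payload["data"]["alerts"]
    )


# Hand-written sample; instance names are placeholders.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "NodeCPUHighLoad", "instance": "node-01:9100"}, "state": "firing"},
        {"labels": {"alertname": "NodeCPUHighLoad", "instance": "node-02:9100"}, "state": "firing"},
        {"labels": {"alertname": "NodeMemoryLow", "instance": "node-01:9100"}, "state": "pending"},
    ]},
})

for (name, state), count in sorted(alerts_by_state(SAMPLE).items()):
    print(name, state, count)
```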

Step 4: Configure Alertmanager Routing and Receivers

Edit /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'YOUR_FROM_ADDRESS'
  smtp_auth_username: 'YOUR_SMTP_USERNAME'
  smtp_auth_password: 'YOUR_PASSWORD'

route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # P0 alerts: phone + WeChat
    - match:
        severity: critical
      receiver: 'oncall-phone'
      group_wait: 10s
      repeat_interval: 30m

    # DBA team alerts
    - match:
        team: dba
      receiver: 'dba-wechat'

    # Test environment: email only
    - match_re:
        env: 'test|dev'
      receiver: 'test-email'
      group_interval: 1h

receivers:
  - name: 'default-email'
    email_configs:
      - to: 'YOUR_TEAM_ADDRESS'
        headers:
          Subject: '[{{ .CommonLabels.severity }}] {{ .GroupLabels.alertname }}'

  - name: 'oncall-phone'
    webhook_configs:
      - url: 'https://api.pagerduty.com/webhook/xxx'
        send_resolved: true

  - name: 'dba-wechat'
    wechat_configs:
      - corp_id: 'YOUR_CORP_ID'
        to_user: 'dba-team'
        agent_id: 'YOUR_AGENT_ID'
        api_secret: 'YOUR_SECRET'

  - name: 'test-email'
    email_configs:
      - to: 'YOUR_TEST_ADDRESS'

# Inhibition rules (suppress pod alerts when node is down)
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Pod.*'
    equal: ['instance']

  - source_match:
      alertname: 'MySQLMasterDown'
      service: 'mysql-cluster'
    target_match:
      alertname: 'MySQLReplicationLag'
    equal: ['cluster']

  - source_match:
      alertname: 'NetworkUnreachable'
    target_match_re:
      alertname: '.*Timeout'
    equal: ['instance']

Key parameters :

group_by merges alerts that share the listed label values into one notification

inhibit_rules suppresses downstream (target) alerts while a matching source alert is firing
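Route order matters because the first matching child route wins unless continue is set. The sketch below models that behavior for the routing tree above, with the child routes nested under the root route. It is an illustration only: `matches`, `receivers_for`, and `ROOT` are hypothetical names, and Python's regex engine stands in for Alertmanager's anchored RE2 matching.

```python
import re


def matches(route: dict, labels: dict) -> bool:
    """True when all match / match_re conditions of a route hold (missing label = "")."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def receivers_for(route: dict, labels: dict) -> list[str]:
    """Walk the tree depth-first; stop at the first matching child without continue."""
    if not matches(route, labels):
        return []
    found = []
    for child in route.get("routes", []):
        hit = receivers_for(child, labels)
        found.extend(hit)
        if hit and not child.get("continue", False):
            break
    # No child matched: the alert stays with this route's own receiver.
    return found or [route["receiver"]]


ROOT = {
    "receiver": "default-email",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "oncall-phone"},
        {"match": {"team": "dba"}, "receiver": "dba-wechat"},
        {"match_re": {"env": "test|dev"}, "receiver": "test-email"},
    ],
}

print(receivers_for(ROOT, {"severity": "critical", "team": "dba"}))
```

One consequence worth noticing: under this ordering a critical alert owned by the DBA team reaches only oncall-phone, because the severity route is listed first and does not set continue.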

Validate configuration:

amtool check-config /etc/alertmanager/alertmanager.yml

Hot‑reload:

systemctl reload alertmanager
# Or via HTTP API
curl -X POST http://localhost:9093/-/reload

Verify:

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Check routes
amtool config routes --alertmanager.url=http://localhost:9093

Step 5: Test Alert Triggering and Notification

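Manually generate high CPU load on a node (run for 10 min with one busy loop per core; a 5-minute burst on a single core is too short for the 5-minute rate window plus for: 3m):

# On the monitored node
for i in $(seq "$(nproc)"); do timeout 600 sh -c 'while true; do :; done' & done

Verify the alert fires (expect several minutes: the 5-minute average must first exceed 80 %, then hold for 3 min):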

# Check Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname=="NodeCPUHighLoad")'

# Check Alertmanager
amtool alert --alertmanager.url=http://localhost:9093 | grep NodeCPUHighLoad

Verify notification log entry:

journalctl -u alertmanager -f | grep -E 'Notify|webhook|email'
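How long the synthetic load must run follows from the rule itself: rate(...[5m]) averages over five minutes and for: 3m adds a hold time. The following is a rough model under stated assumptions (all cores fully busy during the load, a 5 % baseline otherwise, evaluations every 30 s); `first_firing_time` and its parameters are illustrative names, not a Prometheus API.

```python
def first_firing_time(load_s, window_s=300, for_s=180, threshold=0.80,
                      baseline=0.05, step_s=30, horizon_s=1800):
    """Return seconds after load start when the alert first fires, or None."""
    active_since = None
    for now in range(0, horizon_s + 1, step_s):
        # Seconds of the look-back window [now - window_s, now] that overlap the load [0, load_s].
        overlap = max(0, min(now, load_s) - max(now - window_s, 0))
        # Windowed average of CPU busy fraction: 1.0 under load, baseline elsewhere.
        busy = (overlap * 1.0 + (window_s - overlap) * baseline) / window_s
        if busy > threshold:
            if active_since is None:
                active_since = now
            if now - active_since >= for_s:
                return now
        else:
            active_since = None
    return None


print(first_firing_time(300))  # 5-minute burst
print(first_firing_time(600))  # 10-minute burst
```

Under these assumptions a 300 s burst never stays above the threshold long enough to satisfy for: 3m, while a 600 s burst fires about seven minutes after the load starts.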

Step 6: Create Scheduled Silences (Maintenance Windows)

Example 1 – silence all alerts for node-01 for 2 h:

amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="SRE-OnCall" \
  --comment="Planned maintenance: replace memory" \
  --duration=2h \
  instance=~"node-01:.*"

Example 2 – silence all test‑environment alerts for 4 h:

amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Dev-Team" \
  --comment="Load test: silence test env alerts" \
  --duration=4h \
  env="test"

Example 3 – silence a specific alert during a deployment:

amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="Deploy-Pipeline" \
  --comment="Canary release: silence health‑check alert" \
  --duration=30m \
  alertname="ServiceHealthCheckFailed" \
  service="user-service"

Verify active silences:

amtool silence query --alertmanager.url=http://localhost:9093
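A silence applies only when every one of its matchers matches the alert's labels, and regex matchers must match the whole label value. The sketch below illustrates that rule using the matcher fields of the v2 API (name, value, isRegex, optional isEqual). The helper name `silenced` is hypothetical, and Python's re module is used as a stand-in for Alertmanager's regex engine.

```python
import re


def silenced(labels: dict, matchers: list[dict]) -> bool:
    """True when every matcher of a silence matches the alert's labels."""
    for m in matchers:
        value = labels.get(m["name"], "")  # a missing label behaves like an empty string
        if m.get("isRegex", False):
            hit = re.fullmatch(m["value"], value) is not None
        else:
            hit = value == m["value"]
        # isEqual=False turns the matcher into a negative one (!= or !~).
        if hit != m.get("isEqual", True):
            return False
    return True


# Matchers corresponding to Example 1 and Example 3 above.
node_01 = [{"name": "instance", "value": "node-01:.*", "isRegex": True}]
deploy = [
    {"name": "alertname", "value": "ServiceHealthCheckFailed", "isRegex": False},
    {"name": "service", "value": "user-service", "isRegex": False},
]

print(silenced({"alertname": "NodeMemoryLow", "instance": "node-01:9100"}, node_01))
print(silenced({"alertname": "ServiceHealthCheckFailed", "service": "order-service"}, deploy))
```

Checking a planned silence against a few representative label sets this way is a cheap dry run before the silence is created.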

Step 7: Automate Silence Management via API

Create silence (JSON API):

curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name":"alertname","value":"NodeDown","isRegex":false},
      {"name":"instance","value":"node-02:9100","isRegex":false}
    ],
    "startsAt":"2025-10-31T10:00:00Z",
    "endsAt":"2025-10-31T12:00:00Z",
    "createdBy":"automation-script",
    "comment":"Auto‑created: periodic maintenance window"
}'

Delete silence:

SILENCE_ID="9a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d"
curl -X DELETE http://localhost:9093/api/v2/silence/$SILENCE_ID

Step 8: Configure Alert Inhibition Rules (Avoid Cascading Alerts)

Edit the inhibit_rules section of alertmanager.yml as shown earlier. Key fields:

source_match: condition that triggers suppression

target_match_re: regex for alerts to be suppressed

equal: labels that must be identical on source and target (e.g., instance, cluster)
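The interaction of the three fields can be sketched as follows. This is a simplified model of inhibition, assuming alerts are plain label dicts and that a missing label compares as an empty string; `inhibited` and `_match` are illustrative names, not Alertmanager functions.

```python
import re


def _match(labels, exact=None, regex=None):
    """All exact and regex conditions must hold; regexes match the whole value."""
    exact, regex = exact or {}, regex or {}
    return (all(labels.get(k, "") == v for k, v in exact.items())
            and all(re.fullmatch(p, labels.get(k, "")) for k, p in regex.items()))


def inhibited(target: dict, firing: list[dict], rule: dict) -> bool:
    """True when some firing source alert suppresses the target under this rule."""
    if not _match(target, rule.get("target_match"), rule.get("target_match_re")):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if not _match(source, rule.get("source_match"), rule.get("source_match_re")):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False


# First rule from the configuration: NodeDown suppresses Pod.* on the same instance.
RULE = {
    "source_match": {"severity": "critical", "alertname": "NodeDown"},
    "target_match_re": {"alertname": "Pod.*"},
    "equal": ["instance"],
}
node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "node-01:9100"}
pod_alert = {"alertname": "PodCrashLooping", "instance": "node-01:9100"}

print(inhibited(pod_alert, [node_down], RULE))
```

The equal check is the part that most often surprises: if the pod alert carries a different instance value than the node alert, the rule does not apply, so both alerts need a shared label with identical values.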

Step 9: Establish Alert‑Rule Review Process (Git + CI)

Version‑control rule files:

cd /etc/prometheus
git init
git add rules/*.yml prometheus.yml
git commit -m "Initial alert rules"
git branch -M main   # make sure the local branch is named main before pushing
git remote add origin git@YOUR_GIT_HOST:sre/prometheus-config.git
git push -u origin main

GitLab CI validation pipeline ( .gitlab-ci.yml):

stages:
  - validate

validate-rules:
  stage: validate
  image:
    name: prom/prometheus:v2.45.0
    entrypoint: [""]   # override the image's prometheus entrypoint so the script can run
  script:
    - promtool check rules rules/*.yml
    - promtool check config prometheus.yml
  only:
    - merge_requests
    - main

Pre‑commit hook to enforce syntax locally:

cat > /etc/prometheus/.git/hooks/pre-commit <<'EOF'
#!/bin/bash
promtool check rules /etc/prometheus/rules/*.yml || { echo "Rule syntax error, commit rejected"; exit 1; }
EOF
chmod +x /etc/prometheus/.git/hooks/pre-commit

Step 10: Persist Alert History and Audit (MySQL)

Deploy an Alertmanager webhook logger container:

docker run -d \
  --name alertmanager-logger \
  -p 9099:9099 \
  -e DATABASE_URL="mysql://alert_user:password@mysql-server:3306/alerts" \
  tomtom/alertmanager-webhook-logger:latest

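Add the webhook receiver to Alertmanager and list its route first, so that continue: true lets every alert fall through to the remaining routes: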

receivers:
  - name: 'audit-logger'
    webhook_configs:
      - url: 'http://localhost:9099/webhook'
        send_resolved: true

route:
  routes:
    - receiver: 'audit-logger'
      continue: true

Example MySQL table schema:

CREATE TABLE alert_history (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  alert_name VARCHAR(128) NOT NULL,
  severity VARCHAR(32),
  instance VARCHAR(256),
  status VARCHAR(32),
  starts_at DATETIME NULL,
  ends_at DATETIME NULL,  -- NULL while the alert is still firing
  labels JSON,
  annotations JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_alert_name (alert_name),
  INDEX idx_starts_at (starts_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
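To connect the webhook to this table, each alert in the notification body has to be flattened into one row. The sketch below shows one way to do that, assuming the standard Alertmanager webhook JSON (an alerts array with status, labels, annotations, startsAt, endsAt) and UTC timestamps. `rows_from_webhook`, `_sql_time`, and `INSERT_SQL` are hypothetical names, and the database call itself is omitted.

```python
import json

# Parameterized statement matching the alert_history columns (placeholder style of common MySQL drivers).
INSERT_SQL = (
    "INSERT INTO alert_history "
    "(alert_name, severity, instance, status, starts_at, ends_at, labels, annotations) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
)


def _sql_time(ts):
    """Convert an RFC 3339 timestamp to 'YYYY-MM-DD HH:MM:SS'; assumes UTC input."""
    # Alertmanager sends year 0001 as endsAt while an alert is still firing.
    if not ts or ts.startswith("0001-01-01"):
        return None
    return ts[:19].replace("T", " ")  # drops fractional seconds and the zone suffix


def rows_from_webhook(body: str) -> list[tuple]:
    """Turn one webhook notification into parameter tuples for INSERT_SQL."""
    payload = json.loads(body)
    rows = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        rows.append((
            labels.get("alertname", ""),
            labels.get("severity"),
            labels.get("instance"),
            alert.get("status"),
            _sql_time(alert.get("startsAt")),
            _sql_time(alert.get("endsAt")),
            json.dumps(labels, sort_keys=True),
            json.dumps(alert.get("annotations", {}), sort_keys=True),
        ))
    return rows


# Hand-written sample notification, not captured from a real Alertmanager.
SAMPLE = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "NodeMemoryLow", "severity": "critical", "instance": "node-01:9100"},
        "annotations": {"summary": "Node available memory < 10%"},
        "startsAt": "2025-10-31T10:00:00.000Z",
        "endsAt": "0001-01-01T00:00:00Z",
    }],
})

for row in rows_from_webhook(SAMPLE):
    print(row)
```

Each tuple can be passed to a driver's execute call together with INSERT_SQL; storing the full label set as JSON keeps the table useful even when new labels are introduced.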

Monitoring & Alerting Metrics

Self‑monitoring Prometheus metrics (desired thresholds):

# Rule evaluation latency (should be < 1 s)
prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 1

# Alertmanager notification failure rate (should be < 1 %)
rate(alertmanager_notifications_failed_total[5m]) / rate(alertmanager_notifications_total[5m]) > 0.01

# Active alert count (alert storm detection; each ALERTS series has value 1, so count them)
count(ALERTS{alertstate="firing"}) > 50

# Number of active silences (avoid over‑silencing)
alertmanager_silences{state="active"} > 20

Performance & Capacity

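Rule file check benchmark (100 rules < 1 s). This times parsing and validation only; actual evaluation time per group is exposed by prometheus_rule_group_last_duration_seconds:

time promtool check rules /etc/prometheus/rules/*.yml
# Expected: < 1 s for 100 rules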

Alert throughput test (simulate 100 alerts):

for i in {1..100}; do
  amtool alert add \
    --alertmanager.url=http://localhost:9093 \
    --annotation=summary="Test alert $i" \
    alertname="LoadTest" instance="test-$i" &
done
wait

time amtool alert query --alertmanager.url=http://localhost:9093

Target: a single Alertmanager instance handling 1000 alerts/sec with notification delay < 30 s; verify these figures on your own hardware.

Security & Compliance

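Access control – basic auth via Nginx reverse proxy. Nginx cannot listen on the port Prometheus already uses on the same host, so bind Prometheus to 127.0.0.1:9090 with --web.listen-address and expose a different port (19090 here is an example):

server {
  listen 19090;
  location / {
    auth_basic "Prometheus";
    auth_basic_user_file /etc/nginx/.htpasswd;
    proxy_pass http://127.0.0.1:9090;
  }
}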

Generate password file with htpasswd.

Webhook security – bearer token and TLS verification:

receivers:
  - name: 'secure-webhook'
    webhook_configs:
      - url: 'https://receiver.example.com/webhook'
        http_config:
          bearer_token: 'YOUR_SECRET_TOKEN'
          tls_config:
            insecure_skip_verify: false

Audit logging – JSON logs via systemd (Alertmanager writes its log to stderr; the append: target needs systemd ≥ 240, on older hosts read the log from journald instead):

# /etc/systemd/system/alertmanager.service
[Service]
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --log.level=info \
  --log.format=json
StandardError=append:/var/log/alertmanager/audit.log

Query audit log with jq for silence creation or notification failures.
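The same filtering can be done in Python when jq is not available. This is a sketch under assumptions: the field names level and msg are what JSON-formatted logs commonly use, but they may differ between Alertmanager versions, and the sample lines below are invented for illustration rather than real log output.

```python
import json


def matching_entries(lines, level=None, contains=None):
    """Return parsed log entries filtered by level and/or a substring of the message."""
    out = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if not isinstance(entry, dict):
            continue
        if level and entry.get("level") != level:
            continue
        if contains and contains.lower() not in str(entry.get("msg", "")).lower():
            continue
        out.append(entry)
    return out


# Invented sample lines; field names and messages are assumptions.
SAMPLE_LINES = [
    '{"level":"info","msg":"Loading configuration file","ts":"2025-10-31T10:00:00Z"}',
    '{"level":"error","msg":"Notify for alerts failed","ts":"2025-10-31T10:05:00Z"}',
    'not json at all',
]

for entry in matching_entries(SAMPLE_LINES, level="error"):
    print(entry["ts"], entry["msg"])
```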

Common Failures & Troubleshooting

Symptom: Alert rule not firing
Diagnostic command: promtool check rules /etc/prometheus/rules/*.yml
Possible root cause: Syntax error or rule not loaded
Quick fix: Fix syntax and reload Prometheus
Permanent fix: CI validation + pre‑commit hook

Symptom: Notification not sent
Diagnostic command: journalctl -u alertmanager -f
Possible root cause: Routing mismatch or receiver config error
Quick fix: Check label matching
Permanent fix: Test receiver connectivity and credentials

Symptom: Duplicate notifications
Diagnostic command: amtool alert query
Possible root cause: repeat_interval too short
Quick fix: Increase repeat_interval to 4 h
Permanent fix: Set per‑severity repeat intervals

Symptom: Silence ineffective
Diagnostic command: amtool silence query
Possible root cause: Label mismatch or wrong time window
Quick fix: Use regex matchers (=~) and verify the timezone of the window
Permanent fix: Dry‑run silences before applying

Symptom: Alertmanager OOM
Diagnostic command: ps aux | grep alertmanager
Possible root cause: Too many active alerts/history
Quick fix: Increase memory limit, clean old silences
Permanent fix: Enable aggregation, persist to external store

Symptom: Alert storm
Diagnostic command: amtool alert | wc -l
Possible root cause: Threshold too low or exporter issue
Quick fix: Temporarily disable rule or create global silence
Permanent fix: Tune thresholds, add a for duration

Change & Rollback Playbook

Maintenance Window

Recommended time: 02:00‑04:00 (low traffic). Preconditions: test in staging, backup configs, create global silence, notify on‑call.

Canary Deployment

Deploy new rules to a single Prometheus instance, observe 30 min, then roll out to the rest via Ansible.

Rollback Conditions

New rule causes > 20 % false‑positive rate

Alert storm (> 100 alerts in 5 min)

Notification failure rate > 10 %

Rollback steps: restore backup rule files, reload Prometheus, restore Alertmanager config, reload, verify with promtool check rules and amtool config routes.
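The rollback conditions are simple enough to encode, which keeps the decision consistent during an incident. The sketch below assumes the three inputs have already been measured (rates as fractions between 0 and 1); `rollback_reasons` is a hypothetical helper, and the thresholds are copied from the list above.

```python
def rollback_reasons(false_positive_rate: float, alerts_last_5m: int,
                     notify_failure_rate: float) -> list[str]:
    """Return the rollback conditions that are currently violated (empty = keep going)."""
    reasons = []
    if false_positive_rate > 0.20:
        reasons.append("false-positive rate above 20 %")
    if alerts_last_5m > 100:
        reasons.append("alert storm: more than 100 alerts in 5 min")
    if notify_failure_rate > 0.10:
        reasons.append("notification failure rate above 10 %")
    return reasons


print(rollback_reasons(false_positive_rate=0.25, alerts_last_5m=40, notify_failure_rate=0.02))
```

Returning the reasons rather than a bare boolean makes it easy to paste the trigger into the change ticket when a rollback happens.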

Best Practices

Enforce alert tiering – every rule must have severity/team/service labels.

Set for ≥ 3 min to filter noise.

Group alerts by alertname and cluster to reduce notification volume.

Require --author and --comment for all silences; audit monthly.

Inhibition priority: node > network > application.

Link runbooks in annotations.runbook for P0/P1 alerts.

Control alert fatigue – repeat_interval ≥ 4 h for critical, ≥ 12 h for P2.

Isolate test environment alerts with env label.

Quarterly prune rules not triggered for 90 days.

Aim for notification count ≤ 20 % of raw alerts via aggregation and inhibition.

Appendix – Production Configuration Sample

Full production rule file ( /etc/prometheus/rules/production.yml) and Ansible task snippet are provided in the original source (omitted here for brevity).

Reference Images

Prometheus alerting diagram
