How to Build Production‑Grade Prometheus Alert Rules and Silence Policies in 10 Minutes
This guide walks SRE and operations teams through setting up Prometheus alert rule templates, defining severity/team/service labels, configuring Alertmanager routing and receivers, testing alerts, creating scheduled silences, automating silence management via API, implementing inhibition rules, establishing Git‑based review pipelines, persisting alert history to MySQL, and applying security, performance, and compliance best practices.
Prometheus Alert Rule Templates and Silence Policies: Achieve Production‑Grade Zero‑False‑Positive Alerts in 10 Minutes
Applicable Scenarios & Prerequisites
Applicable business: medium-to-large SRE/operations teams that handle more than 50 alerts per day and need tiered handling and on‑call management.
Prerequisites:
Prometheus ≥ 2.30 (supports rule validation) and Alertmanager ≥ 0.24 (supports silence API v2)
Access ports: Prometheus 9090, Alertmanager 9093
Read/write permission to /etc/prometheus/rules/ and the ability to reload the configuration (systemctl reload prometheus)
Network: Alertmanager must reach webhook receivers (e.g., WeChat, DingTalk, PagerDuty)
Environment & Version Matrix
Prometheus 2.30+ (recommended 2.45+), OS RHEL 7/8, Ubuntu 20.04/22.04, resources 2C/4G/50G SSD
Alertmanager 0.24+ (recommended 0.26+), same OS, resources 1C/2G/10G SSD
Rule validation tool: promtool (built‑in)
Network: ensure connectivity to target SMTP/Webhook ports
Quick Checklist
Validate existing rule syntax (promtool check rules)
Define severity/team/service labels
Write core alert rules (CPU, memory, disk, process, latency)
Configure Alertmanager routing and receivers (label‑based)
Test alert firing and notification
Create scheduled silences (maintenance windows, test env)
Verify silence effectiveness (Alertmanager logs & API)
Configure alert inhibition rules (node‑level suppresses pod alerts)
Establish alert‑rule review workflow (Git + CI validation)
Persist alert history and audit to database
Step 1: Validate Current Rule Syntax and Performance Impact
Pre‑check: list the rule files.
# RHEL/CentOS/Ubuntu common
ls -lh /etc/prometheus/rules/*.yml
Syntax check (must pass):
promtool check rules /etc/prometheus/rules/*.yml
Expected output shows SUCCESS for each file.
Key parameters:
promtool check rules performs a static syntax check only; it does not load the rules into Prometheus.
On failure it prints the line number and the error (e.g., an unknown function or a label mismatch).
Step 2: Define Alert Severity Labeling Scheme
Create a label definition file (documentation only):
# Alert severity (required)
severity:
- critical # P0, immediate phone call
- warning # P1, respond within 15 min
- info # P2, routine inspection
# Responsible team (required)
team:
- sre
- backend
- frontend
- dba
# Service identifier (required)
service:
- api-gateway
- user-service
- order-service
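- mysql-cluster
Idempotency notes:
Every alert must carry severity, team, and service labels.
Label values must match the Alertmanager routing configuration. Values used by rules but missing from this list (Step 3 uses infrastructure and kubernetes for service) should be added to the scheme first.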
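The labeling scheme above lends itself to an automated check. The following is a minimal sketch, assuming the rule file has already been parsed into a Python dictionary (for example with a YAML library, which is not shown here). The function name find_label_violations and the example data are hypothetical, not part of the source.

```python
# Allowed values, copied from the label definition file above.
REQUIRED_LABELS = {
    "severity": {"critical", "warning", "info"},
    "team": {"sre", "backend", "frontend", "dba"},
    "service": {"api-gateway", "user-service", "order-service", "mysql-cluster"},
}


def find_label_violations(rule_groups, required=REQUIRED_LABELS):
    """Return readable problems for alerting rules with missing or unknown labels."""
    problems = []
    for group in rule_groups.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                # Recording rules are skipped; only alerting rules are routed.
                continue
            labels = rule.get("labels", {})
            for name, allowed in required.items():
                value = labels.get(name)
                if value is None:
                    problems.append(f"{rule['alert']}: missing label '{name}'")
                elif value not in allowed:
                    problems.append(f"{rule['alert']}: {name}='{value}' is not in the scheme")
    return problems


# Hypothetical parsed rule file: the second rule has a typo and a missing label.
example = {
    "groups": [
        {
            "name": "node-critical",
            "rules": [
                {"alert": "MySQLDiskLow",
                 "labels": {"severity": "critical", "team": "dba", "service": "mysql-cluster"}},
                {"alert": "NodeCPUHighLoad",
                 "labels": {"severity": "warn", "team": "sre"}},
            ],
        }
    ]
}

for problem in find_label_violations(example):
    print(problem)
```

A check like this could run next to promtool in the review pipeline, since promtool validates syntax but knows nothing about a team's label conventions.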
Step 3: Write Core Alert Rules (Node Monitoring Example)
Create /etc/prometheus/rules/node-alerts.yml:
groups:
  - name: node-critical
    interval: 30s
    rules:
      # CPU high load
      - alert: NodeCPUHighLoad
        expr: (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
        for: 3m
        labels:
          severity: warning
          team: sre
          service: infrastructure
        annotations:
          summary: "Node CPU usage continuously > 80%"
          description: 'Instance {{ $labels.instance }} CPU usage {{ printf "%.1f" $value }}%, lasting 3 min'
          runbook: "https://wiki.example.com/runbook/cpu-high"
      # Memory low
      - alert: NodeMemoryLow
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
          team: sre
          service: infrastructure
        annotations:
          summary: "Node available memory < 10%"
          description: 'Instance {{ $labels.instance }} available memory {{ printf "%.1f" $value }}%, possible OOM'
          runbook: "https://wiki.example.com/runbook/memory-low"
      # Root disk usage > 85% (free space < 15%)
      - alert: NodeDiskSpaceHigh
        expr: (node_filesystem_avail_bytes{mountpoint="/",fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: warning
          team: sre
          service: infrastructure
        annotations:
          summary: "Root partition free space < 15%"
          description: 'Instance {{ $labels.instance }} root free {{ printf "%.1f" $value }}%'
      # Disk I/O utilization > 80% (share of time the device was busy)
      - alert: NodeDiskIOWaitHigh
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
          team: sre
          service: infrastructure
        annotations:
          summary: "Disk I/O utilization > 80%"
          description: 'Instance {{ $labels.instance }} device {{ $labels.device }} I/O busy {{ printf "%.1f" $value }}%'
      # Critical process missing (e.g., kubelet)
      - alert: NodeProcessDown
        expr: node_systemd_unit_state{name="kubelet.service",state="active"} != 1
        for: 1m
        labels:
          severity: critical
          team: sre
          service: kubernetes
        annotations:
          summary: "Critical process kubelet not running"
          description: "Instance {{ $labels.instance }} kubelet.service abnormal"
Key parameters:
for: 3m keeps the alert pending until the condition has held for 3 min, which filters transient spikes.
rate(...[5m]) smooths short‑term fluctuations.
humanizePercentage formats a 0 to 1 ratio as a percentage. The expressions above already multiply by 100, so the annotations print the value with printf instead; piping these values through humanizePercentage would show 85 as 8500%.
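To make the effect of the for clause concrete, here is a simplified model of how an alert moves between inactive, pending and firing. It is a sketch only: it assumes one sample per evaluation, a 30 s evaluation interval (so for: 3m corresponds to 6 intervals), and it ignores details of the real rule engine such as staleness handling. The function name alert_states is hypothetical.

```python
def alert_states(values, threshold, for_evals):
    """Return the alert state after each evaluation.

    values     -- one expression result per evaluation cycle
    threshold  -- the alert condition is value > threshold
    for_evals  -- 'for' duration divided by the evaluation interval
    """
    states = []
    consecutive = 0  # evaluations in a row for which the condition held
    for value in values:
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive == 0:
            states.append("inactive")
        elif consecutive > for_evals:
            # The condition has held for the full 'for' duration.
            states.append("firing")
        else:
            states.append("pending")
    return states


# A 2 min spike (4 evaluations) never fires with for: 3m.
spike = [50, 90, 95, 92, 91, 60, 55]
# A sustained breach fires once it has been pending for 6 intervals.
sustained = [50] + [90] * 8

print(alert_states(spike, 80, 6))
print(alert_states(sustained, 80, 6))
```

In this model the spike only ever reaches pending, while the sustained breach starts firing on the seventh consecutive evaluation above the threshold.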
Validate rule syntax:
promtool check rules /etc/prometheus/rules/node-alerts.yml
Hot‑reload without restart:
# systemd (RHEL/CentOS/Ubuntu); requires an ExecReload entry in the unit file
systemctl reload prometheus
# Or via HTTP API (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
Verify load:
# List loaded rule groups
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# List active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'
Step 4: Configure Alertmanager Routing and Receivers
Edit /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'YOUR_PASSWORD'
route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # P0 alerts: phone + WeChat
    - match:
        severity: critical
      receiver: 'oncall-phone'
      group_wait: 10s
      repeat_interval: 30m
    # DBA team alerts
    - match:
        team: dba
      receiver: 'dba-wechat'
    # Test environment: email only
    - match_re:
        env: 'test|dev'
      receiver: 'test-email'
      group_interval: 1h
receivers:
  - name: 'default-email'
    email_configs:
      - to: '[email protected]'
        headers:
          # severity is not in group_by, so read it from CommonLabels rather than GroupLabels
          Subject: '[{{ .CommonLabels.severity }}] {{ .GroupLabels.alertname }}'
  - name: 'oncall-phone'
    webhook_configs:
      - url: 'https://api.pagerduty.com/webhook/xxx'
        send_resolved: true
  - name: 'dba-wechat'
    wechat_configs:
      - corp_id: 'YOUR_CORP_ID'
        to_user: 'dba-team'
        agent_id: 'YOUR_AGENT_ID'
        api_secret: 'YOUR_SECRET'
  - name: 'test-email'
    email_configs:
      - to: '[email protected]'
# Inhibition rules (suppress pod alerts when node is down)
inhibit_rules:
  - source_match:
      severity: 'critical'
      alertname: 'NodeDown'
    target_match_re:
      alertname: 'Pod.*'
    equal: ['instance']
  - source_match:
      alertname: 'MySQLMasterDown'
      service: 'mysql-cluster'
    target_match:
      alertname: 'MySQLReplicationLag'
    equal: ['cluster']
  - source_match:
      alertname: 'NetworkUnreachable'
    target_match_re:
      alertname: '.*Timeout'
    equal: ['instance']
Key parameters:
group_by merges alerts that share the listed label values into one notification.
routes are evaluated top to bottom; the first matching route wins unless it sets continue: true.
inhibit_rules suppresses downstream alerts while a source alert is firing.
match/match_re and source_match/target_match still work but have been deprecated since Alertmanager 0.22 in favor of matchers, source_matchers and target_matchers.
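The route ordering above determines who gets paged, so it helps to reason about it explicitly. The following sketch models only the flat, one-level route list from this configuration: first match wins unless continue is set, regexes are fully anchored, and a missing label is treated as an empty string. It is an illustration of the matching logic, not Alertmanager's implementation; the names route_matches and receivers_for are hypothetical.

```python
import re

# The three child routes from the configuration above, in order.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "oncall-phone"},
    {"match": {"team": "dba"}, "receiver": "dba-wechat"},
    {"match_re": {"env": "test|dev"}, "receiver": "test-email"},
]
DEFAULT_RECEIVER = "default-email"


def route_matches(route, labels):
    """True if every equality and regex matcher of the route matches the labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # fullmatch models the anchoring of route regexes.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def receivers_for(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receivers an alert with these labels would be sent to."""
    selected = []
    for route in routes:
        if route_matches(route, labels):
            selected.append(route["receiver"])
            if not route.get("continue", False):
                break  # first match wins
    return selected or [default]


print(receivers_for({"severity": "critical", "team": "dba"}))
print(receivers_for({"severity": "warning", "team": "dba"}))
print(receivers_for({"severity": "warning", "team": "sre", "env": "test"}))
print(receivers_for({"severity": "warning", "team": "sre"}))
```

Note the first case: a critical DBA alert goes only to oncall-phone, because the critical route matches first and does not set continue. If the DBA team should also be notified, that route would need continue: true.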
Validate configuration:
amtool check-config /etc/alertmanager/alertmanager.yml
Hot‑reload:
systemctl reload alertmanager
# Or via HTTP API
curl -X POST http://localhost:9093/-/reload
Verify:
# List active silences
amtool silence query --alertmanager.url=http://localhost:9093
# Check routes
amtool config routes --alertmanager.url=http://localhost:9093
Step 5: Test Alert Triggering and Notification
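Manually generate high CPU load on a node. The rule averages over a 5 min rate window and then waits for: 3m, so a 5 min burst is too short to fire it; run the load for about 10 min, with one busy loop per core so that the instance‑wide average exceeds 80%:
# On the monitored node
for i in $(seq "$(nproc)"); do timeout 600 sh -c 'while true; do :; done' & done
Verify the alert. On an otherwise idle node it should go pending after roughly 4 min and fire about 3 min later:
# Check Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.labels.alertname=="NodeCPUHighLoad")'
# Check Alertmanager
amtool alert query --alertmanager.url=http://localhost:9093 | grep NodeCPUHighLoad
Verify notification log entry (successful deliveries may only be logged with --log.level=debug):
journalctl -u alertmanager -f | grep -E 'Notify|webhook|email'
Step 6: Create Scheduled Silences (Maintenance Windows)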
Example 1 – silence all alerts for node-01 for 2 h:
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--author="SRE-OnCall" \
--comment="Planned maintenance: replace memory" \
--duration=2h \
instance=~"node-01:.*"
Example 2 – silence all test‑environment alerts for 4 h:
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--author="Dev-Team" \
--comment="Load test: silence test env alerts" \
--duration=4h \
env="test"
Example 3 – silence a specific alert during a deployment:
amtool silence add \
--alertmanager.url=http://localhost:9093 \
--author="Deploy-Pipeline" \
--comment="Canary release: silence health‑check alert" \
--duration=30m \
alertname="ServiceHealthCheckFailed" \
service="user-service"
Verify active silences:
amtool silence query --alertmanager.url=http://localhost:9093
Step 7: Automate Silence Management via API
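The silences endpoint used in this step takes absolute startsAt and endsAt timestamps rather than a duration, so automation usually has to compute the window. The following is a minimal sketch of building such a request body in Python; the helper name build_silence is hypothetical, and the field names are the ones used in the JSON request in this step. Sending the request is left out so the example stays self-contained.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers, duration, created_by, comment, start=None):
    """Build a silence request body.

    matchers -- list of (label name, value, is_regex) tuples
    duration -- datetime.timedelta length of the window
    start    -- timezone-aware datetime; defaults to the current time
    """
    start = start or datetime.now(timezone.utc)
    end = start + duration
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # UTC timestamps, as in the request in this step
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": is_regex}
            for name, value, is_regex in matchers
        ],
        "startsAt": start.astimezone(timezone.utc).strftime(fmt),
        "endsAt": end.astimezone(timezone.utc).strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }


payload = build_silence(
    [("alertname", "NodeDown", False), ("instance", "node-02:9100", False)],
    timedelta(hours=2),
    created_by="automation-script",
    comment="Auto-created: periodic maintenance window",
    start=datetime(2025, 10, 31, 10, 0, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
```

Computing the end time from a duration avoids hand-edited timestamps, which are a common source of silences that have already expired or that start in the wrong timezone.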
Create silence (JSON API):
curl -X POST http://localhost:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{"name":"alertname","value":"NodeDown","isRegex":false},
{"name":"instance","value":"node-02:9100","isRegex":false}
],
"startsAt":"2025-10-31T10:00:00Z",
"endsAt":"2025-10-31T12:00:00Z",
"createdBy":"automation-script",
"comment":"Auto‑created: periodic maintenance window"
}'
Delete silence:
SILENCE_ID="9a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d"
curl -X DELETE "http://localhost:9093/api/v2/silence/$SILENCE_ID"
Step 8: Configure Alert Inhibition Rules (Avoid Cascading Alerts)
Edit the inhibit_rules section of alertmanager.yml as shown in Step 4. Key fields:
source_match: condition on the alert that triggers suppression
target_match_re: regex selecting the alerts to be suppressed
equal: labels that must have identical values on both alerts (e.g., instance, cluster)
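How these three fields interact can be sketched in a few lines of Python. This is a simplified model for reasoning about a single rule, not Alertmanager's implementation: it treats missing labels as empty strings, anchors regexes, and ignores the special case of an alert that matches both the source and the target side. The function name is_inhibited and the alert labels are hypothetical.

```python
import re


def _matches(labels, exact, regex):
    """True if all equality and regex matchers hold for the given labels."""
    return all(labels.get(k, "") == v for k, v in exact.items()) and all(
        re.fullmatch(p, labels.get(k, "")) is not None for k, p in regex.items()
    )


def is_inhibited(target, firing_alerts, rule):
    """True if `target` is suppressed by some firing alert under one inhibit rule."""
    if not _matches(target, rule.get("target_match", {}), rule.get("target_match_re", {})):
        return False
    for source in firing_alerts:
        if not _matches(source, rule.get("source_match", {}), rule.get("source_match_re", {})):
            continue
        # Suppress only when every label listed in `equal` has the same value.
        if all(source.get(l, "") == target.get(l, "") for l in rule.get("equal", [])):
            return True
    return False


# First inhibition rule from the Step 4 configuration.
rule = {
    "source_match": {"severity": "critical", "alertname": "NodeDown"},
    "target_match_re": {"alertname": "Pod.*"},
    "equal": ["instance"],
}
node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "node-01"}
pod_same_node = {"alertname": "PodCrashLooping", "instance": "node-01"}
pod_other_node = {"alertname": "PodCrashLooping", "instance": "node-02"}

print(is_inhibited(pod_same_node, [node_down], rule))
print(is_inhibited(pod_other_node, [node_down], rule))
```

The second case shows why the equal list matters: a pod alert is suppressed only if its instance label carries the same value as the NodeDown alert, so pod alerts whose instance label differs from the node's are still delivered.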
Step 9: Establish Alert‑Rule Review Process (Git + CI)
Version‑control rule files:
cd /etc/prometheus
git init
git add rules/*.yml prometheus.yml
git commit -m "Initial alert rules"
git remote add origin [email protected]:sre/prometheus-config.git
git push -u origin main
GitLab CI validation pipeline (.gitlab-ci.yml):
stages:
  - validate
validate-rules:
  stage: validate
  image:
    name: prom/prometheus:v2.45.0
    # The image's default entrypoint is the prometheus binary; clear it so GitLab can run the script
    entrypoint: [""]
  script:
    - promtool check rules rules/*.yml
    - promtool check config prometheus.yml
  only:
    - merge_requests
    - main
Pre‑commit hook to enforce syntax locally:
cat > /etc/prometheus/.git/hooks/pre-commit <<'EOF'
#!/bin/bash
promtool check rules /etc/prometheus/rules/*.yml || { echo "Rule syntax error, commit rejected"; exit 1; }
EOF
chmod +x /etc/prometheus/.git/hooks/pre-commit
Step 10: Persist Alert History and Audit (MySQL)
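Deploy an Alertmanager webhook logger container. Check the image's documentation to confirm that it supports a MySQL backend and the DATABASE_URL variable before relying on it: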
docker run -d \
--name alertmanager-logger \
-p 9099:9099 \
-e DATABASE_URL="mysql://alert_user:password@mysql-server:3306/alerts" \
tomtom/alertmanager-webhook-logger:latest
Add the webhook receiver to Alertmanager. Place the audit route first in routes so that continue: true lets every alert fall through to the regular routes afterwards:
receivers:
  - name: 'audit-logger'
    webhook_configs:
      - url: 'http://localhost:9099/webhook'
        send_resolved: true
route:
  routes:
    - receiver: 'audit-logger'
      continue: true
Example MySQL table schema:
CREATE TABLE alert_history (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    alert_name VARCHAR(128) NOT NULL,
    severity VARCHAR(32),
    instance VARCHAR(256),
    status VARCHAR(32),
    -- Explicitly nullable: ends_at is unknown while an alert is still firing
    starts_at TIMESTAMP NULL,
    ends_at TIMESTAMP NULL,
    labels JSON,
    annotations JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_alert_name (alert_name),
    INDEX idx_starts_at (starts_at)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Monitoring & Alerting Metrics
Self‑monitoring Prometheus metrics (desired thresholds):
# Rule evaluation latency (should be < 1 s)
prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 1
# Alertmanager notification failure rate (should be < 1 %)
rate(alertmanager_notifications_failed_total[5m]) / rate(alertmanager_notifications_total[5m]) > 0.01
# Active alert count (alert storm detection); each ALERTS series has the value 1, so count them
count(ALERTS{alertstate="firing"}) > 50
# Number of active silences (avoid over‑silencing)
alertmanager_silences{state="active"} > 20
Performance & Capacity
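Rule parsing benchmark (100 rules should check in < 1 s). promtool check rules only parses and validates the files; to see how long evaluation takes at runtime, watch prometheus_rule_group_last_duration_seconds instead:
time promtool check rules /etc/prometheus/rules/*.yml
# Expected: < 1 s for 100 rules
Alert throughput test (simulate 100 alerts):
for i in {1..100}; do
  amtool alert add \
    --alertmanager.url=http://localhost:9093 \
    --annotation=summary="Test alert $i" \
    alertname="LoadTest" instance="test-$i" &
done
wait
time amtool alert query --alertmanager.url=http://localhost:9093
Expected: a single Alertmanager instance should sustain on the order of 1000 alerts/sec with notification delay < 30 s. Treat this as a target to verify on your own hardware rather than a guarantee.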
Security & Compliance
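Access control – basic auth via Nginx reverse proxy. Bind Prometheus to 127.0.0.1 (--web.listen-address=127.0.0.1:9090) so that it is reachable only through the proxy:
server {
    # Example port; it must differ from the address Prometheus itself binds to
    listen 19090;
    location / {
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://127.0.0.1:9090;
    }
}
Generate the password file with htpasswd.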
Webhook security – bearer token and TLS verification:
receivers:
  - name: 'secure-webhook'
    webhook_configs:
      - url: 'https://receiver.example.com/webhook'
        http_config:
          bearer_token: 'YOUR_SECRET_TOKEN'
          tls_config:
            insecure_skip_verify: false
Audit logging – JSON logs via systemd:
# /etc/systemd/system/alertmanager.service
[Service]
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --log.level=info \
  --log.format=json
StandardOutput=append:/var/log/alertmanager/audit.log
# Alertmanager writes its log to stderr, so capture that stream as well
StandardError=append:/var/log/alertmanager/audit.log
Query the audit log with jq for silence creation or notification failures.
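The same filtering can be scripted when jq is not available. The sketch below assumes each log line is a JSON object with level and msg keys; exact key names and message texts vary between Alertmanager versions, so the sample lines are purely illustrative and the helper name select_events is hypothetical.

```python
import json


def select_events(lines, keywords=("silence", "notify"), levels=("warn", "error")):
    """Return parsed log entries that mention a keyword or have a warning/error level."""
    events = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        msg = str(entry.get("msg", "")).lower()
        level = str(entry.get("level", "")).lower()
        if level in levels or any(word in msg for word in keywords):
            events.append(entry)
    return events


# Illustrative lines only; real log messages will differ.
sample = [
    '{"ts":"2025-10-31T10:00:00Z","level":"info","msg":"Loading configuration file"}',
    '{"ts":"2025-10-31T10:05:00Z","level":"error","msg":"Notify for alerts failed","receiver":"oncall-phone"}',
    "not a json line",
]

for event in select_events(sample):
    print(event["ts"], event["level"], event["msg"])
```

In practice the lines would come from the log file, for example by iterating over open("/var/log/alertmanager/audit.log").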
Common Failures & Troubleshooting
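Each entry lists the symptom, a diagnostic command, the possible root cause, a quick fix, and a permanent fix.

Symptom: Alert rule not firing
Diagnostic command: promtool check rules /etc/prometheus/rules/*.yml
Possible root cause: Syntax error or rule not loaded
Quick fix: Fix syntax and reload Prometheus
Permanent fix: CI validation + pre‑commit hook

Symptom: Notification not sent
Diagnostic command: journalctl -u alertmanager -f
Possible root cause: Routing mismatch or receiver config error
Quick fix: Check label matching
Permanent fix: Test receiver connectivity and credentials

Symptom: Duplicate notifications
Diagnostic command: amtool alert query
Possible root cause: repeat_interval too short
Quick fix: Increase repeat_interval to 4 h
Permanent fix: Set per‑severity repeat intervals

Symptom: Silence ineffective
Diagnostic command: amtool silence query
Possible root cause: Label mismatch or wrong time window
Quick fix: Use a regex matcher (=~) and verify the window in UTC
Permanent fix: Dry‑run silences before applying

Symptom: Alertmanager OOM
Diagnostic command: ps aux | grep alertmanager
Possible root cause: Too many active alerts/history
Quick fix: Increase memory limit, clean old silences
Permanent fix: Enable aggregation, persist to external store

Symptom: Alert storm
Diagnostic command: amtool alert | wc -l
Possible root cause: Threshold too low or exporter issue
Quick fix: Temporarily disable rule or create global silence
Permanent fix: Tune thresholds, add for‑duration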
Change & Rollback Playbook
Maintenance Window
Recommended time: 02:00‑04:00 (low traffic). Preconditions: test in staging, backup configs, create global silence, notify on‑call.
Canary Deployment
Deploy new rules to a single Prometheus instance, observe 30 min, then roll out to the rest via Ansible.
Rollback Conditions
New rule causes > 20 % false‑positive rate
Alert storm (> 100 alerts in 5 min)
Notification failure rate > 10 %
Rollback steps: restore backup rule files, reload Prometheus, restore Alertmanager config, reload, verify with promtool check rules and amtool config routes.
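The rollback conditions above are simple thresholds, so they can be encoded as a check that a canary script runs after the observation period. This is a minimal sketch: how the three numbers are measured is left open, and the names RolloutStats and rollback_reasons are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RolloutStats:
    false_positive_rate: float        # fraction of new-rule alerts judged false positives
    alerts_last_5m: int               # alerts received in the last 5 minutes
    notification_failure_rate: float  # failed notifications / total notifications


def rollback_reasons(stats):
    """Return the rollback conditions that are met; an empty list means keep going."""
    reasons = []
    if stats.false_positive_rate > 0.20:
        reasons.append("false-positive rate above 20 %")
    if stats.alerts_last_5m > 100:
        reasons.append("alert storm: more than 100 alerts in 5 min")
    if stats.notification_failure_rate > 0.10:
        reasons.append("notification failure rate above 10 %")
    return reasons


healthy = RolloutStats(false_positive_rate=0.05, alerts_last_5m=12, notification_failure_rate=0.0)
noisy = RolloutStats(false_positive_rate=0.35, alerts_last_5m=140, notification_failure_rate=0.02)

print(rollback_reasons(healthy))
print(rollback_reasons(noisy))
```

Returning the list of reasons rather than a bare boolean makes it easy to include the trigger in the rollback notification.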
Best Practices
Enforce alert tiering – every rule must have severity/team/service tags.
Set for ≥ 3 min to filter noise.
Group alerts by alertname and cluster to reduce notification volume.
Require --author and --comment for all silences; audit monthly.
Inhibition priority: node > network > application.
Link runbooks in annotations.runbook for P0/P1 alerts.
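Control alert fatigue – keep repeat_interval ≥ 4 h by default and ≥ 12 h for P2; shorten it (30 m in the Step 4 routing example) only for critical pages that must not be missed.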
Isolate test environment alerts with env label.
Quarterly prune rules not triggered for 90 days.
Aim for notification count ≤ 20 % of raw alerts via aggregation and inhibition.
Appendix – Production Configuration Sample
Full production rule file ( /etc/prometheus/rules/production.yml) and Ansible task snippet are provided in the original source (omitted here for brevity).
MaGe Linux Operations
