How Prometheus Recording Rules Can Reduce Alert Noise by 70%
This guide shows how Prometheus Recording Rules pre‑compute, aggregate, and smooth metrics in large‑scale microservice environments, and how hierarchical alert design built on them can cut daily alert noise by up to 70%. It includes practical examples and best‑practice recommendations.
Overview
In large‑scale microservice environments, Prometheus can generate >2000 alerts per day, >70% of which are duplicate or transient. Recording Rules pre‑compute queries and enable aggregation, deduplication and smoothing, reducing alert noise dramatically.
Technical Features
Pre‑computation: store results of complex queries as new series.
Time‑window smoothing: avg_over_time and max_over_time eliminate transient spikes.
Multi‑dimensional aggregation: group by namespace, service, pod, etc.
Hierarchical alerts: build metric layers from instance to cluster.
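To see why time‑window smoothing suppresses noise, here is a minimal Python sketch (not Prometheus code) of the trailing-window behaviour of avg_over_time and max_over_time. The window is counted in samples rather than seconds, which is a simplification:

```python
from collections import deque

def avg_over_time(samples, window):
    """Trailing average over the last `window` samples (one per scrape)."""
    out, buf = [], deque(maxlen=window)
    for s in samples:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out

def max_over_time(samples, window):
    """Trailing maximum over the last `window` samples."""
    out, buf = [], deque(maxlen=window)
    for s in samples:
        buf.append(s)
        out.append(max(buf))
    return out

# A single-scrape CPU spike to 0.95 among steady 0.30 readings:
raw = [0.30, 0.30, 0.95, 0.30, 0.30]
smoothed = avg_over_time(raw, 5)   # last value is 0.43 — well under 0.85
peaks = max_over_time(raw, 5)      # the spike stays visible for the window
```

The averaged series never crosses a 0.85 threshold, so the spike alone cannot fire an alert, while max_over_time keeps sustained peaks visible.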
Use Cases
Kubernetes clusters with >1000 pods.
Microservice architectures needing service‑level health.
Traffic spikes during peak periods.
Multi‑tenant environments where a single fault can generate hundreds of alerts.
Detailed Steps
Analyze Current Alerts
Run PromQL queries to count firing alerts over the past week and identify high‑frequency rules.
# View Prometheus config
kubectl get configmap prometheus-server -n monitoring -o yaml
# Count alerts in the last 7 days
count_over_time(ALERTS{alertstate="firing"}[7d])
# Current active alerts
count(ALERTS{alertstate="firing"})
Typical noise sources include CPU/memory spikes, pod restarts, network latency and disk I/O bursts.
Core Configuration
Recording Rules File Structure
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: node_recording_rules
    interval: 30s
    rules:
      - record: instance:node_cpu_utilization:avg5m
        expr: 1 - avg by (job, instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:node_cpu_utilization:max15m_avg5m
        expr: max_over_time(instance:node_cpu_utilization:avg5m[15m])
      - record: cluster:node_cpu_utilization:avg
        expr: avg(instance:node_cpu_utilization:avg5m)
The interval controls evaluation frequency; a value of half the alert evaluation interval is recommended.
CPU Usage Smoothing Example
# Original noisy rule
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80
  for: 1m
# Optimised rules using Recording Rules
- record: instance:node_cpu_utilization:avg5m
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
- record: instance:node_cpu_utilization:max15m_avg5m
  expr: max_over_time(instance:node_cpu_utilization:avg5m[15m])
Alert Rules Based on Pre‑computed Metrics
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: infrastructure_alerts
    rules:
      - alert: NodeHighCPU
        expr: instance:node_cpu_utilization:max15m_avg5m > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} CPU sustained high load"
          description: "CPU utilization exceeds 85% over the last 15 minutes (current {{ $value | humanizePercentage }})"
      - alert: NodeHighMemory
        expr: instance:node_memory_utilization:avg5m > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} memory usage high"
          description: "Memory utilization continuously above 90%"
Real‑world Cases
Case 1 – E‑commerce Traffic Spike
During a Double‑11 sale, the request rate rose from 1k QPS to 50k QPS. The original error‑rate alert fired more than 150 times per hour even though the error rate held at 0.5%. By defining a 1‑hour baseline and alerting only when the current rate exceeds three times that baseline, alert volume dropped to 5–10 per hour and the false‑positive rate fell from 85% to 15%.
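The ratio-to-baseline condition is easy to model outside PromQL. This minimal Python sketch (function names such as should_alert are illustrative, not part of Prometheus) shows the logic, including a guard against a near-zero baseline that would otherwise inflate the ratio:

```python
def ratio_to_baseline(current_rate, baseline_rate, min_baseline=1e-4):
    # Clamp the baseline so a nearly error-free quiet period does not
    # turn tiny absolute changes into enormous ratios.
    return current_rate / max(baseline_rate, min_baseline)

def should_alert(current_rate, baseline_rate, factor=3.0):
    """Fire only when the current error rate exceeds factor x baseline."""
    return ratio_to_baseline(current_rate, baseline_rate) > factor

# During the sale: traffic grows 50x but the error rate stays at 0.5%.
baseline = 0.005
assert not should_alert(0.005, baseline)   # same rate as baseline: no alert
assert should_alert(0.020, baseline)       # 4x the baseline: fires
```

The same guard is worth adding on the PromQL side (for example with clamp_min on the baseline series), so quiet services do not produce huge ratios.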
# Recording Rules for baseline
- record: service:http_error_rate:baseline1h
  expr: avg_over_time((sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])))[1h:])
- record: service:http_error_rate:ratio_to_baseline
  expr: (sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m]))) / service:http_error_rate:baseline1h
- alert: ServiceErrorRateSpike
  expr: service:http_error_rate:ratio_to_baseline > 3
  for: 5m
  annotations:
    summary: "Service {{ $labels.service }} error‑rate spike"
    description: "Current error‑rate is {{ $value }} times the baseline"
Case 2 – Multi‑tenant Alert Aggregation
In a SaaS platform with 200+ tenants a single node failure generated >200 pod‑unavailable alerts. Tenant‑level aggregation metrics and Alertmanager group_by / inhibit_rules reduced the notification to a single tenant‑wide alert.
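The per-tenant aggregation itself is simple to model; in this Python sketch the (tenant, phase) tuples stand in for kube_pod_status_phase series, and the function mirrors what the tenant:pods_unavailable:count rule computes:

```python
from collections import Counter

def tenant_unavailable_counts(pods):
    """pods: iterable of (tenant, phase) pairs, one per pod.
    Count pods whose phase is neither Running nor Succeeded, per tenant."""
    healthy = {"Running", "Succeeded"}
    return Counter(tenant for tenant, phase in pods if phase not in healthy)

pods = [("acme", "Running"), ("acme", "Pending"), ("acme", "Failed"),
        ("globex", "Running")]
counts = tenant_unavailable_counts(pods)
# Three acme pods collapse into one aggregate: counts == {"acme": 2},
# so one tenant-wide alert can replace one alert per pod.
```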
# Tenant aggregation Recording Rules
- record: tenant:pods_unavailable:count
  expr: count by (tenant) (kube_pod_status_phase{phase!~"Running|Succeeded"} == 1)
- record: tenant:service_availability:ratio
  expr: sum by (tenant) (kube_deployment_status_replicas_available) / sum by (tenant) (kube_deployment_spec_replicas)
Best Practices and Caveats
Naming Conventions
Use level:metric_name:aggregation_window format, e.g. instance:node_cpu_utilization:avg5m. Avoid ambiguous names.
Layered Design
Separate rules into three layers: raw metric normalisation, infrastructure aggregation, and business‑level metrics. This improves maintainability and performance.
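As a sketch of that layering (group names, intervals, and the business metric service:checkout_availability:ratio are illustrative, not from a real deployment):

```yaml
groups:
  - name: layer1_normalization        # raw metrics -> normalized ratios
    interval: 15s
    rules:
      - record: instance:node_cpu_utilization:avg5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  - name: layer2_infrastructure       # instance -> cluster aggregation
    interval: 30s
    rules:
      - record: cluster:node_cpu_utilization:avg
        expr: avg(instance:node_cpu_utilization:avg5m)
  - name: layer3_business             # business health built on lower layers
    interval: 60s
    rules:
      - record: service:checkout_availability:ratio
        expr: sum by (service) (rate(http_requests_total{service="checkout", status!~"5.."}[5m])) / sum by (service) (rate(http_requests_total{service="checkout"}[5m]))
```

Each layer only reads series produced by the layer below it, so a change to raw-metric normalisation does not ripple into business rules.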
Performance Optimisation
Limit the number of Recording Rules; each rule creates a new series.
Set appropriate interval values (30 s for infra, 15 s for apps, 60 s for business).
Use topk (or similar rank‑based selection) to bound cardinality.
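A sketch of bounding cardinality with topk (metric and record names are illustrative):

```yaml
# Illustrative: record only the 20 busiest (service, endpoint) pairs
# instead of every combination. Note that topk membership can change
# between evaluations, so the recorded series may come and go.
- record: service:http_requests:top20_rate5m
  expr: topk(20, sum by (service, endpoint) (rate(http_requests_total[5m])))
```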
High‑Availability
Deploy identical Recording Rules on all Prometheus replicas. Combine with Alertmanager group_by and inhibit_rules for deduplication.
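One possible Alertmanager sketch for that deduplication (the routing values and matcher labels are assumptions; adapt them to your own label set):

```yaml
# alertmanager.yml (excerpt)
route:
  group_by: ['tenant', 'alertname']   # one notification per tenant/alert pair
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
inhibit_rules:
  # A firing critical alert suppresses warnings for the same tenant,
  # so a cluster-level incident does not also page for every instance.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['tenant']
```

Because both replicas evaluate identical rules, Alertmanager sees duplicate alerts with identical label sets and collapses them during grouping.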
Troubleshooting and Monitoring
Common Issues
Rule shows “health: unknown” – check syntax with promtool check rules and verify source metrics exist.
No data – ensure the interval is short enough and the underlying metric is being scraped.
Evaluation latency – split large rule groups, increase interval, or optimise expressions.
Monitoring Recording Rules
# Evaluation duration (99th percentile)
histogram_quantile(0.99, rate(prometheus_rule_evaluation_duration_seconds_bucket[5m]))
# Evaluation failures
rate(prometheus_rule_evaluation_failures_total[5m])
# TSDB series count
prometheus_tsdb_head_series
# Process memory usage
process_resident_memory_bytes{job="prometheus"}
Backup and Recovery
Configuration Backup Script
#!/bin/bash
BACKUP_DIR="/backup/prometheus/rules"
PROMETHEUS_RULES_DIR="/etc/prometheus/rules"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "${BACKUP_DIR}"
# Archive all rule files
tar -czvf "${BACKUP_DIR}/rules_${DATE}.tar.gz" -C "${PROMETHEUS_RULES_DIR}" .
# Keep last 30 days
find "${BACKUP_DIR}" -name "rules_*.tar.gz" -mtime +30 -delete
# Verify backup
tar -tzf "${BACKUP_DIR}/rules_${DATE}.tar.gz"
Recovery Procedure
Stop Prometheus: systemctl stop prometheus
Restore the archive: tar -xzvf /backup/prometheus/rules/rules_*.tar.gz -C /etc/prometheus/rules/
Validate syntax: promtool check rules /etc/prometheus/rules/*.yml
Start Prometheus: systemctl start prometheus
Conclusion
Key Takeaways
Recording Rules provide pre‑computation, aggregation and smoothing to dramatically cut alert noise.
Consistent naming (level:metric:window) simplifies maintenance.
Layered rule design scales from raw metrics to business‑level health scores.
Continuous monitoring of rule evaluation latency and series cardinality is essential for performance.
Further Learning
Dynamic alert baselines and machine‑learning‑based anomaly detection.
Large‑scale Prometheus architectures (Thanos, Cortex, federation).
Advanced PromQL techniques (subqueries, offsets, complex aggregations).
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.