How to Build Precise Alerting with Prometheus to Eliminate Alert Storms
This article explains how to use Prometheus to create a precise, end‑to‑end alerting system that shortens detection and diagnosis time, integrates logs and metrics, routes alerts to the right owners, and prevents overwhelming alert storms in production environments.
Intro
The dream of a bug‑free, smooth‑running project is shared by every developer. This article explores how to detect and fix bugs before customers notice them by implementing a precise alerting system based on Prometheus, which integrates a log platform, metric system, and alert system with targeted notifications.
Current State & Problem Location
Many projects suffer from alert storms: undifferentiated alarm messages flood teams, making it hard to pinpoint the root cause. Without a traceId or contextual data, the time spent on analysis grows proportionally to problem complexity, delaying remediation and harming user experience.
Analysis & Solution
Effective troubleshooting requires a complete set of metrics—service latency, error codes, CPU, memory, GC status, custom business indicators, etc. The metric system extracts lightweight data for storage, calculation, and visualization. We chose the open‑source Prometheus, which provides both metric collection and alerting, supports pull and push modes, and offers remote‑storage integrations such as ClickHouse, InfluxDB, PostgreSQL/TimescaleDB, and others.
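For example, with a standard request-duration histogram (the metric name http_request_duration_seconds below is an assumed naming convention, not something from this project), per-service p99 latency can be computed in PromQL:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))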
Prometheus's Alertmanager can route alerts to Slack, DingTalk, email, or a webhook, and includes silencing and notification-frequency controls to avoid noise. The alert flow is:
Log platform collects logs and exposes a metric‑pull endpoint.
Prometheus scrapes metrics from the endpoint.
Prometheus evaluates alert rules; when a rule matches, an alert is generated.
Alertmanager sends the alert to the log platform via a webhook (see the example payload after this list).
The log platform enriches the alert with module, metric name, traceId, etc., and forwards it to the responsible owner or team.
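For reference, the webhook payload Alertmanager POSTs in step 4 follows its documented webhook format (version 4). A trimmed, illustrative sketch of what the log platform receives, with values matching the Kafka rule configured later:
{
  "version": "4",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": { "alertname": "kafkaDelay" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "kafkaDelay", "metric_name": "kafka", "module": "modulename" },
      "annotations": { "summary": "Kafka message backlog", "description": "3 times" },
      "startsAt": "2024-01-01T08:00:00Z"
    }
  ]
}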
Practice
Implementation steps:
Deploy a log platform that collects application logs and provides a metric‑pull API.
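The pull endpoint exposes data in the Prometheus text exposition format. A minimal sketch of a response, using the kafka_log series that the alert rule below relies on (the metric type and value here are illustrative):
# HELP kafka_log Kafka-related error log lines observed per application
# TYPE kafka_log gauge
kafka_log{appname="appname"} 3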
Configure Prometheus (prometheus.yml) with scrape job definitions, e.g.:
scrape_configs:
  - job_name: 'my_service'
    metrics_path: '/metrics'
    scrape_interval: 1800s    # pull every 30 minutes, as given; most setups scrape far more often (15s-60s)
    static_configs:
      - targets: ['localhost:9527']
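Once Prometheus has loaded this job, scrape health can be verified in the expression browser with the built-in up series, which is 1 when the last scrape of a target succeeded:
up{job="my_service"} == 1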
Set up the Alertmanager connection and register the rule files, also in prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - "rules.yml"
  - "kafka_rules.yml"
Create the rule files, for example:
groups:
  - name: kafkaAlert
    rules:
      - alert: kafkaDelay
        # fires when at least one kafka_log sample appeared in the last minute
        expr: count_over_time(kafka_log{appname='appname'}[1m]) > 0
        labels:
          metric_name: kafka
          module: modulename
          metric_expr: kafka_log{appname='appname'}[1m]
        annotations:
          summary: "Kafka message backlog"
          description: "{{$value}} times"
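Before reloading Prometheus, rule and configuration syntax can be validated with promtool, the CLI that ships with Prometheus:
promtool check rules rules.yml kafka_rules.yml
promtool check config prometheus.yml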
Enable remote write for long-term storage. Prometheus only speaks its own remote-read/remote-write protocol, so these URLs point at a storage adapter (listening on port 9201 here) that translates to ClickHouse:
remote_write:
  - url: "http://localhost:9201/write"
remote_read:
  - url: "http://localhost:9201/read"
Configure alert routing in Alertmanager's own configuration (alertmanager.yml):
route:
  group_by: ['alertname']
  group_wait: 10s       # how long to wait before sending the first notification for a new group
  group_interval: 10s   # minimum wait between notification batches for the same group
  repeat_interval: 1m   # how often a still-firing alert is re-sent
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'YOUR_WEBHOOK_ENDPOINT'
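One caution: the repeat_interval of 1m above re-sends unresolved alerts every minute, whereas Alertmanager's default is 4h, so production values are usually much longer. For a known, in-progress issue, alerts can also be silenced temporarily through Alertmanager's UI or the bundled amtool CLI; the label matcher and duration here are illustrative:
amtool silence add alertname="kafkaDelay" --duration="2h" --comment="backlog fix in progress" --alertmanager.url="http://localhost:9093"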
Conclusion
Using Prometheus for precise monitoring and alerting effectively prevents alert storms, speeds up detection and resolution of production issues, and reduces the difficulty of root-cause analysis for developers. Flexible message routing ensures the right owners are notified promptly, while built-in metric collection eliminates redundant work. The main drawback is the relatively complex configuration.