How to Build Precise Alerting with Prometheus to Eliminate Alert Storms
This article explains how to use Prometheus to create a precise, end‑to‑end alerting system that shortens detection and diagnosis time, integrates logs and metrics, routes alerts to the right owners, and prevents overwhelming alert storms in production environments.
Intro
The dream of a bug‑free, smooth‑running project is shared by every developer. This article explores how to detect and fix bugs before customers notice them by implementing a precise alerting system based on Prometheus, which integrates a log platform, metric system, and alert system with targeted notifications.
Current State & Problem Location
Many projects suffer from alert storms: undifferentiated alarm messages flood teams, making it hard to pinpoint the root cause. Without a traceId or contextual data, the time spent on analysis grows proportionally to problem complexity, delaying remediation and harming user experience.
Analysis & Solution
Effective troubleshooting requires a complete set of metrics—service latency, error codes, CPU, memory, GC status, custom business indicators, etc. The metric system extracts lightweight data for storage, calculation, and visualization. We chose the open‑source Prometheus, which provides both metric collection and alerting, supports pull and push modes, and offers remote‑storage integrations such as ClickHouse, InfluxDB, PostgreSQL/TimescaleDB, and others.
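For example, with a standard request-duration histogram (the metric name http_request_duration_seconds below is an assumed naming convention, not something from this project), per-service p99 latency can be computed in PromQL:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))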
Prometheus's Alertmanager can route alerts to Slack, DingTalk, email, or a webhook, and includes silencing and notification-frequency controls to avoid noise. The alert flow is:
Log platform collects logs and exposes a metric‑pull endpoint.
Prometheus scrapes metrics from the endpoint.
Prometheus evaluates alert rules; when a rule matches, an alert is generated.
Alertmanager sends the alert to the log platform via a webhook (see the example payload after this list).
The log platform enriches the alert with module, metric name, traceId, etc., and forwards it to the responsible owner or team.
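For reference, the webhook payload Alertmanager POSTs in step 4 follows its documented webhook format (version 4). A trimmed, illustrative sketch of what the log platform receives, with values matching the Kafka rule configured later:
{
  "version": "4",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": { "alertname": "kafkaDelay" },
  "alerts": [
    {
      "status": "firing",
      "labels": { "alertname": "kafkaDelay", "metric_name": "kafka", "module": "modulename" },
      "annotations": { "summary": "Kafka message backlog", "description": "3 times" },
      "startsAt": "2024-01-01T08:00:00Z"
    }
  ]
}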
Practice
Implementation steps:
Deploy a log platform that collects application logs and provides a metric‑pull API.
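The pull endpoint exposes data in the Prometheus text exposition format. A minimal sketch of a response, using the kafka_log series that the alert rule below relies on (the metric type and value here are illustrative):
# HELP kafka_log Kafka-related error log lines observed per application
# TYPE kafka_log gauge
kafka_log{appname="appname"} 3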
Configure Prometheus (prometheus.yml) with scrape job definitions, e.g.:
scrape_configs:
  - job_name: 'my_service'
    metrics_path: '/metrics'
    scrape_interval: 1800s    # pull every 30 minutes, as given; most setups scrape far more often (15s-60s)
    static_configs:
      - targets: ['localhost:9527']
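Once Prometheus has loaded this job, scrape health can be verified in the expression browser with the built-in up series, which is 1 when the last scrape of a target succeeded:
up{job="my_service"} == 1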
Set up the Alertmanager connection and register the rule files, also in prometheus.yml:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - "rules.yml"
  - "kafka_rules.yml"
Create the rule files, for example:
groups:
  - name: kafkaAlert
    rules:
      - alert: kafkaDelay
        # fires when at least one kafka_log sample appeared in the last minute
        expr: count_over_time(kafka_log{appname='appname'}[1m]) > 0
        labels:
          metric_name: kafka
          module: modulename
          metric_expr: kafka_log{appname='appname'}[1m]
        annotations:
          summary: "Kafka message backlog"
          description: "{{$value}} times"
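Before reloading Prometheus, rule and configuration syntax can be validated with promtool, the CLI that ships with Prometheus:
promtool check rules rules.yml kafka_rules.yml
promtool check config prometheus.yml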
Enable remote write for long-term storage. Prometheus only speaks its own remote-read/remote-write protocol, so these URLs point at a storage adapter (listening on port 9201 here) that translates to ClickHouse:
remote_write:
  - url: "http://localhost:9201/write"
remote_read:
  - url: "http://localhost:9201/read"
Configure alert routing in Alertmanager's own configuration (alertmanager.yml):
route:
  group_by: ['alertname']
  group_wait: 10s       # how long to wait before sending the first notification for a new group
  group_interval: 10s   # minimum wait between notification batches for the same group
  repeat_interval: 1m   # how often a still-firing alert is re-sent
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'YOUR_WEBHOOK_ENDPOINT'
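One caution: the repeat_interval of 1m above re-sends unresolved alerts every minute, whereas Alertmanager's default is 4h, so production values are usually much longer. For a known, in-progress issue, alerts can also be silenced temporarily through Alertmanager's UI or the bundled amtool CLI; the label matcher and duration here are illustrative:
amtool silence add alertname="kafkaDelay" --duration="2h" --comment="backlog fix in progress" --alertmanager.url="http://localhost:9093"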
Conclusion
Using Prometheus for precise monitoring and alerting effectively prevents alert storms, speeds up detection and resolution of production issues, and reduces the difficulty of root-cause analysis for developers. Flexible message routing ensures the right owners are notified promptly, while built-in metric collection eliminates redundant work. The main drawback is the relatively complex configuration.