Design and Implementation of an Integrated Alert Management System Based on Alertmanager
This article describes how ZhaiZhai built an integrated monitoring and alerting system using Prometheus and Alertmanager, defines label conventions, provides a Java SDK for sending alerts, and explains strategies for alert deduplication, grouping, severity levels, suppression, multi-channel notifications, silencing, and historical record keeping.
1 Background
ZhaiZhai deployed a unified monitoring system on Prometheus and developed a custom alerting system, but developers received too many alerts daily, causing important alerts to be missed or muted. Excessive alerts are as good as no alerts, and correlated alerts (e.g., a SQL error triggering a flood of error logs) make root-cause analysis difficult. To address this, an Alertmanager-based "ZhaiZhai Alert Center" was created.
2 Specification and SDK
2.1 Sending alerts – Alertmanager exposes an HTTP API (described by an OpenAPI specification) for pushing alerts: labels identify and deduplicate an alert, while annotations hold mutable descriptive data. startsAt and endsAt mark the alert's start and end times.
[
  {
    "labels": {
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>"
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
To guarantee high availability, the sender must deliver each alert to **all** Alertmanager instances in the cluster directly (never through a load balancer); the cluster itself deduplicates the resulting notifications.
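The flow above can be sketched in Java: build the JSON body once, then POST it to every Alertmanager instance. This is an illustrative helper under stated assumptions (the instance list, class, and method names are not part of any SDK), not the article's actual implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AlertSender {
    // Build the JSON array accepted by Alertmanager's POST /api/v2/alerts.
    static String buildPayload(Map<String, String> labels, Map<String, String> annotations,
                               String startsAt, String endsAt) {
        return "[{\"labels\":" + toJsonObject(labels)
                + ",\"annotations\":" + toJsonObject(annotations)
                + ",\"startsAt\":\"" + startsAt + "\""
                + ",\"endsAt\":\"" + endsAt + "\"}]";
    }

    // Naive JSON serialization; assumes keys/values need no escaping.
    static String toJsonObject(Map<String, String> m) {
        return m.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":\"" + e.getValue() + "\"")
                .collect(Collectors.joining(",", "{", "}"));
    }

    // Fan the same payload out to every instance -- no load balancer in between.
    static void sendToAll(List<String> instances, String payload) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (String base : instances) {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create(base + "/api/v2/alerts"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            client.send(req, HttpResponse.BodyHandlers.discarding());
        }
    }
}
```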
2.2 Common labels – The following label conventions are recommended:
ENV – environment (prod, test, etc.)
APP – service name
SOURCE – origin of the alert (log, JVM, etc.)
NAME – alert name
LEVEL – severity (P0‑P5)
INSTANCE – IP of the affected instance
RECEIVER_TYPE – type of receiver (email, WeChat, etc.)
RECEIVER – actual receiver identifier (e.g., email address)
Label names must match the regex [a-zA-Z_][a-zA-Z0-9_]* and are usually mapped to Chinese for display.
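A sender can enforce the naming rule before submitting an alert. The sketch below is a standalone check (class and method names are illustrative, not part of the SDK):

```java
import java.util.regex.Pattern;

public class LabelNames {
    // The label-name rule quoted above: [a-zA-Z_][a-zA-Z0-9_]*
    private static final Pattern VALID = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // True only if the whole name matches the Alertmanager label-name grammar.
    static boolean isValidLabelName(String name) {
        return name != null && VALID.matcher(name).matches();
    }
}
```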
2.3 SDK – A Java SDK simplifies alert creation. The builder automatically adds service name, environment, IP, and severity, placing most fields in labels and the alert value in annotations.
AlertManager alertManager = AlertManager.builder()
        // Alert name (required)
        .name("AlertDemo")
        // Optional custom labels
        .label("label1", "value1")
        .label("label2", "value2")
        // Optional alert value
        .value("123")
        // Send to WeChat users
        .wechat("zhangsan", "lisi")
        .build();
// Synchronous send
alertManager.send();
// Asynchronous send (default thread pool)
alertManager.sendAsync();
// Send via a caller-managed thread pool
ThreadPoolExecutor executor; // initialize with your own pool configuration
executor.execute(alertManager);

3 Alert Noise Reduction
3.1 Grouping and deduplication – Alerts sharing the same label set can be grouped into a single notification, reducing noise when many instances generate identical alerts (e.g., network failure across a cluster).
group_wait – initial wait before the first notification of a group.
group_interval – wait time before sending a new notification after new alerts arrive or alerts recover.
repeat_interval – wait time before repeating a notification when no alerts have changed.
Figure (not shown) illustrates the lifecycle from alert generation to grouped notification.
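In Alertmanager's own configuration these three knobs live on the route. The values below are illustrative defaults, not ZhaiZhai's production settings:

```yaml
route:
  # Alerts with the same values for these labels are batched into one notification.
  group_by: ['_APP', '_ENV', '_NAME']
  # Wait before sending the first notification for a new group.
  group_wait: 30s
  # Wait before notifying about new or recovered alerts in an existing group.
  group_interval: 5m
  # Wait before re-sending a notification when nothing has changed.
  repeat_interval: 4h
```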
3.2 Recovery notifications – If no alert is received within 5 minutes, Alertmanager treats the alert as recovered and sends a recovery notification after the configured group_interval. The SDK also allows explicit startsAt and endsAt values:
// Alert start time
Date startTime = new Date();
// Alert end time (5 minutes after start)
Date endTime = new Date(startTime.getTime() + 5 * 60 * 1000);
AlertManager.builder().startsAt(startTime).endsAt(endTime).build();
If endTime is earlier than the current time, the alert is considered already recovered and no notification is sent.
3.3 Severity levels – Six levels (P0–P5) map to service importance grades (A–E). Example:
AlertManager.builder().level(AlertManager.Level.P0)...
Higher-severity alerts get a shorter group_wait and finer-grained grouping. For instance, P0 alerts wait only 15 s before the first notification, while P4/P5 alerts wait 3 min.
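The tiering can be realized with sub-routes in Alertmanager's route tree. This is a sketch: the matcher names follow the label convention above, the timings come from the text, and the group_by choices are assumptions:

```yaml
route:
  routes:
    # P0: near-immediate, finer-grained grouping (per instance)
    - matchers: ['_LEVEL="P0"']
      group_by: ['_APP', '_ENV', '_NAME', '_INSTANCE']
      group_wait: 15s
    # P4/P5: coarser grouping, longer initial wait
    - matchers: ['_LEVEL=~"P4|P5"']
      group_by: ['_APP', '_ENV', '_NAME']
      group_wait: 3m
```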
3.4 Notification merging – Alerts with identical labels except for one differing value are considered similar and merged into a single webhook payload. The similarity rule is: all label key‑value pairs match except one value.
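The similarity rule can be expressed as a small predicate (an illustrative helper, not SDK code): two label sets are similar when they carry identical keys and differ in at most one value.

```java
import java.util.Map;

public class AlertMerging {
    // True when both alerts have the same label keys and at most one value differs,
    // i.e. they are close enough to merge into a single webhook payload.
    static boolean similar(Map<String, String> a, Map<String, String> b) {
        if (!a.keySet().equals(b.keySet())) {
            return false;
        }
        long differing = a.keySet().stream()
                .filter(k -> !a.get(k).equals(b.get(k)))
                .count();
        return differing <= 1;
    }
}
```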
3.5 Alert inhibition – Higher‑severity alerts can suppress lower‑severity ones when they share the same service, environment, source, instance, receiver, and name. Example inhibition rule:
inhibit_rules:
  - source_matchers:
      - _LEVEL="P4"
    target_matchers:
      - _LEVEL="P5"
    equal: ['_APP', '_ENV', '_SOURCE', '_INSTANCE', '_RECEIVER', '_RECEIVER_TYPE', '_NAME']
  - source_matchers:
      - _LEVEL="P3"
    target_matchers:
      - _LEVEL=~"P4|P5"
    equal: ['_APP', '_ENV', '_SOURCE', '_INSTANCE', '_RECEIVER', '_RECEIVER_TYPE', '_NAME']
  # ... other levels omitted

4 Multi-Channel Notification Mechanism
The system supports WeChat, WeChat groups, email, webhook, SMS, and voice calls. A single alert can specify multiple channels; the platform automatically splits the alert per receiver.
AlertManager.builder()
        .wechat("zhangsan", "lisi")        // Enterprise WeChat
        .wechatRobot("WeChatGroupBotKey")  // WeChat group robot
        .mail("[email protected]", "[email protected]")
        .webhook("http://www.example.com")
        .sms("188xxxxxxx", "180xxxxxxx")
        .phone("188xxxxxxx", "180xxxxxxx")
        .build();

5 Unrecovered Alerts
Each alert includes a link to view currently unrecovered alerts, which can be in three states: active, silenced, or inhibited. Active alerts can be silenced, and silenced alerts show who performed the silencing.
6 Silent Alerts
Using Alertmanager OpenAPI, users can create silence rules based on labels. The UI pre‑populates matching labels, and users can add or remove dimensions (e.g., drop INSTANCE to silence all machines). The RECEIVER field is mandatory.
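A silence is created by POSTing to Alertmanager's /api/v2/silences endpoint. The payload below is a generic example of that API (the field names are Alertmanager's; the matcher values, user, and times are illustrative):

```json
{
  "matchers": [
    { "name": "_APP",      "value": "order-service", "isRegex": false },
    { "name": "_RECEIVER", "value": "zhangsan",      "isRegex": false }
  ],
  "startsAt": "2024-01-01T00:00:00Z",
  "endsAt":   "2024-01-01T02:00:00Z",
  "createdBy": "zhangsan",
  "comment": "Planned maintenance"
}
```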
7 Alert History
Alert notifications are retained for three months, allowing users to review past alerts.
8 Conclusion
Alertmanager provides fundamental alert deduplication and suppression capabilities. By extending it with label standards, severity‑based grouping, inhibition, merging, multi‑channel delivery, unrecovered‑alert links, silencing, and history retention, ZhaiZhai mitigated alert flooding. While not a silver bullet, these practices solve most alert‑noise problems and can be adopted by other teams facing similar challenges.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.