
Refactoring Alertmanager: Reducing Noise, Improving Escalation, Suppression, and Silence Management

This article shares practical experiences and solutions for improving an Alertmanager‑based alert system, addressing problems such as noisy alerts, lack of escalation, missing recovery notifications, suppression limitations, and cumbersome silence management by redesigning architecture, adding custom scripts, and extending database support.

Aikesheng Open Source Community

Alerting is tightly coupled with operations; a good alert system boosts efficiency and reduces fatigue, while a poor one generates unnecessary noise, especially during off‑hours or when many alerts fire simultaneously.

The author’s environment uses a Prometheus + Alertmanager stack, and the article discusses the challenges encountered and the recent refactor project.

Pre‑work

The production setup runs one Prometheus per cluster, all reporting to a single shared Alertmanager instance. The current Alertmanager version is shown:

alertmanager, version 0.17.0 (branch: HEAD, revision: c7551cd75c414dc81df027f691e2eb21d4fd85b2)
  build user:       root@932a86a52b76
  build date:       20190503-09:10:07
  go version:       go1.12.4

Identified Issues

Alert interference: alerts from one cluster appear in another due to a shared Alertmanager.

No automatic escalation: alerts are not promoted to higher‑priority channels or personnel.

Missing recovery notifications.

Limited suppression: fixed intervals, not adaptive to work‑hours vs. off‑hours.

Silence management is cumbersome; the UI often fails to pre‑fill silence rules.

Alertmanager does not support voice alerts.

New Problems After Splitting Alertmanager

To solve interference, the team deployed one Alertmanager per cluster, which introduced new challenges:

How to achieve alert convergence (deduplication) across many instances?

How to manage silences when multiple Alertmanager instances exist?

Refactor Solutions

1. Reducing Interference

Deploying a dedicated Alertmanager per cluster isolates alerts. The collection of all Alertmanager instances can be treated like a single database for easier management.

2. Escalation Logic

Alerts are sent via different media (email → WeChat → SMS → phone) based on time of day and repeat count. A custom script reads active alerts from Alertmanager and dispatches messages accordingly.

Send a detailed message to the DBA by Mail  # always send email
    if 8 < now_time < 22:
        Send a simple message to the DBA by WX
    else:
        if alert_count > 3 and phone_count < 3:
            Send a simple message to the DBA by phone
        elif alert_count > 3 and phone_count >= 3:
            Send a simple message to the leader by phone
        else:
            Send a simple message to the DBA by SMS
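The pseudocode above can be expressed as a small, testable function. This is a sketch, not the author's exact implementation: the function and channel names are hypothetical, and actual dispatch is assumed to happen elsewhere.

```python
def pick_channels(hour, alert_count, phone_count):
    """Decide which channels to notify, mirroring the escalation pseudocode.

    hour        -- current local hour (0-23)
    alert_count -- how many times this alert has fired today
    phone_count -- how many phone calls were already made for it
    Returns a list of (channel, recipient) tuples; email always goes out.
    """
    channels = [("mail", "dba")]              # detailed mail is always sent
    if 8 < hour < 22:                          # working hours: low-noise WeChat ping
        channels.append(("wx", "dba"))
    elif alert_count > 3 and phone_count < 3:
        channels.append(("phone", "dba"))      # persistent off-hours alert: call the DBA
    elif alert_count > 3 and phone_count >= 3:
        channels.append(("phone", "leader"))   # DBA unreachable: escalate to the leader
    else:
        channels.append(("sms", "dba"))        # off-hours, not yet urgent
    return channels
```

Keeping the decision pure like this makes the escalation ladder easy to unit-test against boundary cases (working hours, the phone-call cap, first-time off-hours alerts).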

3. Convergence & Suppression

A MySQL table tb_alert_for_task records each alert’s unique key, state, count, and next scheduled time. Sample DDL:

CREATE TABLE `tb_alert_for_task` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `alert_task` varchar(100) DEFAULT '' COMMENT 'alert item (unique key)',
  `alert_state` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'alert state: 0 = recovered, 1 = firing',
  `alert_count` int(11) NOT NULL DEFAULT '0' COMMENT 'number of notifications sent; each alert item sends at most 10 per day',
  `u_time` datetime NOT NULL DEFAULT '2021-12-08 00:00:00' COMMENT 'time of the next scheduled notification',
  `alert_remarks` varchar(50) NOT NULL DEFAULT '' COMMENT 'alert content',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_alert_task` (`alert_task`)
) ENGINE=InnoDB AUTO_INCREMENT=7049 DEFAULT CHARSET=utf8mb4;

The script queries active alerts via api/v1/alerts?silenced=false&inhibited=false, extracts the alertname, cluster, and instance labels, and stores them in three dictionaries (global_alert_name, global_alert_cluster, global_alert_host) that determine the convergence dimension.

import json
import requests

header = {"Content-Type": "application/json"}  # request headers used by the script (value assumed)

def f_get_alert_to_msg(url):
    try:
        res = json.loads(requests.get(url, headers=header, timeout=10).text)
    except Exception as err:
        return {"code": 1, "info": str(err)}
    for temp in res["data"]:
        cluster_name = temp["labels"]["cluster"]
        alert_name = temp["labels"]["alertname"]
        # cluster-level alerts carry no instance label; fall back to "<cluster>:all"
        instance_name = cluster_name + ":all"
        if "instance" in temp["labels"]:
            instance_name = temp["labels"]["instance"]
        # ...populate global dictionaries...
    return {"code": 0, "info": "ok"}
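The elided "populate global dictionaries" step could look like the sketch below: count active alerts along each of the three dimensions so a later step can pick the coarsest dimension that still describes the outage. The function name and return shape are assumptions for illustration, not the original script.

```python
from collections import defaultdict

def bucket_alerts(alerts):
    """Count active alerts by alertname, cluster, and instance.

    alerts -- iterable of label dicts, as extracted from api/v1/alerts.
    Returns three counters, mirroring the roles of global_alert_name,
    global_alert_cluster, and global_alert_host.
    """
    by_name = defaultdict(int)
    by_cluster = defaultdict(int)
    by_host = defaultdict(int)
    for labels in alerts:
        cluster = labels["cluster"]
        name = labels["alertname"]
        # same fallback as above: no instance label means a cluster-level alert
        instance = labels.get("instance", cluster + ":all")
        by_name[name] += 1
        by_cluster[cluster] += 1
        by_host[instance] += 1
    return by_name, by_cluster, by_host
```

For example, when many instances of one cluster fire the same alertname, the cluster counter dominates and the alerts can be converged into a single cluster-level notification instead of one message per host.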

Suppression logic checks whether the next scheduled time is in the future or the alert count exceeds ten; matching alerts are skipped.

# suppression logic
select_sql = "select alert_task from tb_alert_for_task where alert_state = 1 and (u_time > now() or alert_count > 10);"
state, skip_instance = connect_mysql(opt="select", sql={"sql": select_sql})
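Before dispatch, the returned keys can simply be filtered out of the candidate list. A minimal sketch (filter_suppressed is a hypothetical helper, not part of the original script):

```python
def filter_suppressed(candidates, skip_tasks):
    """Drop alerts whose unique key is currently suppressed.

    candidates -- alert_task keys about to be dispatched
    skip_tasks -- keys returned by the suppression query (next scheduled
                  time still in the future, or more than 10 sends today)
    """
    skip = set(skip_tasks)  # set lookup keeps this O(n) for large alert storms
    return [task for task in candidates if task not in skip]
```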

4. Recovery Handling

When a record remains in the alert table but its instance no longer appears in the active list, the script marks the alert as recovered and resets the count.

# 'tidb集群告警' means "TiDB cluster alert" (the literal stored in alert_remarks)
sql = """select alert_task from tb_alert_for_task where alert_state = 1 and alert_remarks = 'tidb集群告警' and u_time < date_add(now(), INTERVAL - 1 MINUTE);"""
state, alert_instance = connect_mysql(opt="select", sql={"sql": sql})
# update state to 0 for recovered alerts
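The update step itself can be sketched as building one UPDATE for all recovered keys. build_recovery_sql is a hypothetical helper; the table name follows the DDL above, and execution through connect_mysql is assumed to happen as in the earlier snippets.

```python
def build_recovery_sql(recovered_tasks):
    """Build the parameterized UPDATE that marks alerts recovered and
    resets their daily counter, for keys that dropped out of the
    active-alert list.
    """
    if not recovered_tasks:
        return None  # nothing recovered this cycle
    # one %s placeholder per key, so the driver can bind values safely
    placeholders = ", ".join(["%s"] * len(recovered_tasks))
    return ("update tb_alert_for_task set alert_state = 0, alert_count = 0 "
            "where alert_task in ({})".format(placeholders))
```

Using placeholders rather than string interpolation keeps the alert keys (which contain user-controlled label values) out of the SQL text.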

5. Silence Management

Adding a silence uses the /api/v1/silences endpoint; the request must include start/end times in UTC and a single name‑value matcher. The script validates the expiration (max 24 h) and builds the JSON payload.


try:
    expi_time = int(expi_time)  # expiration in hours
except Exception as err:
    return {"code": 1, "info": str(err)}
if expi_time > 24:
    return {"code": 1, "info": "an alert cannot be silenced for more than 24 hours"}
# build UTC timestamps and payload dictionary
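The payload construction might look like the sketch below. Field names follow the Alertmanager v1 silences API (matchers, startsAt, endsAt, createdBy, comment); the created_by default and the comment text are placeholder values, not taken from the original script.

```python
from datetime import datetime, timedelta

def build_silence(name, value, expi_hours, created_by="dba-script"):
    """Build the JSON body for POST /api/v1/silences.

    Times are UTC in RFC 3339 form, as Alertmanager expects; a single
    name/value matcher mutes exactly one label pair.
    """
    if expi_hours > 24:
        raise ValueError("an alert cannot be silenced for more than 24 hours")
    now = datetime.utcnow()
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [{"name": name, "value": value, "isRegex": False}],
        "startsAt": now.strftime(fmt),
        "endsAt": (now + timedelta(hours=expi_hours)).strftime(fmt),
        "createdBy": created_by,
        "comment": "scripted silence",
    }
```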

Deletion of a silence first fetches the silence ID via /api/v1/silences?silenced=false&inhibited=false and then calls /api/v1/silence/{id} with a DELETE request.


for item in id_info["data"]:
    if item["status"]["state"] != "active":
        continue  # only active silences need to be deleted
    if item["matchers"][0]["name"] != name or item["matchers"][0]["value"] != value:
        continue  # not the silence we are looking for
    url = "http://xxx/api/v1/silence/" + item["id"]
    res = json.loads(requests.delete(url, timeout=10).text)

Key Takeaways

Without a unified platform, managing multiple Alertmanager instances is error‑prone.

Always set a reasonable timeout for silences to avoid forgotten mute periods.

Ensure you have visibility into which alerts are suppressed or escalated.

The author emphasizes that the presented solutions are reference implementations; they may need adaptation to different environments and thorough testing before production use.

Tags: monitoring, operations, alerting, Prometheus, Alertmanager, escalation, silencing
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
