
How to Build a Highly Available Alertmanager Cluster with Gossip

Learn to set up a highly available Alertmanager cluster using the Gossip protocol, covering deduplication, routing, HA architecture, required cluster parameters, systemd service files, and Prometheus integration, with step‑by‑step commands and configuration examples.

Ops Development Stories

Oct 19, 2021

This section explains how to build and configure a highly available Alertmanager cluster.

To improve Prometheus reliability, multiple Prometheus instances with identical configuration are deployed; if one goes down, the others keep evaluating rules and sending alerts. Because every instance sends the same alerts, Alertmanager deduplicates and groups them before routing them to receivers.

However, a single‑node Alertmanager is a single point of failure. Deploying multiple Alertmanager nodes in an HA setup avoids this, but requires a gossip mechanism to prevent duplicate notifications.

With gossip, a node that sends a notification for an alert informs the other nodes, so they do not send the same notification again.

Gossip mechanism

The alert processing flow includes the following stages:

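Stage     Description
Silence   Check whether the alert matches any silence rule; if so, stop processing.
Wait      Wait for index * 5 seconds, where index is the node's position in the cluster (in current Alertmanager releases the per-position step is set by --cluster.peer-timeout, 15s by default).
Dedup     Check the notification log for a record that this alert has already been sent; if found, stop.
Send      Send the notification if it has not been sent yet.
Gossip    Notify the other Alertmanager nodes that the notification was sent, so they record it.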

Key points of gossip:

Silences are kept identical on every node, so a silenced alert is not sent by any of them.

Gossip synchronizes notification state between nodes, and the Wait stage staggers the nodes so that they process an alert one after another rather than all at once.
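The staggering in the Wait stage is simple arithmetic, which the sketch below illustrates. It is not Alertmanager's code: wait_delay is a hypothetical helper, and the 5-second default step is the figure quoted in the stage list above (the real step is configurable, so treat the number as an assumption).

```shell
#!/bin/sh
# Hypothetical helper: seconds a node waits before trying to send,
# given its zero-based position in the cluster and a per-position step.
wait_delay() {
  position=$1
  step=${2:-5}   # assumed default step, taken from the stage description
  echo $((position * step))
}

# Show the delay for each node of a three-node cluster.
for position in 0 1 2; do
  echo "node at position $position waits $(wait_delay "$position")s"
done
```

The node at position 0 sends immediately. The others wait, and by the time their delay expires gossip has normally delivered the record of the sent notification, so their Dedup stage stops them.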

Setting up a local Alertmanager cluster

Before starting the cluster, understand the following parameters:

Parameter                                  Description
--cluster.listen-address="0.0.0.0:9094"    Address the cluster service listens on
--cluster.peer                             Address of an initial peer to join (may be repeated)
--cluster.advertise-address                Address advertised to the other nodes
--cluster.gossip-interval                  Interval between gossip messages (default 200ms)
--cluster.probe-interval                   Interval between probes of each node (default 1s)
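Since --cluster.peer may be given more than once, a node can be pointed at several initial peers instead of just one. A minimal sketch of generating those flags from a list of addresses follows; peer_flags is a hypothetical helper, and the addresses are the cluster ports used in this article's example.

```shell
#!/bin/sh
# Hypothetical helper: print one --cluster.peer flag per peer address.
peer_flags() {
  for peer in "$@"; do
    printf '%s=%s ' --cluster.peer "$peer"
  done
  echo
}

# Flags a third node could use to join via the first two nodes.
peer_flags 192.168.1.220:19094 192.168.1.220:29094
```

Listing more than one peer means a node can still join the cluster at startup when one of the listed peers happens to be down.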

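This example runs three instances on a single host (192.168.1.220), each on its own ports. Copy the existing Alertmanager directory once per instance and create a systemd service file for each: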

# Copy existing Alertmanager directories
cp -r alertmanager/ /usr/local/alertmanager01
cp -r alertmanager/ /usr/local/alertmanager02
cp -r alertmanager/ /usr/local/alertmanager03

# Alertmanager01 service
cat <<EOF > /lib/systemd/system/alertmanager01.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager01/bin/alertmanager \
  --config.file=/usr/local/alertmanager01/conf/alertmanager.yml \
  --storage.path=/usr/local/alertmanager01/data \
  --web.listen-address=":19093" \
  --cluster.listen-address=192.168.1.220:19094 \
  --log.level=debug
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

# Alertmanager02 service
cat <<EOF > /lib/systemd/system/alertmanager02.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager02/bin/alertmanager \
  --config.file=/usr/local/alertmanager02/conf/alertmanager.yml \
  --storage.path=/usr/local/alertmanager02/data \
  --web.listen-address=":29093" \
  --cluster.listen-address=192.168.1.220:29094 \
  --cluster.peer=192.168.1.220:19094 \
  --log.level=debug
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

# Alertmanager03 service
cat <<EOF > /lib/systemd/system/alertmanager03.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager03/bin/alertmanager \
  --config.file=/usr/local/alertmanager03/conf/alertmanager.yml \
  --storage.path=/usr/local/alertmanager03/data \
  --web.listen-address=":39093" \
  --cluster.listen-address=192.168.1.220:39094 \
  --cluster.peer=192.168.1.220:19094 \
  --log.level=debug
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd so it picks up the new unit files, then enable and start the services
systemctl daemon-reload
systemctl enable alertmanager01 alertmanager02 alertmanager03
systemctl start alertmanager01 alertmanager02 alertmanager03

After starting, open http://192.168.1.220:19093 and check the Status page; all three instances should be listed as cluster peers. In production, run each instance on a separate host with its own IP address.
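The same check can be made from the command line through Alertmanager's /api/v2/status endpoint. The sketch below is an assumption-laden illustration: count_peers is a hypothetical helper that simply counts "address" fields in the response, and the sample body is trimmed and made up so the script runs without a live cluster. A JSON tool such as jq would be more robust.

```shell
#!/bin/sh
# Hypothetical helper: count peer entries in a /api/v2/status response
# read from stdin, assuming each peer object carries an "address" field.
count_peers() {
  grep -o '"address"' | wc -l | tr -d ' '
}

# Against a live node (needs network access, so left commented out):
# curl -s http://192.168.1.220:19093/api/v2/status | count_peers

# Offline demonstration with a trimmed, made-up response body.
sample='{"cluster":{"status":"ready","peers":[{"name":"a","address":"192.168.1.220:19094"},{"name":"b","address":"192.168.1.220:29094"},{"name":"c","address":"192.168.1.220:39094"}]}}'
printf '%s\n' "$sample" | count_peers
```

A count lower than the number of instances you started suggests that a node failed to join, which is worth checking in its debug log.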

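Prometheus should point to every Alertmanager instance directly, so that each instance receives every alert. The alert_relabel_configs rule below strips the trailing digit from the dc label, so alerts from replicated Prometheus instances carry identical labels and can be deduplicated: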

alerting:
  alert_relabel_configs:
    - source_labels: [dc]
      regex: (.+)\d+
      target_label: dc
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.220:19093','192.168.1.220:29093','192.168.1.220:39093']

Reload Prometheus and visit http://192.168.1.220:19090/config to verify the configuration. To test HA, stop one Alertmanager node and trigger an alert; the notification should still be delivered, and only once.
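If no real alert is convenient to trigger, a test alert can be posted by hand to Alertmanager's /api/v2/alerts endpoint. This is a sketch under assumptions: alert_payload and send_test_alert are hypothetical helpers, and the alert name HATest and the severity label are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build a one-alert JSON payload for POST /api/v2/alerts.
alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"warning"}}]\n' "$1"
}

# Hypothetical helper: post the test alert to one Alertmanager (host:port).
send_test_alert() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "$(alert_payload HATest)" "http://$1/api/v2/alerts"
}

# Print the payload that would be sent.
alert_payload HATest

# With one node stopped (for example: systemctl stop alertmanager01),
# post the alert to the remaining nodes, as Prometheus would:
# send_test_alert 192.168.1.220:29093
# send_test_alert 192.168.1.220:39093
```

Posting to every remaining node mirrors what Prometheus does with the targets list above, so the receiver should see a single notification if deduplication is working.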
