
How to Build a Highly Available Alertmanager Cluster with Gossip

Learn to set up a highly available Alertmanager cluster using the Gossip protocol, covering deduplication, routing, HA architecture, required cluster parameters, systemd service files, and Prometheus integration, with step‑by‑step commands and configuration examples.

Ops Development Stories
This section explains how to build and configure a highly available Alertmanager cluster.

To improve Prometheus reliability, multiple Prometheus instances with identical configuration are deployed, so that monitoring continues if one goes down. Each instance sends the same alerts to Alertmanager, which deduplicates and groups them and routes them to receivers.

However, a single‑node Alertmanager is a single point of failure. Deploying multiple Alertmanager nodes in an HA setup avoids this, but requires a gossip mechanism to prevent duplicate notifications.

The gossip mechanism ensures that when one Alertmanager processes an alert, it informs other nodes, avoiding repeated alerts.

Gossip mechanism

The alert processing flow includes the following stages:

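Silence: Check whether the alert matches a silence rule; if it does, stop processing.

Wait: Wait for index * 5 seconds, where index is the node's position in the cluster. (The per-position delay is set by --cluster.peer-timeout, which defaults to 15s in current Alertmanager releases.)

Dedup: Check the local notification log for a record that the alert has already been sent; if one exists, stop.

Send: Send the notification if it has not been sent yet.

Gossip: Tell the other Alertmanager nodes that the notification was sent, so that they record it.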

Key points of gossip:

Silences must be identical on every node, so that a silenced alert is never sent by any of them.

Gossip synchronizes notification state between nodes, and the Wait stage staggers the nodes so that they handle an alert in turn rather than all at once.
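The interaction between Wait, Dedup, Send, and Gossip can be sketched as a toy model. This is not Alertmanager's implementation: the `SENT_LOG` variable is a hypothetical stand-in for the notification log that gossip keeps in sync, the nodes run one after another instead of concurrently, and the delay is only printed, not slept.

```shell
#!/bin/sh
# Toy model of the Wait -> Dedup -> Send -> Gossip stages.
# Assumption: SENT_LOG stands in for the gossip-synchronized notification log.

PEER_TIMEOUT=5   # seconds per cluster position, the figure used in the stage list
SENT_LOG=""      # space-separated names of alerts already notified

# process_alert INDEX ALERT: handle one alert on the node at position INDEX.
process_alert() {
    index=$1
    alert=$2
    delay=$((index * PEER_TIMEOUT))          # Wait stage
    case " $SENT_LOG " in
        *" $alert "*)                        # Dedup stage: already recorded
            echo "node$index: after ${delay}s, $alert already sent, skipping"
            ;;
        *)                                   # Send, then record (Gossip)
            echo "node$index: after ${delay}s, sending $alert"
            SENT_LOG="$SENT_LOG $alert"
            ;;
    esac
}

for i in 0 1 2; do
    process_alert "$i" HighCPU
done
```

Only node0 sends; node1 and node2 find the alert in the shared log after their longer waits and skip it. If node0 were down, node1 would find nothing in the log and send instead.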

Setting up a local Alertmanager cluster

Before starting the cluster, understand the following parameters:

--cluster.listen-address="0.0.0.0:9094": Address the cluster (gossip) service listens on.

--cluster.peer: Address of an initial peer to join; may be repeated to list several peers.

--cluster.advertise-address: Address advertised to the other cluster members.

--cluster.gossip-interval: Interval between gossip messages (default 200ms).

--cluster.probe-interval: Interval between health probes of other nodes (default 1s).
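Because --cluster.peer may be repeated, each node can list every other node as an initial peer rather than depending on a single seed node. The helper below is a hypothetical sketch (the function name `peer_flags` is not part of Alertmanager) that builds those flags from a list of cluster addresses, using the addresses from this article's single-host example.

```shell
#!/bin/sh
# peer_flags SELF PEER...: print one --cluster.peer flag for every address
# except SELF, so that a node lists all other nodes as initial peers.
peer_flags() {
    self=$1
    shift
    flags=""
    for peer in "$@"; do
        if [ "$peer" != "$self" ]; then
            flags="$flags --cluster.peer=$peer"
        fi
    done
    # Strip the leading space before printing.
    echo "${flags# }"
}

# Flags for the first instance of the three-instance example.
peer_flags 192.168.1.220:19094 \
    192.168.1.220:19094 192.168.1.220:29094 192.168.1.220:39094
```

The output can be appended to the instance's ExecStart line. Whether to list all peers or a single seed is a design choice; listing all of them avoids relying on one node being up when the others start.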

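This example runs three Alertmanager instances on a single host (192.168.1.220), each in its own directory and on its own ports. Copy the existing Alertmanager directory three times and create one systemd service file per instance: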

<code># Copy existing Alertmanager directories
cp -r alertmanager/ /usr/local/alertmanager01
cp -r alertmanager/ /usr/local/alertmanager02
cp -r alertmanager/ /usr/local/alertmanager03

# Alertmanager01 service
cat <<EOF > /lib/systemd/system/alertmanager01.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager01/bin/alertmanager \
  --config.file=/usr/local/alertmanager01/conf/alertmanager.yml \
  --storage.path=/usr/local/alertmanager01/data \
  --web.listen-address=":19093" \
  --cluster.listen-address=192.168.1.220:19094 \
  --log.level=debug
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

# Alertmanager02 service
cat <<EOF > /lib/systemd/system/alertmanager02.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager02/bin/alertmanager \
  --config.file=/usr/local/alertmanager02/conf/alertmanager.yml \
  --storage.path=/usr/local/alertmanager02/data \
  --web.listen-address=":29093" \
  --cluster.listen-address=192.168.1.220:29094 \
  --cluster.peer=192.168.1.220:19094 \
  --log.level=debug
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

# Alertmanager03 service
cat <<EOF > /lib/systemd/system/alertmanager03.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager03/bin/alertmanager \
  --config.file=/usr/local/alertmanager03/conf/alertmanager.yml \
  --storage.path=/usr/local/alertmanager03/data \
  --web.listen-address=":39093" \
  --cluster.listen-address=192.168.1.220:39094 \
  --cluster.peer=192.168.1.220:19094 \
  --log.level=debug
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd so it picks up the new unit files, then enable and start the services
systemctl daemon-reload
systemctl enable alertmanager01 alertmanager02 alertmanager03
systemctl start alertmanager01 alertmanager02 alertmanager03
</code>

After starting the services, open http://192.168.1.220:19093 and check the Status page to view the cluster status. In production, run each instance on a separate node with its own IP address.
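Cluster membership can also be checked from the command line. The sketch below assumes that each instance exposes the `alertmanager_cluster_members` gauge on its /metrics endpoint, as recent Alertmanager releases do, and that curl is installed; the function names are illustrative.

```shell
#!/bin/sh
# cluster_members: read Prometheus-format metrics on stdin and print the
# value of alertmanager_cluster_members (this node's view of cluster size).
cluster_members() {
    awk '$1 == "alertmanager_cluster_members" { print $2 }'
}

# check_cluster URL EXPECTED: fetch /metrics from one instance and compare
# the reported member count with EXPECTED.
# Requires curl; nothing is fetched until this function is called.
check_cluster() {
    members=$(curl -fsS "$1/metrics" | cluster_members)
    if [ "$members" = "$2" ]; then
        echo "OK: $1 sees $members members"
    else
        echo "WARN: $1 sees ${members:-0} members, expected $2"
        return 1
    fi
}
```

For the example cluster, `check_cluster http://192.168.1.220:19093 3` would be run against each of the three web ports in turn.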

Configure Prometheus to send alerts to every Alertmanager instance directly, not through a load balancer, so that each instance receives every alert:

<code>alerting:
  alert_relabel_configs:
    - source_labels: [dc]
      regex: (.+)\d+
      target_label: dc
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.220:19093','192.168.1.220:29093','192.168.1.220:39093']
</code>
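The alert_relabel_configs rule in this configuration rewrites the dc label before alerts are sent. Prometheus anchors the regex to the whole label value and the default replacement is $1, so `(.+)\d+` drops the final digit of the value. A plausible purpose, which the article does not state, is to make replica Prometheus instances labeled dc1, dc2, and so on send alerts with identical labels. The sketch below imitates the rule with sed; the function name is hypothetical.

```shell
#!/bin/sh
# strip_last_digit VALUE: mimic the relabel regex (.+)\d+ with the default
# replacement $1. The greedy (.+) leaves only the final digit for \d+.
strip_last_digit() {
    printf '%s\n' "$1" | sed -E 's/^(.+)[0-9]+$/\1/'
}

strip_last_digit dc1    # prints: dc
strip_last_digit dc2    # prints: dc
strip_last_digit east   # no trailing digit, printed unchanged: east
```

A value without a trailing digit does not match, and Prometheus likewise leaves the label untouched in that case.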

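Reload Prometheus and visit http://192.168.1.220:19090/config to verify that the configuration was applied. To test HA, stop one Alertmanager node and trigger an alert; the remaining nodes should still deliver the notification.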
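One way to trigger a test alert without waiting for a Prometheus rule to fire is to post it to Alertmanager's v2 API. This is a minimal sketch assuming curl is available; the alert name, the severity label, and the summary annotation are illustrative values, and the function names are hypothetical.

```shell
#!/bin/sh
# alert_payload NAME: print a one-alert JSON array for the v2 alerts API.
# The severity label and summary annotation are illustrative values.
alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"warning"},' "$1"
    printf '"annotations":{"summary":"HA failover test"}}]\n'
}

# send_test_alert URL NAME: POST the payload to one Alertmanager instance.
# Requires curl; nothing is sent until this function is called.
send_test_alert() {
    alert_payload "$2" | curl -fsS -X POST \
        -H 'Content-Type: application/json' \
        --data-binary @- "$1/api/v2/alerts"
}
```

For example, after `systemctl stop alertmanager02`, calling `send_test_alert http://192.168.1.220:19093 HATest` and then the same for port 39093 mimics Prometheus sending the alert to every remaining instance. The receiver should get a single notification.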

