How to Build a Highly Available Alertmanager Cluster with Gossip
Learn to set up a highly available Alertmanager cluster using the Gossip protocol, covering deduplication, routing, HA architecture, required cluster parameters, systemd service files, and Prometheus integration, with step‑by‑step commands and configuration examples.
This section explains how to build and configure a highly available Alertmanager cluster.
To improve Prometheus reliability, multiple Prometheus instances with identical configuration are usually deployed, so monitoring continues if one of them goes down. Every instance then sends the same alerts, and Alertmanager deduplicates and groups them before routing them to receivers.
However, a single‑node Alertmanager is a single point of failure. Deploying multiple Alertmanager nodes in an HA setup avoids this, but requires a gossip mechanism to prevent duplicate notifications.
The gossip mechanism ensures that when one Alertmanager processes an alert, it informs other nodes, avoiding repeated alerts.
Gossip mechanism
The alert processing flow includes the following stages:
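Stage | Description
Silence | Check whether the alert matches a silence; if it does, stop processing.
Wait | Wait for index × 5 seconds in this description, where index is the node's position in the cluster; the actual step is configurable with --cluster.peer-timeout.
Dedup | Check the notification log for a record that the alert was already sent; if one exists, stop.
Send | Send the notification if it has not been sent yet.
Gossip | Tell the other Alertmanager nodes that the notification was sent, so they record it too.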
Key points of gossip:
Silences are replicated to every node, so a silenced alert is not sent by any of them.
Gossip synchronizes notification state between nodes, and the Wait stage staggers the nodes by position, so they process an alert in sequence rather than all at once.
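The staggering in the Wait stage can be illustrated with a small shell sketch. It assumes a zero-based node position and the 5-second step from the description above. The function name `wait_seconds` and the variable `STEP_SECONDS` are hypothetical helpers for illustration, not part of Alertmanager.

```shell
# Sketch: how long each node waits before it runs its Dedup check.
# Assumption: position is zero-based and the step is 5 seconds;
# set STEP_SECONDS to match your own cluster if it differs.
STEP_SECONDS="${STEP_SECONDS:-5}"

wait_seconds() {
  # $1 is the node's position in the cluster.
  echo $(( $1 * STEP_SECONDS ))
}

for index in 0 1 2; do
  echo "node at position ${index} waits $(wait_seconds "${index}")s"
done
```

The node at position 0 sends immediately. Later nodes wait long enough to receive the gossip message and find the notification already recorded when they reach the Dedup stage.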
Setting up a local Alertmanager cluster
Before starting the cluster, understand the following parameters:
Parameter | Description
--cluster.listen-address | Address the cluster service listens on (default "0.0.0.0:9094")
--cluster.peer | Address of an initial peer to join; may be repeated
--cluster.advertise-address | Address advertised to the other nodes in the cluster
--cluster.gossip-interval | Interval between gossip messages (default 200ms)
--cluster.probe-interval | Interval between probes of the other nodes (default 1s)
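As a rough illustration of how these flags combine, the sketch below assembles the cluster arguments for one node. The helper name `cluster_flags` is hypothetical; only the flag names come from Alertmanager. The addresses are the ones used later in this article.

```shell
# Sketch: assemble the cluster flags for one node.
# Usage: cluster_flags <listen-address> [peer-address ...]
cluster_flags() {
  flags="--cluster.listen-address=$1"
  shift
  # --cluster.peer may be given once per initial peer.
  for peer in "$@"; do
    flags="${flags} --cluster.peer=${peer}"
  done
  echo "${flags}"
}

# Example: a node listening on port 29094 that joins the node on port 19094.
cluster_flags 192.168.1.220:29094 192.168.1.220:19094
```

The first node can start without any `--cluster.peer` flag; the nodes that join it learn about each other through gossip.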
Copy the existing Alertmanager directory once per instance (this example runs all three instances on one host, on different ports) and create a systemd service file for each:
<code># Copy existing Alertmanager directories
cp -r alertmanager/ /usr/local/alertmanager01
cp -r alertmanager/ /usr/local/alertmanager02
cp -r alertmanager/ /usr/local/alertmanager03
# Alertmanager01 service
cat <<'EOF' > /lib/systemd/system/alertmanager01.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager01/bin/alertmanager \
--config.file=/usr/local/alertmanager01/conf/alertmanager.yml \
--storage.path=/usr/local/alertmanager01/data \
--web.listen-address=":19093" \
--cluster.listen-address=192.168.1.220:19094 \
--log.level=debug
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
# Alertmanager02 service
cat <<'EOF' > /lib/systemd/system/alertmanager02.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager02/bin/alertmanager \
--config.file=/usr/local/alertmanager02/conf/alertmanager.yml \
--storage.path=/usr/local/alertmanager02/data \
--web.listen-address=":29093" \
--cluster.listen-address=192.168.1.220:29094 \
--cluster.peer=192.168.1.220:19094 \
--log.level=debug
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
# Alertmanager03 service
cat <<'EOF' > /lib/systemd/system/alertmanager03.service
[Unit]
Description=alertmanager
Documentation=https://prometheus.io/
After=network.target
StartLimitIntervalSec=0
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager03/bin/alertmanager \
--config.file=/usr/local/alertmanager03/conf/alertmanager.yml \
--storage.path=/usr/local/alertmanager03/data \
--web.listen-address=":39093" \
--cluster.listen-address=192.168.1.220:39094 \
--cluster.peer=192.168.1.220:19094 \
--log.level=debug
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
EOF
# Make systemd pick up the new unit files, then enable and start the services
systemctl daemon-reload
systemctl enable alertmanager01 alertmanager02 alertmanager03
systemctl start alertmanager01 alertmanager02 alertmanager03
</code>After the services start, open http://192.168.1.220:19093 to view the cluster status. In production, run each instance on a separate node with its own IP address.
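The cluster status can also be checked from the command line. The sketch below assumes `curl` is installed and that the node serves the v2 API at `/api/v2/status`. The helper names `count_peers` and `check_cluster` are hypothetical, and counting `"address"` fields with `grep` is a rough heuristic; a JSON tool such as `jq` would be more robust if it is available.

```shell
# Sketch: count the cluster peers reported by one Alertmanager node.
count_peers() {
  # Reads the status JSON on stdin. Each peer object is assumed to
  # carry exactly one "address" field.
  grep -o '"address"' | wc -l | tr -d ' '
}

check_cluster() {
  # $1 is the base URL of one node, e.g. http://192.168.1.220:19093
  curl -s "$1/api/v2/status" | count_peers
}

# Example (needs a running cluster):
# check_cluster http://192.168.1.220:19093
```

With the three instances from this article running and joined, the count should match the number of nodes.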
Prometheus configuration should point to all Alertmanager instances:
<code>alerting:
  alert_relabel_configs:
    - source_labels: [dc]
      regex: (.+)\d+
      target_label: dc
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.220:19093','192.168.1.220:29093','192.168.1.220:39093']
</code>Reload Prometheus and visit http://192.168.1.220:19090/config to verify the configuration. To test HA, stop one Alertmanager node and trigger an alert.
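One way to trigger a test alert without waiting for a Prometheus rule to fire is to post one directly to the Alertmanager API. The sketch below assumes `curl` is installed and that the nodes accept alerts at `/api/v2/alerts`. The alert name `HATest`, the `severity` label, and the helper names are made up for illustration.

```shell
# Sketch: build a minimal test alert and post it to one Alertmanager node.
alert_payload() {
  # $1 is the alert name; the API expects a JSON array of alerts.
  printf '[{"labels":{"alertname":"%s","severity":"warning"}}]' "$1"
}

send_test_alert() {
  # $1 is the base URL of one node.
  alert_payload "HATest" | curl -s -X POST \
    -H 'Content-Type: application/json' \
    --data @- "$1/api/v2/alerts"
}

# Example HA test: stop one node, then post to each remaining node,
# as Prometheus would do.
# systemctl stop alertmanager01
# send_test_alert http://192.168.1.220:29093
# send_test_alert http://192.168.1.220:39093
```

If the cluster is working, the receiver should get a single notification even though two nodes received the alert.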
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.