Mastering PostgreSQL Replication: Diagnose Lag, Split‑Brain, and Fix Common Issues
This comprehensive guide walks through troubleshooting PostgreSQL physical (streaming) replication: environment prerequisites, anti‑pattern warnings, step‑by‑step diagnostics for replication lag, split‑brain scenarios, and replication slot problems, plus monitoring setup with Prometheus and best‑practice recommendations to keep your primary‑standby cluster healthy.
Applicable Scenarios & Prerequisites
Scenario: troubleshooting PostgreSQL primary‑standby streaming (physical) replication
OS: RHEL/CentOS 8.5+ or Ubuntu 20.04+
PostgreSQL version: 12+ (recommended 14+ / 15+ for better replication monitoring)
Hardware: Minimum 4C8G / Recommended 8C16G with SSD
Network: Master‑slave latency < 10 ms, bandwidth > 1 Gbps
Permissions: postgres user, superuser rights
Skills: Familiarity with PostgreSQL architecture, WAL mechanism, replication principles, SQL tuning
Anti‑Pattern Warnings (When Not to Use This Guide)
Logical replication issues – this guide focuses on physical (streaming) replication.
Managed cloud databases (AWS RDS, Alibaba Cloud RDS) – cannot directly access WAL files or replication slots.
PostgreSQL < 12 – some monitoring views and parameters are unavailable.
Non‑replication architectures – single‑node PostgreSQL or HA solutions like Pgpool‑II/Patroni require dedicated documentation.
Cross‑version replication – large version gaps (e.g., 12 → 15) may cause compatibility problems.
Alternative Solutions Comparison
Logical replication → use pglogical or native logical replication (different fault‑handling).
HA clusters → Patroni / Stolon (automatic failover mechanisms).
Cloud environments → cloud provider monitoring tools (no OS‑level access).
Cross‑major‑version replication → combine logical replication with pg_upgrade (physical replication does not support it).
Environment & Version Matrix
OS: RHEL 9.3 / CentOS Stream 9, Ubuntu 22.04 LTS (tested)
PostgreSQL: 14.10 (repo) / 15.5 (official packages) – both tested
Minimum specs: 4C8G / 100 GB SSD
Recommended specs: 8C16G / 500 GB NVMe SSD
Network: Gigabit NIC, latency < 10 ms
Version Differences
PostgreSQL 12 → 14: max_slot_wal_keep_size (13) and the slot‑monitoring columns wal_status / safe_wal_size in pg_replication_slots (13) added; pg_stat_wal view added (14).
PostgreSQL 14 → 15: improved logical‑replication publish/subscribe (row filters, column lists); wal_compression gains lz4/zstd methods.
Physical streaming replication requires the same major version on primary and standby; minor releases may differ, but keep them close (update standbys first).
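Before digging deeper, confirm both nodes actually run compatible releases; a minimal sketch (host placeholders and the postgres user are assumptions):
# Major versions must match for physical streaming replication
psql -h <primary_ip> -U postgres -tAc "SHOW server_version;"
psql -h <standby_ip> -U postgres -tAc "SHOW server_version;"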
Reading Navigation
Quick Start (≈30 min): Checklist → Common fault table → Corresponding fix steps
Deep Dive (≈90 min): Minimal principles → Diagnosis flowchart → Full troubleshooting steps → Best practices
Emergency Fault Handling: Fault classification → Diagnostic commands → Quick fix → Permanent solution
Checklist
Preparation
Backup primary and standby configuration files (postgresql.conf, pg_hba.conf)
Collect replication status (pg_stat_replication, pg_stat_wal_receiver)
Check WAL directory space (pg_wal/)
Verify network connectivity (ping between nodes)
Diagnosis
Check primary replication status: SELECT * FROM pg_stat_replication;
Check standby receiver status: SELECT * FROM pg_stat_wal_receiver;
Calculate replication lag (WAL LSN difference)
Inspect replication slot status
Repair
Apply fix steps based on fault type
Validate replication recovery (lag reduced, data consistent)
Clean up redundant WAL files if needed
Monitoring
Configure Prometheus + postgres_exporter
Set alert rules for lag, replication breaks, slot overflow, WAL directory size, standby reachability
Implementation Steps
Step 1: Replication Status Diagnosis
Goal: Quickly determine whether replication is normal and identify fault type.
1.1 Primary Replication Status Check
Connect to the primary: psql -U postgres -h <primary_ip> -d postgres
Run the diagnostic query:
SELECT pid, usename, application_name, client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, sync_priority, sync_state, reply_time
FROM pg_stat_replication;
Typical normal output shows state = streaming, sent_lsn ≈ replay_lsn, and a recent reply_time.
1.2 Standby Replication Status Check
Connect to standby and run:
SELECT pid, status, receive_start_lsn, written_lsn, flushed_lsn, last_msg_receipt_time, slot_name, sender_host
FROM pg_stat_wal_receiver; -- on PG 12 the column is received_lsn instead of written_lsn/flushed_lsn
Normal output shows status = streaming and recent timestamps.
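You can also express apply delay in time terms on the standby; a minimal sketch (pg_last_xact_replay_timestamp() returns NULL until at least one transaction has been replayed):
psql -U postgres -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"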
Step 2: Replication Lag Diagnosis & Repair
2.1 Lag Type Classification
Send lag – WAL generated on the primary but not yet sent (pg_current_wal_lsn() ahead of sent_lsn); points at primary load or sender limits.
Receive lag – WAL sent but not yet written on the standby (sent_lsn ahead of write_lsn); points at the network.
Flush lag – WAL written but not yet flushed to standby disk (write_lsn ahead of flush_lsn); points at standby IO.
Replay lag – WAL flushed but not yet applied (flush_lsn ahead of replay_lsn); points at standby CPU/IO or query conflicts.
2.2 Lag Diagnosis SQL
SELECT application_name, client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)/1024/1024 AS send_lag_mb,
pg_wal_lsn_diff(sent_lsn, write_lsn)/1024/1024 AS receive_lag_mb,
pg_wal_lsn_diff(write_lsn, flush_lsn)/1024/1024 AS flush_lag_mb,
pg_wal_lsn_diff(flush_lsn, replay_lsn)/1024/1024 AS replay_lag_mb,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)/1024/1024 AS total_lag_mb,
EXTRACT(EPOCH FROM (now() - reply_time)) AS reply_delay_seconds
FROM pg_stat_replication;
Interpretation:
send_lag_mb > 10 → primary is slow sending WAL (high load, insufficient max_wal_senders)
receive_lag_mb > 10 → network transmission slow (bandwidth, packet loss)
replay_lag_mb > 100 → standby apply slow (CPU/IO bottleneck, long transactions)
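On PG 10+ the primary also reports time‑based lag directly; these interval columns complement the byte figures above (they read NULL when the standby is fully caught up):
psql -U postgres -c "SELECT application_name, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"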
2.3 Primary‑Side (Send) Issues
Diagnostics:
SELECT count(*) AS active_connections,
count(*) FILTER (WHERE state = 'active') AS active_queries
FROM pg_stat_activity;
SELECT pg_current_wal_lsn(),
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')/1024/1024 AS total_wal_mb;
SELECT count(*) FROM pg_stat_replication;
SHOW max_wal_senders;
Repair steps:
-- Increase max_wal_senders (requires a server restart; a reload is not enough)
ALTER SYSTEM SET max_wal_senders = 10;
-- then: pg_ctl restart -D <data_dir>  (or systemctl restart postgresql)
-- Reduce primary load: terminate long‑running queries
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '10 minutes' AND pid != pg_backend_pid();
-- Enable WAL compression (available since 9.5; PG 15 adds lz4/zstd methods)
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
2.4 Network (Receive) Issues
Diagnostics:
# Ping test
ping -c 100 <standby_ip>
# Bandwidth test with iperf3
iperf3 -s # on primary
iperf3 -c <primary_ip> -t 60 # on standby
# Check retransmissions
netstat -s | grep -i retrans
Repair steps:
ALTER SYSTEM SET wal_sender_timeout = '120s';
SELECT pg_reload_conf();
-- Verify firewall / routing allow full TCP traffic between nodes
-- Enable TCP keepalives (settable via ALTER SYSTEM, then reload)
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 3;
SELECT pg_reload_conf();
2.5 Standby‑Side (Apply) Issues
Diagnostics:
SELECT pid, usename, application_name, state, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND pid != pg_backend_pid() ORDER BY query_start;
SELECT pg_blocking_pids(pid) AS blocking_pids, pid, usename, query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;
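-- Optional: recovery-conflict counters (run on the standby) help pinpoint
-- replay stalls caused by cancelled queries (locks, old snapshots, buffer pins)
SELECT datname, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;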
SELECT * FROM pg_stat_bgwriter;
Repair steps:
-- Terminate long‑running queries on standby
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '30 minutes' AND pid != pg_backend_pid();
-- Increase max_standby_streaming_delay
ALTER SYSTEM SET max_standby_streaming_delay = '300s'; -- 5 minutes
SELECT pg_reload_conf();
-- Upgrade hardware (SSD, RAM, CPU) and monitor IO with iostat
-- Speed up replay on PG 15+ by prefetching referenced blocks during recovery
-- (core PostgreSQL has no parallel WAL apply; replay itself is single-threaded)
ALTER SYSTEM SET recovery_prefetch = 'try';
SELECT pg_reload_conf();
Step 3: Split‑Brain Diagnosis & Repair
Scenario: Network partition promotes standby to primary while original primary remains running.
3.1 Split‑Brain Detection
SELECT pg_is_in_recovery(); -- f = primary, t = standby
-- If multiple nodes return f, split‑brain has occurred
SELECT timeline_id, redo_lsn FROM pg_control_checkpoint(); -- a differing timeline_id across nodes indicates a failover occurred
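To check every node in one pass from a jump host, a minimal bash sketch (the host list is a placeholder):
# More than one 'f' answer means two writable primaries, i.e. split-brain
for host in <primary_ip> <standby_ip>; do
  echo -n "$host: "
  psql -h "$host" -U postgres -tAc "SELECT pg_is_in_recovery();"
done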
3.2 Split‑Brain Repair (Rebuild Standby)
# 1. Stop the former primary (to be rebuilt as a standby)
ssh postgres@<former_primary_ip>
pg_ctl stop -D /var/lib/postgresql/15/main
# 2. Backup its data directory
tar -czf /backup/old_primary_$(date +%Y%m%d).tar.gz /var/lib/postgresql/15/main
# 3. Empty the data directory
rm -rf /var/lib/postgresql/15/main/*
# 4. Re‑basebackup from the new primary
pg_basebackup -h <new_primary_ip> -U replica -D /var/lib/postgresql/15/main -Fp -Xs -P -R
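#    Alternative to steps 3-4 for large clusters (assumption: wal_log_hints = on
#    or data checksums were enabled before the failover): pg_rewind copies only
#    the blocks that diverged instead of the whole cluster:
#    pg_rewind --target-pgdata=/var/lib/postgresql/15/main \
#              --source-server="host=<new_primary_ip> user=postgres dbname=postgres"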
# 5. Verify configuration (primary_conninfo points to new primary)
vim /var/lib/postgresql/15/main/postgresql.auto.conf
# 6. Start the former primary as standby
pg_ctl start -D /var/lib/postgresql/15/main
# 7. Verify replication
psql -U postgres -c "SELECT pg_is_in_recovery();" -- should return t
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"Step 4: Replication Slot Fault Diagnosis & Repair
Replication slots prevent the primary from removing WAL files that standbys still need.
4.1 Check Slot Status
SELECT slot_name, slot_type, database, active, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size
FROM pg_replication_slots;
Normal output shows active = t and a reasonable restart_lsn.
4.2 Slot Full (WAL Accumulation)
Diagnostics:
du -sh /var/lib/postgresql/15/main/pg_wal # expect < 5 GB, abnormal > 50 GB
ls -1 /var/lib/postgresql/15/main/pg_wal | wc -l  # expect < 100 files, abnormal > 1000
Repair steps:
-- Check slot lag
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)/1024/1024 AS slot_lag_mb FROM pg_replication_slots;
-- Drop inactive slot if standby is permanently offline
SELECT pg_drop_replication_slot('standby1_slot');
-- Increase max_slot_wal_keep_size (PG 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
-- Optionally clean WAL manually (use with caution)
pg_archivecleanup /var/lib/postgresql/15/main/pg_wal 000000010000000000000050
Step 5: Monitoring & Alerting Configuration
5.1 Prometheus + postgres_exporter
Install exporter:
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar -zxvf postgres_exporter-0.15.0.linux-amd64.tar.gz
cp postgres_exporter-0.15.0.linux-amd64/postgres_exporter /usr/local/bin/
# Create monitoring user
psql -U postgres -c "CREATE USER exporter WITH PASSWORD 'exporter_password';"
psql -U postgres -c "GRANT pg_monitor TO exporter;"
# Start exporter
export DATA_SOURCE_NAME="postgresql://exporter:exporter_password@localhost:5432/postgres?sslmode=disable"
postgres_exporter --web.listen-address=:9187 &
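Backgrounding the exporter with & does not survive reboots; a minimal systemd sketch (unit path, user, and credentials are assumptions):
cat > /etc/systemd/system/postgres_exporter.service <<'EOF'
[Unit]
Description=Prometheus postgres_exporter
After=network-online.target

[Service]
User=postgres
Environment=DATA_SOURCE_NAME=postgresql://exporter:exporter_password@localhost:5432/postgres?sslmode=disable
ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=:9187
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now postgres_exporter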
Prometheus scrape config (prometheus.yml):
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['192.168.1.10:9187']   # primary
        labels:
          role: 'primary'
      - targets: ['192.168.1.11:9187']   # standby
        labels:
          role: 'standby'
5.2 Key Monitoring Metrics
Replication lag (bytes): pg_stat_replication_pg_wal_lsn_diff{application_name="standby1"}
Replication lag (seconds): pg_replication_lag_seconds
Slot lag: pg_replication_slots_pg_wal_lsn_diff
WAL file count: pg_stat_wal_files
Standby online status: pg_up{role="standby"}
Number of streaming connections: count(pg_stat_replication_state{state="streaming"})
5.3 Alert Rules (postgresql.yml)
groups:
  - name: postgresql_replication
    interval: 30s
    rules:
      - alert: PostgreSQLReplicationLag
        expr: pg_stat_replication_pg_wal_lsn_diff > 100*1024*1024  # >100 MB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication lag too high ({{ $labels.instance }})"
          description: "Standby {{ $labels.application_name }} lag {{ $value | humanize }} B"
      - alert: PostgreSQLReplicationDown
        expr: pg_stat_replication_state{state="streaming"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication stopped ({{ $labels.instance }})"
          description: "Standby {{ $labels.application_name }} replication state abnormal"
      - alert: PostgreSQLReplicationSlotFull
        expr: pg_replication_slots_pg_wal_lsn_diff > 10*1024*1024*1024  # >10 GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication slot WAL accumulation ({{ $labels.instance }})"
          description: "Slot {{ $labels.slot_name }} lag {{ $value | humanize }} B"
      - alert: PostgreSQLWALDirectoryFull
        expr: pg_stat_wal_size_bytes > 50*1024*1024*1024  # >50 GB
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL directory too large ({{ $labels.instance }})"
          description: "pg_wal size {{ $value | humanize }} B"
      - alert: PostgreSQLStandbyUnreachable
        expr: pg_up{role="standby"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL standby unreachable ({{ $labels.instance }})"
          description: "Standby {{ $labels.instance }} offline for >5 min"
Fundamental Principles
WAL (Write‑Ahead Logging)
All data changes are first written to WAL before data files.
WAL files reside in pg_wal/ (formerly pg_xlog/).
Fixed size (default 16 MB) with names like 000000010000000000000001.
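To map an LSN to the WAL file that contains it (handy when inspecting pg_wal/ by hand), for example:
psql -U postgres -c "SELECT pg_walfile_name(pg_current_wal_lsn());"
# e.g. returns 000000010000000000000003 for an LSN of 0/3xxxxxx on timeline 1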
Replication Flow
Primary writes WAL → WAL sender reads WAL → streams via TCP → WAL receiver on standby writes to pg_wal/ → WAL apply replays changes → standby data matches primary
LSN (Log Sequence Number)
pg_current_wal_lsn() – current WAL position on the primary.
sent_lsn – position sent to the standby.
replay_lsn – position applied on the standby.
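LSNs are byte positions in the WAL stream, so lag in bytes is simple subtraction; pg_wal_lsn_diff() does the arithmetic, e.g.:
psql -U postgres -c "SELECT pg_wal_lsn_diff('0/3000170', '0/3000000');"
# returns 368 -- the second LSN is 0x170 = 368 bytes behind the first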
Replication Slots
Prevent primary from removing WAL needed by a standby.
Physical slot: slot_type = physical.
Logical slot: slot_type = logical.
Long‑offline standby can cause WAL accumulation.
Timeline ID
Incremented on each failover (primary promotion).
Different timelines cannot replicate directly; standby must be rebuilt.
Why Does Replication Lag Occur?
Primary writes faster than standby can apply (high TPS).
Network bandwidth or latency issues (cross‑region replication).
Standby resource bottlenecks (CPU/IO, lock contention, long queries).
Split‑Brain Risks
Network partition promotes standby to primary while original primary stays up.
Both nodes accept writes → data divergence.
Recovery requires manual selection of a single primary and rebuilding the other.
Common Faults & Debugging
Replication lag > 100 MB
Symptoms: high lag values.
Root causes: primary overload, standby apply bottleneck, network delay.
Quick fix: pause some writes, kill long queries on standby.
Permanent fix: upgrade standby hardware, optimize network, enable WAL compression.
Standby cannot connect
Symptoms: no rows in pg_stat_replication.
Root causes: standby not started, network blocked, missing pg_hba.conf entry.
Quick fix: start standby, open firewall.
Permanent fix: configure auto‑restart, adjust pg_hba.conf.
WAL directory too large
Symptoms: du -sh pg_wal/ shows huge size.
Root causes: replication slot full, standby offline long, archive command failure.
Quick fix: drop unused slots; freed WAL is recycled at the next checkpoint.
Permanent fix: set max_slot_wal_keep_size, monitor slot status.
Split‑brain (multiple primaries)
Detection: SELECT pg_is_in_recovery() returns false on more than one node.
Root cause: network partition + automatic failover.
Quick fix: manually pick one primary, demote others.
Permanent fix: use quorum/consensus HA tools (Patroni, Stolon) with a witness node.
Replication slot inactive
Cause: the standby's primary_slot_name does not match any slot on the primary.
Fix: correct primary_slot_name in postgresql.auto.conf and restart the standby (a reload suffices on PG 13+).
Timeline mismatch
Cause: failover created different timeline IDs.
Fix: rebuild standby with pg_basebackup.
Debugging Thought Process (Systematic Checklist)
Diagnostic Command Collection
# Primary checks
psql -U postgres -c "SELECT * FROM pg_stat_replication;"
psql -U postgres -c "SELECT * FROM pg_replication_slots;"
psql -U postgres -c "SELECT pg_current_wal_lsn();"
# Standby checks
psql -U postgres -c "SELECT pg_is_in_recovery();"
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
# Network checks
ping -c 10 <standby_ip>
iperf3 -c <primary_ip> -t 10
# Log checks
tail -f /var/log/postgresql/postgresql-15-main.log | grep -E "(replication|recovery|wal)"
# WAL directory checks
du -sh /var/lib/postgresql/15/main/pg_wal
ls -lh /var/lib/postgresql/15/main/pg_wal | tail -20
# Performance checks
iostat -x 1 10
top -H -p "$(pgrep -d',' postgres)"  # -d',' joins the PIDs so top accepts the list
Best Practices
Use replication slots (strongly recommended).
Configure synchronous replication for critical workloads (synchronous_standby_names, synchronous_commit = on).
Monitor key metrics: replication lag, WAL directory size, standby heartbeat.
Perform regular fault‑injection drills (quarterly primary failure, split‑brain recovery).
Optimize WAL settings (see the consolidated sketch after this list): wal_level = replica, max_wal_senders = 10, wal_keep_size = 5GB, max_slot_wal_keep_size = 50GB, wal_compression = on, wal_sender_timeout = 60s.
Standby configuration: hot_standby = on, max_standby_streaming_delay = 300s, wal_receiver_timeout = 60s.
Secure pg_hba.conf for replication connections.
Consider automated HA tools (Patroni, Stolon) for automatic failover and node rebuild.
Implement robust backup strategy: physical base backups + WAL archiving, periodic logical dumps, monthly restore tests.
Cross‑region replication tips: use cascading replication, enable WAL compression, increase timeout parameters.
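The WAL settings recommended above, applied in one pass; a sketch (values are this guide's starting points, tune per workload):
psql -U postgres <<'SQL'
ALTER SYSTEM SET wal_level = 'replica';           -- requires restart
ALTER SYSTEM SET max_wal_senders = 10;            -- requires restart
ALTER SYSTEM SET wal_keep_size = '5GB';
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET wal_compression = on;
ALTER SYSTEM SET wal_sender_timeout = '60s';
SELECT pg_reload_conf();                          -- applies the reloadable settings
SQL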
FAQ
Q1: What is the difference between physical and logical replication? A: Physical replication copies the entire database cluster (read‑only standby). Logical replication copies selected tables or databases and can allow writes on the subscriber if there are no conflicts. This guide focuses on physical replication.
Q2: How much replication lag is considered normal? A: Normal < 100 MB (< 5 s); warning 100 MB – 1 GB (5 s – 1 min); severe > 1 GB (> 1 min).
Q3: Can a standby write data? A: A physical standby is read‑only (hot_standby = on). Attempts to write return “cannot execute INSERT in a read-only transaction”.
Q4: How to quickly tell if a node is primary or standby? A: Run SELECT pg_is_in_recovery(); – f = primary, t = standby.
Q5: Difference between replication slots and wal_keep_size? A: wal_keep_size retains a minimum amount of WAL on the primary when no slots exist. Replication slots retain WAL for a specific standby, preventing premature removal. Using slots is the recommended approach.
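A slot set up end to end (the slot name standby1_slot is the example used elsewhere in this guide):
# On the primary: create the physical slot
psql -U postgres -c "SELECT pg_create_physical_replication_slot('standby1_slot');"
# On the standby: point the WAL receiver at it, then restart (a reload works on PG 13+)
psql -U postgres -c "ALTER SYSTEM SET primary_slot_name = 'standby1_slot';"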
Q6: How to promote a standby to primary? A: Use one of the following:
# pg_ctl promote
pg_ctl promote -D /var/lib/postgresql/15/main
# Trigger file (requires promote_trigger_file to be configured; removed in PG 16)
touch /tmp/postgresql.trigger.5432
# SQL function (PG 12+)
SELECT pg_promote();
Q7: How to prevent split‑brain? A: Deploy HA tools like Patroni or Stolon that use a distributed lock (etcd/Consul), configure a witness node, and require manual confirmation before promotion.
Q8: What problems arise from WAL accumulation? A: Disk exhaustion, PostgreSQL PANIC errors (cannot write WAL), possible system crash.
Q9: Fast way to rebuild a standby? A:
# Stop standby
pg_ctl stop -D /var/lib/postgresql/15/main
# Clean data directory
rm -rf /var/lib/postgresql/15/main/*
# Re‑basebackup from primary
pg_basebackup -h <primary_ip> -U replica -D /var/lib/postgresql/15/main -Fp -Xs -P -R
# Start standby
pg_ctl start -D /var/lib/postgresql/15/main
Q10: Does replication lag cause data loss? A: In asynchronous replication, yes – un‑synced transactions may be lost if the primary fails. Synchronous replication waits for standby acknowledgment, preventing loss at the cost of higher latency.
Further Reading
Official PostgreSQL replication docs: https://www.postgresql.org/docs/current/high-availability.html
WAL configuration: https://www.postgresql.org/docs/current/runtime-config-wal.html
Replication slots: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS
Deep technical blog: https://www.cybertec-postgresql.com/en/
Patroni HA solution: https://github.com/zalando/patroni
Tools & resources: pg_stat_statements, pgBadger, Bucardo (logical multi‑master)
Generated on 2025‑01‑15 using PostgreSQL 15.5 on Ubuntu 22.04 / RHEL 9.3.