Mastering PostgreSQL Replication: Diagnose Lag, Split‑Brain, and Fix Common Issues
This comprehensive guide walks through troubleshooting PostgreSQL physical (streaming) replication: environment prerequisites, anti‑pattern warnings, step‑by‑step diagnostics for replication lag, split‑brain scenarios, and replication slot problems, plus monitoring setup with Prometheus and best‑practice recommendations to keep your primary‑standby cluster healthy.
Applicable Scenarios & Prerequisites
Scenario: troubleshooting PostgreSQL primary‑standby streaming (physical) replication
OS: RHEL/CentOS 8.5+ or Ubuntu 20.04+
PostgreSQL version: 12+ (recommended 14+ / 15+ for better replication monitoring)
Hardware: Minimum 4C8G / Recommended 8C16G with SSD
Network: Master‑slave latency < 10 ms, bandwidth > 1 Gbps
Permissions: postgres user, superuser rights
Skills: Familiarity with PostgreSQL architecture, WAL mechanism, replication principles, SQL tuning
Anti‑Pattern Warnings (When Not to Use This Guide)
Logical replication issues – this guide focuses on physical (streaming) replication.
Managed cloud databases (AWS RDS, Alibaba Cloud RDS) – cannot directly access WAL files or replication slots.
PostgreSQL < 12 – some monitoring views and parameters are unavailable.
Non‑replication architectures – single‑node PostgreSQL or HA solutions like Pgpool‑II/Patroni require dedicated documentation.
Cross‑version replication – large version gaps (e.g., 12 → 15) may cause compatibility problems.
Alternative Solutions Comparison
Logical replication → use pglogical or native logical replication (different fault‑handling).
HA clusters → Patroni / Stolon (automatic failover mechanisms).
Cloud environments → cloud provider monitoring tools (no OS‑level access).
Cross‑major‑version replication → combine logical replication with pg_upgrade (physical replication does not support it).
Environment & Version Matrix
OS: RHEL 9.3 / CentOS Stream 9, Ubuntu 22.04 LTS (tested)
PostgreSQL: 14.10 (repo) / 15.5 (official packages) – both tested
Minimum specs: 4C8G / 100 GB SSD
Recommended specs: 8C16G / 500 GB NVMe SSD
Network: Gigabit NIC, latency < 10 ms
Version Differences
PostgreSQL 12 → 14: max_slot_wal_keep_size (13) and the slot‑monitoring columns wal_status / safe_wal_size in pg_replication_slots (13) added; pg_stat_wal view added (14).
PostgreSQL 14 → 15: improved logical‑replication publish/subscribe (row filters, column lists); wal_compression gains lz4/zstd methods.
Physical streaming replication requires the same major version on primary and standby; minor releases may differ, but keep them close (update standbys first).
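Before digging deeper, confirm both nodes actually run compatible releases; a minimal sketch (host placeholders and the postgres user are assumptions):
# Major versions must match for physical streaming replication
psql -h <primary_ip> -U postgres -tAc "SHOW server_version;"
psql -h <standby_ip> -U postgres -tAc "SHOW server_version;"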
Reading Navigation
Quick Start (≈30 min): Checklist → Common fault table → Corresponding fix steps
Deep Dive (≈90 min): Minimal principles → Diagnosis flowchart → Full troubleshooting steps → Best practices
Emergency Fault Handling: Fault classification → Diagnostic commands → Quick fix → Permanent solution
Checklist
Preparation
Backup primary and standby configuration files (postgresql.conf, pg_hba.conf)
Collect replication status (pg_stat_replication, pg_stat_wal_receiver)
Check WAL directory space (pg_wal/)
Verify network connectivity (ping between nodes)
Diagnosis
Check primary replication status: SELECT * FROM pg_stat_replication;
Check standby receiver status: SELECT * FROM pg_stat_wal_receiver;
Calculate replication lag (WAL LSN difference)
Inspect replication slot status
Repair
Apply fix steps based on fault type
Validate replication recovery (lag reduced, data consistent)
Clean up redundant WAL files if needed
Monitoring
Configure Prometheus + postgres_exporter
Set alert rules for lag, replication breaks, slot overflow, WAL directory size, standby reachability
Implementation Steps
Step 1: Replication Status Diagnosis
Goal: Quickly determine whether replication is normal and identify fault type.
1.1 Primary Replication Status Check
Connect to the primary: psql -U postgres -h <primary_ip> -d postgres
Run the diagnostic query:
SELECT pid, usename, application_name, client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, sync_priority, sync_state, reply_time
FROM pg_stat_replication;
Typical normal output shows state = streaming, sent_lsn ≈ replay_lsn, and a recent reply_time.
1.2 Standby Replication Status Check
Connect to standby and run:
SELECT pid, status, receive_start_lsn, written_lsn, flushed_lsn, last_msg_receipt_time, slot_name, sender_host
FROM pg_stat_wal_receiver; -- on PG 12 the column is received_lsn instead of written_lsn/flushed_lsn
Normal output shows status = streaming and recent timestamps.
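You can also express apply delay in time terms on the standby; a minimal sketch (pg_last_xact_replay_timestamp() returns NULL until at least one transaction has been replayed):
psql -U postgres -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;"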
Step 2: Replication Lag Diagnosis & Repair
2.1 Lag Type Classification
Send lag – WAL generated on the primary but not yet sent (pg_current_wal_lsn() ahead of sent_lsn); points at primary load or sender limits.
Receive lag – WAL sent but not yet written on the standby (sent_lsn ahead of write_lsn); points at the network.
Flush lag – WAL written but not yet flushed to standby disk (write_lsn ahead of flush_lsn); points at standby IO.
Replay lag – WAL flushed but not yet applied (flush_lsn ahead of replay_lsn); points at standby CPU/IO or query conflicts.
2.2 Lag Diagnosis SQL
SELECT application_name, client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)/1024/1024 AS send_lag_mb,
pg_wal_lsn_diff(sent_lsn, write_lsn)/1024/1024 AS receive_lag_mb,
pg_wal_lsn_diff(write_lsn, flush_lsn)/1024/1024 AS flush_lag_mb,
pg_wal_lsn_diff(flush_lsn, replay_lsn)/1024/1024 AS replay_lag_mb,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)/1024/1024 AS total_lag_mb,
EXTRACT(EPOCH FROM (now() - reply_time)) AS reply_delay_seconds
FROM pg_stat_replication;
Interpretation:
send_lag_mb > 10 → primary is slow sending WAL (high load, insufficient max_wal_senders)
receive_lag_mb > 10 → network transmission slow (bandwidth, packet loss)
replay_lag_mb > 100 → standby apply slow (CPU/IO bottleneck, long transactions)
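On PG 10+ the primary also reports time‑based lag directly; these interval columns complement the byte figures above (they read NULL when the standby is fully caught up):
psql -U postgres -c "SELECT application_name, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"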
2.3 Primary‑Side (Send) Issues
Diagnostics:
SELECT count(*) AS active_connections,
count(*) FILTER (WHERE state = 'active') AS active_queries
FROM pg_stat_activity;
SELECT pg_current_wal_lsn(),
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')/1024/1024 AS total_wal_mb;
SELECT count(*) FROM pg_stat_replication;
SHOW max_wal_senders;
Repair steps:
-- Increase max_wal_senders (requires a server restart; a reload is not enough)
ALTER SYSTEM SET max_wal_senders = 10;
-- then: pg_ctl restart -D <data_dir>  (or systemctl restart postgresql)
-- Reduce primary load: terminate long‑running queries
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '10 minutes' AND pid != pg_backend_pid();
-- Enable WAL compression (available since 9.5; PG 15 adds lz4/zstd methods)
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
2.4 Network (Receive) Issues
Diagnostics:
# Ping test
ping -c 100 <standby_ip>
# Bandwidth test with iperf3
iperf3 -s # on primary
iperf3 -c <primary_ip> -t 60 # on standby
# Check retransmissions
netstat -s | grep -i retrans
Repair steps:
ALTER SYSTEM SET wal_sender_timeout = '120s';
SELECT pg_reload_conf();
-- Verify firewall / routing allow full TCP traffic between nodes
-- Enable TCP keepalives (settable via ALTER SYSTEM, then reload)
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 3;
SELECT pg_reload_conf();
2.5 Standby‑Side (Apply) Issues
Diagnostics:
SELECT pid, usename, application_name, state, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND pid != pg_backend_pid() ORDER BY query_start;
SELECT pg_blocking_pids(pid) AS blocking_pids, pid, usename, query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;
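-- Optional: recovery-conflict counters (run on the standby) help pinpoint
-- replay stalls caused by cancelled queries (locks, old snapshots, buffer pins)
SELECT datname, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;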
SELECT * FROM pg_stat_bgwriter;
Repair steps:
-- Terminate long‑running queries on standby
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < now() - interval '30 minutes' AND pid != pg_backend_pid();
-- Increase max_standby_streaming_delay
ALTER SYSTEM SET max_standby_streaming_delay = '300s'; -- 5 minutes
SELECT pg_reload_conf();
-- Upgrade hardware (SSD, RAM, CPU) and monitor IO with iostat
-- Speed up replay on PG 15+ by prefetching referenced blocks during recovery
-- (core PostgreSQL has no parallel WAL apply; replay itself is single-threaded)
ALTER SYSTEM SET recovery_prefetch = 'try';
SELECT pg_reload_conf();
Step 3: Split‑Brain Diagnosis & Repair
Scenario: Network partition promotes standby to primary while original primary remains running.
3.1 Split‑Brain Detection
SELECT pg_is_in_recovery(); -- f = primary, t = standby
-- If multiple nodes return f, split‑brain has occurred
SELECT timeline_id, redo_lsn FROM pg_control_checkpoint(); -- a differing timeline_id across nodes indicates a failover occurred
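To check every node in one pass from a jump host, a minimal bash sketch (the host list is a placeholder):
# More than one 'f' answer means two writable primaries, i.e. split-brain
for host in <primary_ip> <standby_ip>; do
  echo -n "$host: "
  psql -h "$host" -U postgres -tAc "SELECT pg_is_in_recovery();"
done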
3.2 Split‑Brain Repair (Rebuild Standby)
# 1. Stop the former primary (to be rebuilt as a standby)
ssh postgres@<former_primary_ip>
pg_ctl stop -D /var/lib/postgresql/15/main
# 2. Backup its data directory
tar -czf /backup/old_primary_$(date +%Y%m%d).tar.gz /var/lib/postgresql/15/main
# 3. Empty the data directory
rm -rf /var/lib/postgresql/15/main/*
# 4. Re‑basebackup from the new primary
pg_basebackup -h <new_primary_ip> -U replica -D /var/lib/postgresql/15/main -Fp -Xs -P -R
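#    Alternative to steps 3-4 for large clusters (assumption: wal_log_hints = on
#    or data checksums were enabled before the failover): pg_rewind copies only
#    the blocks that diverged instead of the whole cluster:
#    pg_rewind --target-pgdata=/var/lib/postgresql/15/main \
#              --source-server="host=<new_primary_ip> user=postgres dbname=postgres"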
# 5. Verify configuration (primary_conninfo points to new primary)
vim /var/lib/postgresql/15/main/postgresql.auto.conf
# 6. Start the former primary as standby
pg_ctl start -D /var/lib/postgresql/15/main
# 7. Verify replication
psql -U postgres -c "SELECT pg_is_in_recovery();" -- should return t
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"Step 4: Replication Slot Fault Diagnosis & Repair
Replication slots prevent the primary from removing WAL files that standbys still need.
4.1 Check Slot Status
SELECT slot_name, slot_type, database, active, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size
FROM pg_replication_slots;
Normal output shows active = t and a reasonable restart_lsn.
4.2 Slot Full (WAL Accumulation)
Diagnostics:
du -sh /var/lib/postgresql/15/main/pg_wal # expect < 5 GB, abnormal > 50 GB
ls -1 /var/lib/postgresql/15/main/pg_wal | wc -l  # expect < 100 files, abnormal > 1000
Repair steps:
-- Check slot lag
SELECT slot_name, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)/1024/1024 AS slot_lag_mb FROM pg_replication_slots;
-- Drop inactive slot if standby is permanently offline
SELECT pg_drop_replication_slot('standby1_slot');
-- Increase max_slot_wal_keep_size (PG 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
-- Optionally clean WAL manually (use with caution)
pg_archivecleanup /var/lib/postgresql/15/main/pg_wal 000000010000000000000050
Step 5: Monitoring & Alerting Configuration
5.1 Prometheus + postgres_exporter
Install exporter:
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar -zxvf postgres_exporter-0.15.0.linux-amd64.tar.gz
cp postgres_exporter-0.15.0.linux-amd64/postgres_exporter /usr/local/bin/
# Create monitoring user
psql -U postgres -c "CREATE USER exporter WITH PASSWORD 'exporter_password';"
psql -U postgres -c "GRANT pg_monitor TO exporter;"
# Start exporter
export DATA_SOURCE_NAME="postgresql://exporter:exporter_password@localhost:5432/postgres?sslmode=disable"
postgres_exporter --web.listen-address=:9187 &
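Backgrounding the exporter with & does not survive reboots; a minimal systemd sketch (unit path, user, and credentials are assumptions):
cat > /etc/systemd/system/postgres_exporter.service <<'EOF'
[Unit]
Description=Prometheus postgres_exporter
After=network-online.target

[Service]
User=postgres
Environment=DATA_SOURCE_NAME=postgresql://exporter:exporter_password@localhost:5432/postgres?sslmode=disable
ExecStart=/usr/local/bin/postgres_exporter --web.listen-address=:9187
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now postgres_exporter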
Prometheus scrape config (prometheus.yml):
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['192.168.1.10:9187']   # primary
        labels:
          role: 'primary'
      - targets: ['192.168.1.11:9187']   # standby
        labels:
          role: 'standby'
5.2 Key Monitoring Metrics
Replication lag (bytes): pg_stat_replication_pg_wal_lsn_diff{application_name="standby1"}
Replication lag (seconds): pg_replication_lag_seconds
Slot lag: pg_replication_slots_pg_wal_lsn_diff
WAL file count: pg_stat_wal_files
Standby online status: pg_up{role="standby"}
Number of streaming connections: count(pg_stat_replication_state{state="streaming"})
5.3 Alert Rules (postgresql.yml)
groups:
  - name: postgresql_replication
    interval: 30s
    rules:
      - alert: PostgreSQLReplicationLag
        expr: pg_stat_replication_pg_wal_lsn_diff > 100*1024*1024  # >100 MB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication lag too high ({{ $labels.instance }})"
          description: "Standby {{ $labels.application_name }} lag {{ $value | humanize }} B"
      - alert: PostgreSQLReplicationDown
        expr: pg_stat_replication_state{state="streaming"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL replication stopped ({{ $labels.instance }})"
          description: "Standby {{ $labels.application_name }} replication state abnormal"
      - alert: PostgreSQLReplicationSlotFull
        expr: pg_replication_slots_pg_wal_lsn_diff > 10*1024*1024*1024  # >10 GB
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL replication slot WAL accumulation ({{ $labels.instance }})"
          description: "Slot {{ $labels.slot_name }} lag {{ $value | humanize }} B"
      - alert: PostgreSQLWALDirectoryFull
        expr: pg_stat_wal_size_bytes > 50*1024*1024*1024  # >50 GB
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL directory too large ({{ $labels.instance }})"
          description: "pg_wal size {{ $value | humanize }} B"
      - alert: PostgreSQLStandbyUnreachable
        expr: pg_up{role="standby"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL standby unreachable ({{ $labels.instance }})"
          description: "Standby {{ $labels.instance }} offline for >5 min"
Fundamental Principles
WAL (Write‑Ahead Logging)
All data changes are first written to WAL before data files.
WAL files reside in pg_wal/ (formerly pg_xlog/).
Fixed size (default 16 MB) with names like 000000010000000000000001.
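To map an LSN to the WAL file that contains it (handy when inspecting pg_wal/ by hand), for example:
psql -U postgres -c "SELECT pg_walfile_name(pg_current_wal_lsn());"
# e.g. returns 000000010000000000000003 for an LSN of 0/3xxxxxx on timeline 1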
Replication Flow
Primary writes WAL → WAL sender reads WAL → streams via TCP → WAL receiver on standby writes to pg_wal/ → WAL apply replays changes → standby data matches primary
LSN (Log Sequence Number)
pg_current_wal_lsn() – current WAL position on the primary.
sent_lsn – position sent to the standby.
replay_lsn – position applied on the standby.
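LSNs are byte positions in the WAL stream, so lag in bytes is simple subtraction; pg_wal_lsn_diff() does the arithmetic, e.g.:
psql -U postgres -c "SELECT pg_wal_lsn_diff('0/3000170', '0/3000000');"
# returns 368 -- the second LSN is 0x170 = 368 bytes behind the first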
Replication Slots
Prevent primary from removing WAL needed by a standby.
Physical slot: slot_type = physical.
Logical slot: slot_type = logical.
Long‑offline standby can cause WAL accumulation.
Timeline ID
Incremented on each failover (primary promotion).
Different timelines cannot replicate directly; standby must be rebuilt.
Why Does Replication Lag Occur?
Primary writes faster than standby can apply (high TPS).
Network bandwidth or latency issues (cross‑region replication).
Standby resource bottlenecks (CPU/IO, lock contention, long queries).
Split‑Brain Risks
Network partition promotes standby to primary while original primary stays up.
Both nodes accept writes → data divergence.
Recovery requires manual selection of a single primary and rebuilding the other.
Common Faults & Debugging
Replication lag > 100 MB
Symptoms: high lag values.
Root causes: primary overload, standby apply bottleneck, network delay.
Quick fix: pause some writes, kill long queries on standby.
Permanent fix: upgrade standby hardware, optimize network, enable WAL compression.
Standby cannot connect
Symptoms: no rows in pg_stat_replication.
Root causes: standby not started, network blocked, missing pg_hba.conf entry.
Quick fix: start standby, open firewall.
Permanent fix: configure auto‑restart, adjust pg_hba.conf.
WAL directory too large
Symptoms: du -sh pg_wal/ shows huge size.
Root causes: replication slot full, standby offline long, archive command failure.
Quick fix: drop unused slots; freed WAL is recycled at the next checkpoint.
Permanent fix: set max_slot_wal_keep_size, monitor slot status.
Split‑brain (multiple primaries)
Detection: SELECT pg_is_in_recovery() returns false on more than one node.
Root cause: network partition + automatic failover.
Quick fix: manually pick one primary, demote others.
Permanent fix: use quorum/consensus HA tools (Patroni, Stolon) with a witness node.
Replication slot inactive
Cause: the standby's primary_slot_name does not match any slot on the primary.
Fix: correct primary_slot_name in postgresql.auto.conf and restart the standby (a reload suffices on PG 13+).
Timeline mismatch
Cause: failover created different timeline IDs.
Fix: rebuild standby with pg_basebackup.
Debugging Thought Process (Systematic Checklist)
Diagnostic Command Collection
# Primary checks
psql -U postgres -c "SELECT * FROM pg_stat_replication;"
psql -U postgres -c "SELECT * FROM pg_replication_slots;"
psql -U postgres -c "SELECT pg_current_wal_lsn();"
# Standby checks
psql -U postgres -c "SELECT pg_is_in_recovery();"
psql -U postgres -c "SELECT * FROM pg_stat_wal_receiver;"
psql -U postgres -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
# Network checks
ping -c 10 <standby_ip>
iperf3 -c <primary_ip> -t 10
# Log checks
tail -f /var/log/postgresql/postgresql-15-main.log | grep -E "(replication|recovery|wal)"
# WAL directory checks
du -sh /var/lib/postgresql/15/main/pg_wal
ls -lh /var/lib/postgresql/15/main/pg_wal | tail -20
# Performance checks
iostat -x 1 10
top -H -p "$(pgrep -d',' postgres)"  # -d',' joins the PIDs so top accepts the list
Best Practices
Use replication slots (strongly recommended).
Configure synchronous replication for critical workloads (synchronous_standby_names, synchronous_commit = on).
Monitor key metrics: replication lag, WAL directory size, standby heartbeat.
Perform regular fault‑injection drills (quarterly primary failure, split‑brain recovery).
Optimize WAL settings (see the consolidated sketch after this list): wal_level = replica, max_wal_senders = 10, wal_keep_size = 5GB, max_slot_wal_keep_size = 50GB, wal_compression = on, wal_sender_timeout = 60s.
Standby configuration: hot_standby = on, max_standby_streaming_delay = 300s, wal_receiver_timeout = 60s.
Secure pg_hba.conf for replication connections.
Consider automated HA tools (Patroni, Stolon) for automatic failover and node rebuild.
Implement robust backup strategy: physical base backups + WAL archiving, periodic logical dumps, monthly restore tests.
Cross‑region replication tips: use cascading replication, enable WAL compression, increase timeout parameters.
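The WAL settings recommended above, applied in one pass; a sketch (values are this guide's starting points, tune per workload):
psql -U postgres <<'SQL'
ALTER SYSTEM SET wal_level = 'replica';           -- requires restart
ALTER SYSTEM SET max_wal_senders = 10;            -- requires restart
ALTER SYSTEM SET wal_keep_size = '5GB';
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
ALTER SYSTEM SET wal_compression = on;
ALTER SYSTEM SET wal_sender_timeout = '60s';
SELECT pg_reload_conf();                          -- applies the reloadable settings
SQL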
FAQ
Q1: What is the difference between physical and logical replication? A: Physical replication copies the entire database cluster (read‑only standby). Logical replication copies selected tables or databases and can allow writes on the subscriber if there are no conflicts. This guide focuses on physical replication.
Q2: How much replication lag is considered normal? A: Normal < 100 MB (< 5 s); warning 100 MB – 1 GB (5 s – 1 min); severe > 1 GB (> 1 min).
Q3: Can a standby write data? A: A physical standby is read‑only (hot_standby = on). Attempts to write return “cannot execute INSERT in a read-only transaction”.
Q4: How to quickly tell if a node is primary or standby? A: Run SELECT pg_is_in_recovery(); – f = primary, t = standby.
Q5: Difference between replication slots and wal_keep_size? A: wal_keep_size retains a minimum amount of WAL on the primary when no slots exist. Replication slots retain WAL for a specific standby, preventing premature removal. Using slots is the recommended approach.
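A slot set up end to end (the slot name standby1_slot is the example used elsewhere in this guide):
# On the primary: create the physical slot
psql -U postgres -c "SELECT pg_create_physical_replication_slot('standby1_slot');"
# On the standby: point the WAL receiver at it, then restart (a reload works on PG 13+)
psql -U postgres -c "ALTER SYSTEM SET primary_slot_name = 'standby1_slot';"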
Q6: How to promote a standby to primary? A: Use one of the following:
# pg_ctl promote
pg_ctl promote -D /var/lib/postgresql/15/main
# Trigger file (requires promote_trigger_file to be configured; removed in PG 16)
touch /tmp/postgresql.trigger.5432
# SQL function (PG 12+)
SELECT pg_promote();
Q7: How to prevent split‑brain? A: Deploy HA tools like Patroni or Stolon that use a distributed lock (etcd/Consul), configure a witness node, and require manual confirmation before promotion.
Q8: What problems arise from WAL accumulation? A: Disk exhaustion, PostgreSQL PANIC errors (cannot write WAL), possible system crash.
Q9: Fast way to rebuild a standby? A:
# Stop standby
pg_ctl stop -D /var/lib/postgresql/15/main
# Clean data directory
rm -rf /var/lib/postgresql/15/main/*
# Re‑basebackup from primary
pg_basebackup -h <primary_ip> -U replica -D /var/lib/postgresql/15/main -Fp -Xs -P -R
# Start standby
pg_ctl start -D /var/lib/postgresql/15/main
Q10: Does replication lag cause data loss? A: In asynchronous replication, yes – un‑synced transactions may be lost if the primary fails. Synchronous replication waits for standby acknowledgment, preventing loss at the cost of higher latency.
Further Reading
Official PostgreSQL replication docs: https://www.postgresql.org/docs/current/high-availability.html
WAL configuration: https://www.postgresql.org/docs/current/runtime-config-wal.html
Replication slots: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-SLOTS
Deep technical blog: https://www.cybertec-postgresql.com/en/
Patroni HA solution: https://github.com/zalando/patroni
Tools & resources: pg_stat_statements, pgBadger, Bucardo (logical multi‑master)
Generated on 2025‑01‑15 using PostgreSQL 15.5 on Ubuntu 22.04 / RHEL 9.3.