Achieving Zero Data Loss and High Availability in Finance with PostgreSQL Replication
This article explains how PostgreSQL streaming replication can deliver zero data loss, high availability, and cross‑site disaster recovery for financial institutions. It covers feedback and consistency metrics, a simplified architecture model, a configuration example, performance impact, failover rules, operational flexibility, and cost.
Background
Financial institutions require zero data loss, high availability, and cross‑site disaster recovery. PostgreSQL’s physical streaming replication can satisfy these requirements when configured with appropriate synchronous‑standby groups and XLOG (write‑ahead log) replication.
Reliability (Feedback) Metrics
L1 : Standby receives WAL (REDO) and writes it to the operating system (not yet flushed to disk).
L2 : Standby receives WAL and flushes it to disk.
L3 : Standby receives WAL, flushes it to disk, and replays the changes so they become visible to queries.
Higher‑level metrics increase transaction latency but provide stronger durability guarantees.
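In stock PostgreSQL, these three feedback levels correspond roughly to the synchronous_commit settings remote_write, on, and remote_apply. A minimal postgresql.conf sketch (the standby names are hypothetical examples):

```
# postgresql.conf on the primary (sketch; standby names are placeholders)
synchronous_commit = on              # L2: wait for the standby to flush WAL to disk
# synchronous_commit = remote_write  # L1: wait only for the WAL write on the standby
# synchronous_commit = remote_apply  # L3: wait until the standby has replayed the WAL
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
```

Note that synchronous_commit can also be changed per session or per transaction, so latency‑sensitive and durability‑sensitive workloads can coexist on one primary.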
Consistency Metrics
Transaction commit is reported to the client only after the configured acknowledgment condition is met:
At least one standby in the configured strong‑sync group acknowledges receipt of the WAL, or
A configurable number n of standbys acknowledge (n=0 for asynchronous replication, n=1 for one‑standby strong sync, n=2 for two‑standby strong sync, and so on).
If an asynchronous standby lags behind the primary by more than a configured number of seconds or bytes, the primary can optionally be forced into read‑only mode.
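Lag‑triggered read‑only switching is not built into core PostgreSQL, but the lag values such a watchdog would act on are exposed in the pg_stat_replication view. A monitoring query along these lines, run on the primary (PostgreSQL 10+), could drive that logic:

```sql
-- Byte and time lag per standby, as seen from the primary
SELECT application_name,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag                                        AS replay_lag_time
FROM pg_stat_replication;
```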
Architecture Simplification
Zero data loss is guaranteed as long as at least one synchronized copy of the WAL survives. The topology can be expressed as multiple XLOG groups, each representing a data‑center (local or remote). Within each group at least one XLOG copy must be synchronized.
Local strong‑sync mode: at least one XLOG replica synchronized inside the local data‑center.
Remote disaster‑recovery strong‑sync mode: at least one XLOG replica synchronized to the remote data‑center.
Data‑replica recommendation: two copies in the local site, one copy in the remote site (additional copies can be added for higher reliability).
Configuration Example
Typical deployment for a high‑availability financial environment:
Local data‑center : 1 primary, 3 standbys, 10 GbE interconnect.
Feedback metrics: two standbys configured as L1, one standby as L2.
Consistency requirement: the primary waits for at least one synchronous acknowledgment before reporting commit to the client.
Remote data‑center : 2 standbys directly connected to the primary, forming a strong‑sync group with a minimum of one acknowledgment; all standbys use L1 feedback.
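One way to express "at least one synchronous acknowledgment across the local and remote standbys" in core PostgreSQL is a quorum‑based synchronous_standby_names (available since version 10). This is a sketch under that assumption; the standby names are hypothetical, and note that core PostgreSQL applies one feedback level to the whole synchronous group rather than per standby:

```
# postgresql.conf on the primary (sketch)
# local1..local3 are the local standbys, remote1/remote2 the remote ones
synchronous_standby_names = 'ANY 1 (local1, local2, local3, remote1, remote2)'
synchronous_commit = remote_write   # L1-style feedback for the quorum
```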
Problems Addressed
Provides two strong‑sync replicas, one of which resides in a remote site.
Local site can host three standbys without impacting primary write performance.
Remote site can host a standby without affecting primary operations.
Performance Impact
Read queries are unaffected because they can be served by any standby.
Write latency is dominated by network round‑trip; in a 10 GbE environment the additional latency is typically < 1 ms.
Physical streaming replication keeps primary‑standby lag at the millisecond level, avoiding the larger delays seen with logical replication.
A failover or planned switchover usually completes within 25–45 seconds, depending on network jitter and load.
HA failover can be transparent to applications when a proxy (e.g., PgBouncer) is used; only session‑level state such as bound variables may need re‑initialisation.
Primary Re‑election Rules
Select the highest‑priority standby that has already reached synchronous mode.
If all standbys are asynchronous, choose the one with the smallest replication lag.
If lag is equal, fall back to the configured priority order.
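These rules can be checked by comparing each candidate's replication position. A hedged sketch of the queries a failover manager might run on each candidate standby:

```sql
-- Run on each candidate standby during re-election
SELECT pg_last_wal_receive_lsn() AS received_lsn,
       pg_last_wal_replay_lsn()  AS replayed_lsn,
       pg_is_in_recovery()       AS is_standby;
-- The candidate with the highest received_lsn has the smallest lag;
-- it can then be promoted with: SELECT pg_promote();  (PostgreSQL 12+)
```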
Flexibility
Standbys can serve read‑only workloads, enabling read/write split load balancing.
Timeline switches and role changes between primary and standby are straightforward.
Standbys can be used for major version upgrade rehearsals, rapid test‑environment provisioning, and sample‑database creation—capabilities not available with shared‑storage solutions.
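For example, a throwaway test environment can be cloned from a standby rather than the primary, keeping the load off production. A sketch using pg_basebackup, where the host, user, and path are placeholders:

```shell
# Clone a test instance from a standby (not the primary)
pg_basebackup -h standby1.example.com -U replication_user \
              -D /var/lib/postgresql/test-clone \
              -X stream -P
```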
Performance vs. Consistency Trade‑off
If the remote site is configured for asynchronous replication, a total loss of the primary site can cause data loss during failover. Two common questions arise:
How to back‑fill data that was not fully synchronized to the remote site?
How to promote the former primary to a standby without a full resynchronization?
PostgreSQL provides two built‑in tools for these cases. Logical decoding can extract the changes committed on the former primary after the remote activation point, allowing application‑level reconciliation of the missing rows. pg_rewind can revert a former primary to a standby within minutes, even for large databases, by rewinding it to the divergence point and then replaying the XLOG generated after the activation point, rather than recopying the whole database. Comparable tooling is unavailable in many commercial databases.
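A hedged sketch of the pg_rewind step; the data directory and connection string are placeholders, and pg_rewind requires that the former primary was shut down cleanly and that wal_log_hints (or data checksums) was enabled beforehand:

```shell
# On the failed former primary, after the new primary is serving traffic
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server='host=new-primary.example.com port=5432 user=postgres'
# Then configure the instance as a standby of the new primary and start it
```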
Cost Estimate (Example: 10 TB Database, 100 GB WAL Retention)
Local site: 2 servers holding the primary and one standby (data copies).
Local XLOG: 3 servers (reuse the two data servers and add a third for WAL replication).
Remote site: 1 server holding a data copy.
Remote XLOG: 2 servers (reuse the remote data server and add a second for WAL replication).
Total hardware: 5 servers, 30 TB of data storage (three full data copies) and 500 GB of WAL storage (five copies).
Resulting benefits:
Two local XLOG receivers; loss of one does not affect primary writes.
Two remote XLOG receivers; loss of one does not affect primary writes.
Automatic failover to a local standby if the primary fails.
Failover to the remote standby if an entire site fails.
ITPUB
