Achieving Zero Data Loss and High Availability in Finance with PostgreSQL Replication
This article explains how PostgreSQL streaming replication can deliver zero data loss, high availability, and disaster recovery for financial systems, covering feedback and consistency metrics, architecture simplification, configuration examples, performance impact, primary election rules, flexibility, and cost.
Background
The financial industry requires high availability, zero data loss, and remote disaster recovery, and has traditionally met these requirements with shared-storage solutions.
Can PostgreSQL Replication Meet These Requirements?
Yes. PostgreSQL's replication-based approach provides flexible, reliable zero-data-loss HA and DR without shared storage.
Feedback Metrics (L‑levels)
L1: the standby has received the REDO and written it to the XLOG buffer.
L2: the standby has received the REDO and persisted it to disk.
L3: the standby has persisted the REDO and replayed it.
Higher L‑levels increase transaction latency; configure according to required reliability.
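In stock PostgreSQL these feedback levels map roughly onto the synchronous_commit parameter; the mapping below is an interpretation of the article's L-levels, not configuration taken from it:

    # postgresql.conf on the primary
    # L1 – standby has received the REDO and written it (not yet flushed):
    synchronous_commit = remote_write
    # L2 – standby has flushed the REDO to durable storage:
    synchronous_commit = on
    # L3 – standby has flushed and replayed the REDO (PostgreSQL 9.6+):
    synchronous_commit = remote_apply

Because synchronous_commit can also be set per session or per transaction, individual workloads can choose their own latency/reliability trade-off.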
Consistency Metrics
The application is notified of commit success or failure only after both of the following conditions are satisfied:
Strong‑sync feedback: at least one standby in a designated strong‑sync group acknowledges.
At least n standbys (n ≥ 0) acknowledge, supporting an arbitrary number of strong-sync replicas.
Asynchronous standby delay (seconds): if set, the primary switches to read-only once standby lag exceeds this many seconds; if unset, time-based lag is ignored.
Asynchronous standby delay (bytes): the same switch-to-read-only behavior, triggered when lag measured in bytes exceeds the threshold.
For remote DR with zero data loss, attach the remote standbys directly to the primary and place them in a single feedback group that requires at least one acknowledgment.
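In stock PostgreSQL, a feedback group that requires at least one acknowledgment can be expressed with quorum-based synchronous_standby_names (FIRST n since 9.6, ANY n since 10); the standby names here are illustrative:

    # postgresql.conf on the primary
    # Commit returns once ANY 1 of the two remote standbys has acknowledged:
    synchronous_standby_names = 'ANY 1 (remote_dr1, remote_dr2)'

    # Each standby announces its name via application_name in primary_conninfo, e.g.:
    # primary_conninfo = 'host=primary_host port=5432 user=repl application_name=remote_dr1'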
Architecture Simplification
Zero data loss is guaranteed by preserving REDO logs; as long as REDO is retained, the system can recover to a consistent state.
The architecture can be simplified to multi‑group XLOG synchronization, where each group represents a data center.
Replication Groups
Local strong-sync mode: XLOG data is synchronized to at least one replica within the local data center.
Remote disaster-recovery strong-sync mode: XLOG data is synchronized to at least one replica in the remote data center.
Recommended replica counts: two replicas in the local data center and one replica in the remote data center (additional replicas increase reliability).
Configuration Example for Financial Use‑Case
Local data center: 1 primary + 3 standbys on 10 GbE links. Feedback metrics: 2 × L1 and 1 × L2. Consistency requirement: the primary must receive more than one acknowledgment. The primary and remote data centers are linked by dual-fiber direct connections.
Remote data center: 2 standbys connected directly to the primary and configured as a strong-sync group requiring at least one acknowledgment; all feedback metrics set to L1.
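Stock PostgreSQL supports only a single synchronous group and sets the feedback level per transaction rather than per standby, so the two-group layout above can only be approximated without patches or middleware; a sketch under those constraints, with illustrative standby names:

    # postgresql.conf on the primary
    # Any 2 of the 5 standbys (3 local + 2 remote) must acknowledge each commit:
    synchronous_standby_names = 'ANY 2 (local1, local2, local3, remote_dr1, remote_dr2)'
    # L1-style (received/written) feedback for all synchronous standbys:
    synchronous_commit = remote_write

Note that this quorum does not force one acknowledgment to come from the remote group, so it is weaker than the per-data-center guarantee described above.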
Problems Solved
Two strong‑sync replicas, one located remotely.
Local standby count can increase without affecting primary operations.
Remote standby count can increase without affecting primary operations.
Performance Impact
Read operations are unaffected.
Write latency increase is minimal; in a 10 GbE environment, additional latency is < 1 ms.
Physical streaming replication keeps standby‑primary lag at millisecond levels, avoiding logical replication delays.
HA failover typically completes within 25–45 seconds, accounting for network jitter and load.
With a proxy in front of the database, client connections remain uninterrupted during failover; only prepared statements (bind variables) may need to be re-created.
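These lag and latency figures can be checked on the primary; a sketch using pg_stat_replication (the interval-typed lag columns require PostgreSQL 10+):

    -- Run on the primary: per-standby transport, flush, and replay lag
    SELECT application_name,
           sync_state,                        -- sync / potential / async / quorum
           pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
           write_lag, flush_lag, replay_lag   -- intervals, PostgreSQL 10+
    FROM pg_stat_replication;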
Primary Re‑Election Rules
Among standbys that have reached synchronous state, select according to the configured priority order.
If all standbys are asynchronous, choose the one with the lowest latency.
If latency is equal, fall back to the configured priority order.
After a new primary is elected, connection relationships are re‑established.
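On a primary that is still reachable, this candidate ordering can be approximated with a query over pg_stat_replication; after a primary loss, the same comparison would use pg_last_wal_replay_lsn() on each standby. A sketch of the first case (PostgreSQL 10+):

    -- Synchronous standbys first, then lowest replay lag, then configured priority
    SELECT application_name, sync_state, sync_priority, replay_lag
    FROM pg_stat_replication
    ORDER BY (sync_state = 'sync') DESC,
             replay_lag ASC NULLS LAST,
             sync_priority ASC;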
Flexibility
Standbys can serve read‑only queries, enabling read/write load balancing.
Timeline switching and role changes are straightforward.
Standbys can be used for major version upgrade rehearsals, rapid test‑environment provisioning, and sample‑database creation—capabilities not possible with shared‑storage solutions.
Performance vs Consistency Trade‑off
If the remote data center is not configured for strong sync, failing over to it may risk data loss. Mitigations (sketched after this list):
Use PostgreSQL logical decoding to capture SQL statements executed after the remote activation point, allowing application‑level data reconciliation.
Use pg_rewind to quickly revert the former primary to a standby; even multi‑terabyte databases can be rewound within minutes.
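Both mitigations rely on stock tooling; a sketch with illustrative host, database, and slot names:

    # 1. Logical decoding: stream the changes executed after the divergence point
    #    so the application can reconcile them (test_decoding is the sample plugin):
    pg_recvlogical -d appdb --slot=reconcile_slot --create-slot --plugin=test_decoding
    pg_recvlogical -d appdb --slot=reconcile_slot --start -f divergent_changes.txt

    # 2. pg_rewind: resynchronize the former primary as a standby of the new primary
    #    (requires wal_log_hints = on or data checksums on the former primary):
    pg_rewind --target-pgdata=/data/pgdata \
              --source-server='host=new_primary port=5432 user=postgres dbname=postgres'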
Cost Estimate
Assuming a single 10 TB database with 100 GB of REDO retention, the hardware requirements are:
Local data center: 2 data copies (primary + one standby) on 2 servers.
Local XLOG services: 3 copies (reuse the two servers plus one additional).
Remote data center: 1 data copy on 1 server.
Remote XLOG services: 2 copies (reuse the remote server plus one additional).
Total: 5 servers, 30 TB of data storage (3 copies × 10 TB) and 500 GB of REDO storage (5 copies × 100 GB).
Operational benefits:
Two local XLOG receivers; loss of one does not affect primary writes.
Two remote XLOG receivers; loss of one does not affect primary writes.
Automatic failover to a local standby if the primary fails.
Failover to the remote standby if an entire data center fails.