Migrating PostgreSQL 9.6 to 12.4 with Minimal Downtime at Coffee Meets Bagel
This article describes how Coffee Meets Bagel upgraded a six-server PostgreSQL 9.6 cluster to version 12.4 on AWS using pglogical logical replication, with less than 30 minutes of total downtime. It covers the architecture, the migration steps, and the lessons learned.
In November 2020 Coffee Meets Bagel began a large‑scale migration, upgrading six PostgreSQL servers (one primary, three read‑only replicas behind HAProxy, one async worker, and one ETL/BI node) from version 9.6 to 12.4 on AWS i3.8xlarge instances.
The existing cluster was aging, with a primary that had been online for over three years, disk usage approaching 75 % on a 7.6 TB NVMe volume, and performance issues such as CPU spikes and SSH unresponsiveness.
Key requirements were to keep cumulative downtime under four hours, build a new cluster on larger i3.16xlarge instances, and replace the old servers without data loss.
After discarding backup‑restore (the 5.7 TB dataset would take too long) and rejecting pg_upgrade (it is in‑place and does not satisfy the new‑instance requirement), the team chose logical replication with pglogical.
A new PostgreSQL 12 server was created as the target, and pglogical streamed all data to it from the old primary. Streaming replicas of the new server were then added one by one; each was put into rotation in HAProxy while the corresponding old replica was retired, until the old primary was the only original server still in service.
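The replication setup described above can be sketched with pglogical's SQL functions. The Python helper below only builds the statements as strings, so they can be reviewed before being run with psql. The node names, subscription name, DSNs, and the choice of the built-in default replication set over the public schema are illustrative assumptions, not details taken from the migration.

```python
from typing import List


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def provider_statements(node_name: str, dsn: str, schema: str = "public") -> List[str]:
    """SQL for the old 9.6 primary, which acts as the pglogical provider."""
    return [
        "CREATE EXTENSION IF NOT EXISTS pglogical;",
        "SELECT pglogical.create_node("
        f"node_name := {quote_literal(node_name)}, dsn := {quote_literal(dsn)});",
        # Assumption: every table in one schema goes into the 'default' set.
        "SELECT pglogical.replication_set_add_all_tables("
        f"'default', ARRAY[{quote_literal(schema)}]);",
    ]


def subscriber_statements(node_name: str, dsn: str,
                          subscription_name: str, provider_dsn: str) -> List[str]:
    """SQL for the new 12.4 server, which subscribes to the old primary."""
    return [
        "CREATE EXTENSION IF NOT EXISTS pglogical;",
        "SELECT pglogical.create_node("
        f"node_name := {quote_literal(node_name)}, dsn := {quote_literal(dsn)});",
        # Creating the subscription starts the initial copy, then streaming.
        "SELECT pglogical.create_subscription("
        f"subscription_name := {quote_literal(subscription_name)}, "
        f"provider_dsn := {quote_literal(provider_dsn)});",
    ]


if __name__ == "__main__":
    # Hypothetical hosts and database name, for illustration only.
    old_dsn = "host=old-primary dbname=app"
    new_dsn = "host=new-primary dbname=app"
    for stmt in provider_statements("old_primary", old_dsn):
        print(stmt)
    for stmt in subscriber_statements("new_primary", new_dsn, "upgrade_sub", old_dsn):
        print(stmt)
```

The schema itself is not copied by these calls in this sketch; it assumes the table definitions already exist on the new server, for example from a schema-only dump.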
The final cut-over was performed during a maintenance window: the site was placed in maintenance mode, the DNS record for the primary was pointed at the new server's IP, primary key sequences were forced to sync, a manual checkpoint was run on the old primary, data validation tests were executed, and the site was brought back online.
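The database-side part of that cut-over (the sequence sync and the checkpoint) could look roughly like the sketch below. The article says only that sequences were "forced to sync"; using pglogical.synchronize_sequence for this is an assumption, as are the sequence names. Maintenance mode, the DNS change, and data validation happen outside the database and are not modeled here.

```python
from typing import Iterable, List


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def cutover_statements(sequences: Iterable[str]) -> List[str]:
    """Build the SQL to run on the old primary once writes have stopped.

    Assumes each sequence was already added to a pglogical replication set;
    synchronize_sequence then pushes its current state to the subscriber.
    """
    statements = [
        f"SELECT pglogical.synchronize_sequence({quote_literal(name)}::regclass);"
        for name in sequences
    ]
    # Manual checkpoint on the old primary, as described in the cut-over steps.
    statements.append("CHECKPOINT;")
    return statements


if __name__ == "__main__":
    # Hypothetical sequence names, for illustration only.
    for stmt in cutover_statements(["public.users_id_seq", "public.matches_id_seq"]):
        print(stmt)
```

Building the statements as a list keeps the order explicit, so the same list can be printed for a runbook or fed to a script during the maintenance window.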
Two lessons stand out. The first is the danger of a slow initial synchronization: the initial COPY of a 4 TB table generated about 1 TB of WAL on the old primary. For the second attempt the team dropped indexes, disabled fsync, increased max_wal_size to 50 GB, and set checkpoint_timeout to 1 hour, which reduced sync time to under eight hours. The second is the flood of conflict logs generated by pglogical, which was mitigated by setting pglogical.conflict_log_level = DEBUG in postgresql.conf.
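The settings named above can be written out as a postgresql.conf fragment. The sketch below renders them from Python dictionaries; the values come from the article, while the grouping into two fragments and the idea that the sync settings are applied on the server receiving the initial copy are assumptions. Running with fsync off risks data corruption after a crash, so it would normally be turned back on once the initial sync has finished.

```python
from typing import Dict

# Values reported in the article for the second sync attempt.
INITIAL_SYNC_SETTINGS: Dict[str, str] = {
    "fsync": "off",              # unsafe after a crash; temporary only
    "max_wal_size": "50GB",
    "checkpoint_timeout": "1h",
}

# Setting reported in the article to quiet pglogical conflict logging.
CONFLICT_LOG_SETTINGS: Dict[str, str] = {
    "pglogical.conflict_log_level": "DEBUG",
}


def render_conf(settings: Dict[str, str]) -> str:
    """Render settings as postgresql.conf lines, one 'name = value' per line."""
    lines = [f"{name} = '{value}'" for name, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_conf(INITIAL_SYNC_SETTINGS), end="")
    print(render_conf(CONFLICT_LOG_SETTINGS), end="")
```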
Overall, the migration succeeded without unexpected downtime, demonstrating a reliable approach for large PostgreSQL version upgrades.