How GitLab Achieved a Near-Perfect PostgreSQL 9.6→11 Upgrade
In May 2020 GitLab partnered with OnGres to upgrade a 12‑node PostgreSQL 9.6 cluster to version 11, using a carefully planned pg_upgrade process, automated Ansible playbooks, Patroni HA, and a detailed rollback strategy to keep a 6 TB dataset consistent while serving 300 k transactions per second.
Why Upgrade PostgreSQL
GitLab 13.0 discontinued support for PostgreSQL 10, and PostgreSQL 9.6 reached end‑of‑life in November 2021, prompting a migration to PostgreSQL 11 to maintain service continuity for millions of users.
Key Differences Between PostgreSQL 9.6 and 11
Table partitioning with LIST, RANGE, and HASH.
Procedural language support for transactions.
Just‑in‑time (JIT) compilation for faster query execution.
Parallel query execution and enhanced parallel DDL.
Logical replication framework inherited from version 10.
Quorum‑based commit handling.
Improved performance on partitioned tables.
Environment and Architecture
The production cluster runs on Google Cloud Platform with twelve n1‑highmem‑96 instances (each 96 CPU cores, 614 GB RAM) serving OLTP workloads, plus two BI nodes. Patroni manages HA, while Consul provides DNS‑based service discovery for leader election and replication slots.
The cluster processes roughly 181 000 transactions per second on average, peaking at 250 000 TPS, with up to 60 000 concurrent connections.
Upgrade Requirements
No regression on PostgreSQL 11; custom benchmark suite used to detect query performance regressions.
Full upgrade must be completed within a maintenance window.
Use pg_upgrade based on physical file migration, not logical dump/restore.
Retain a sample of the 9.6 cluster for rollback.
Automation is mandatory to minimise human error.
Maintenance window limited to 30 minutes.
All steps must be documented and published.
Project Phases
Phase 1 – Closed‑environment automation development
Develop Ansible playbooks and test them on a staged PostgreSQL backup.
Phase 2 – Staging integration of upgrade and configuration management
Integrate configuration management with Chef and create snapshot‑based rollbacks.
Notify users about the upcoming maintenance.
Run end‑to‑end tests in staging.
Phase 3 – End‑to‑end upgrade testing in staging
Pre‑upgrade checks, stop all traffic, and run pg_upgrade on selected nodes.
Validate data consistency with automated tests.
Rollback to 9.6 if necessary and prepare nodes for the next test run.
Phase 4 – Production upgrade
Pre‑upgrade checks and announce maintenance start.
Run Ansible playbooks to stop traffic, stop HA‑proxy, and halt middleware (Sidekiq, Workhorse, WEB‑API).
Execute pg_upgrade on leader and secondary nodes using hard‑link mode to avoid copying 6 TB of data.
Collect post‑upgrade metrics, sync configuration with Chef, and verify cluster health.
Take GCP snapshots and, if needed, perform rollback steps.
pg_upgrade Mechanics
pg_upgradeupgrades data files in‑place, avoiding a full dump/reload and reducing downtime. The upgrade was performed in hard‑link mode on the leader node, which required keeping a 9.6 backup and GCP snapshots as a rollback path.
External extensions (e.g., postgres_exporter) were removed before the upgrade and re‑installed afterward to ensure binary compatibility.
Regression Testing Benchmark
Before the production upgrade, a regression benchmark using JMeter was run on both PostgreSQL 9.6 and 11 clusters, storing results in pg_stat_statement for performance comparison.
Automation and Tooling
All automation was implemented with Ansible 2.9 playbooks, Terraform, and Chef. Two main playbooks were used:
One to control traffic and stop services (Cloudflare maintenance mode, HA‑proxy, Sidekiq, Workhorse, WEB‑API).
Another to execute the upgrade steps, coordinate Patroni and Consul, run pg_upgrade, collect statistics, sync configuration, and handle snapshots and possible rollback.
Pre‑Upgrade Preparation
Prior to the upgrade, a subset of the 12‑node cluster was reserved for rollback (four 9.6 nodes). Services dependent on PostgreSQL (PgBouncer, Chef client, Patroni) were stopped, and a consistent GCP snapshot was taken.
Upgrade Execution
Stop all nodes.
Run version checks and ensure no traffic is accepted.
Execute pg_upgrade on the leader, then rsync data to replicas.
Restart Patroni to bring replicas into the new cluster.
Install new binaries via Chef and re‑enable extensions.
Resume traffic, run vacuum, and restart PgBouncer and Chef client.
Migration Day
The upgrade window began at 08:45 UTC on a Sunday, with a maximum of two hours of downtime. The entire process took four hours, including two hours of service interruption. A video of the full upgrade was published on GitLab Unfiltered.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
