
How GitLab Achieved a Near-Perfect PostgreSQL 9.6→11 Upgrade

In May 2020, GitLab partnered with OnGres to upgrade a 12‑node PostgreSQL 9.6 cluster to version 11. The upgrade relied on a carefully planned pg_upgrade process, automated Ansible playbooks, Patroni HA, and a detailed rollback strategy to keep a 6 TB dataset consistent on a cluster that serves 300,000 transactions per second.

dbaplus Community

Why Upgrade PostgreSQL

GitLab 13.0 discontinued support for PostgreSQL 10, and PostgreSQL 9.6 was scheduled to reach end‑of‑life in November 2021. Together these deadlines prompted a migration to PostgreSQL 11 to maintain service continuity for millions of users.

Key Differences Between PostgreSQL 9.6 and 11

Table partitioning with LIST, RANGE, and HASH.

Transaction control inside stored procedures written in procedural languages.

Just‑in‑time (JIT) compilation for faster query execution.

Parallel query execution and enhanced parallel DDL.

Logical replication framework inherited from version 10.

Quorum‑based commit for synchronous replication.

Improved performance on partitioned tables.

Environment and Architecture

The production cluster runs on Google Cloud Platform with twelve n1‑highmem‑96 instances (each with 96 CPU cores and 614 GB of RAM) serving OLTP workloads, plus two BI nodes. Patroni manages HA, including leader election and failover, while Consul provides DNS‑based service discovery.

The cluster processes roughly 181,000 transactions per second on average, peaking at 250,000 TPS, with up to 60,000 concurrent connections.
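The article does not say how these throughput figures were measured. One common approach, shown here as a minimal sketch, is to sample the cumulative `xact_commit` and `xact_rollback` counters from `pg_stat_database` twice and divide the difference by the elapsed time. The `XactSample` type and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XactSample:
    """One reading of cumulative transaction counters (hypothetical helper type)."""
    taken_at: float      # seconds, e.g. from time.monotonic()
    xact_commit: int     # cumulative commits, as in pg_stat_database.xact_commit
    xact_rollback: int   # cumulative rollbacks, as in pg_stat_database.xact_rollback


def transactions_per_second(first: XactSample, second: XactSample) -> float:
    """Average TPS between two samples, counting commits and rollbacks."""
    elapsed = second.taken_at - first.taken_at
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    delta = (second.xact_commit - first.xact_commit) + (
        second.xact_rollback - first.xact_rollback
    )
    return delta / elapsed


# Illustrative numbers only, not measurements from the GitLab cluster.
before = XactSample(taken_at=0.0, xact_commit=1_000_000, xact_rollback=5_000)
after = XactSample(taken_at=10.0, xact_commit=2_800_000, xact_rollback=15_000)
print(transactions_per_second(before, after))
```

A cluster-wide figure would sum the per-node results, since each node reports only its own counters.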

Upgrade Requirements

No performance regression on PostgreSQL 11; a custom benchmark suite was used to detect query regressions.

Full upgrade must be completed within a maintenance window.

Use pg_upgrade based on physical file migration, not logical dump/restore.

Retain a sample of the 9.6 cluster for rollback.

Automation is mandatory to minimise human error.

Maintenance window limited to 30 minutes.

All steps must be documented and published.

Project Phases

Phase 1 – Closed‑environment automation development

Develop Ansible playbooks and test them on a staged PostgreSQL backup.

Phase 2 – Staging integration of upgrade and configuration management

Integrate configuration management with Chef and create snapshot‑based rollbacks.

Notify users about the upcoming maintenance.

Run end‑to‑end tests in staging.

Phase 3 – End‑to‑end upgrade testing in staging

Pre‑upgrade checks, stop all traffic, and run pg_upgrade on selected nodes.

Validate data consistency with automated tests.

Rollback to 9.6 if necessary and prepare nodes for the next test run.

Phase 4 – Production upgrade

Pre‑upgrade checks and announce maintenance start.

Run Ansible playbooks to stop traffic, stop HAProxy, and halt middleware (Sidekiq, Workhorse, Web/API).

Execute pg_upgrade on leader and secondary nodes using hard‑link mode to avoid copying 6 TB of data.

Collect post‑upgrade metrics, sync configuration with Chef, and verify cluster health.

Take GCP snapshots and, if needed, perform rollback steps.
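The production phase is an ordered sequence in which a failed step should stop everything after it and hand control to the rollback path. GitLab implemented this with Ansible playbooks; the Python below is only a sketch of that control flow, with hypothetical step names and placeholder actions.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_steps(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that reports failure.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder actions; real ones would call out to the automation tooling.
phase4: List[Step] = [
    ("pre-upgrade checks", lambda: True),
    ("stop traffic and middleware", lambda: True),
    ("pg_upgrade in hard-link mode", lambda: False),  # simulated failure
    ("collect metrics and sync configuration", lambda: True),
    ("take snapshots", lambda: True),
]

done, failed = run_steps(phase4)
if failed is not None:
    print(f"'{failed}' failed after {len(done)} completed steps; begin rollback")
```

Returning the list of completed steps lets the rollback logic know which changes need to be undone.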

pg_upgrade Mechanics

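pg_upgrade upgrades data files in place, avoiding a full dump/reload and reducing downtime. The upgrade was performed in hard‑link mode on the leader node. Because the old and new clusters share the same data files in this mode, the old cluster cannot be safely used once the new one has been started, so a 9.6 backup and GCP snapshots were kept as the rollback path.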
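As a rough illustration of a hard‑link upgrade, the sketch below builds a `pg_upgrade` command line, first with `--check` (a dry run that reports incompatibilities without changing data) and then for the real run. The flags are standard `pg_upgrade` options; the directory paths are assumptions, and the article does not show the exact invocation GitLab used.

```python
import shlex
from typing import List


def pg_upgrade_command(
    old_bindir: str,
    new_bindir: str,
    old_datadir: str,
    new_datadir: str,
    *,
    check_only: bool,
) -> List[str]:
    """Build a pg_upgrade invocation that uses hard links instead of copying files."""
    cmd = [
        "pg_upgrade",
        "--old-bindir", old_bindir,
        "--new-bindir", new_bindir,
        "--old-datadir", old_datadir,
        "--new-datadir", new_datadir,
        "--link",  # hard-link data files rather than copying them
    ]
    if check_only:
        cmd.append("--check")  # only verify compatibility; modifies nothing
    return cmd


# Hypothetical Debian-style paths; adjust to the actual installation layout.
paths = dict(
    old_bindir="/usr/lib/postgresql/9.6/bin",
    new_bindir="/usr/lib/postgresql/11/bin",
    old_datadir="/var/lib/postgresql/9.6/main",
    new_datadir="/var/lib/postgresql/11/main",
)
dry_run = pg_upgrade_command(**paths, check_only=True)
real_run = pg_upgrade_command(**paths, check_only=False)
print(shlex.join(dry_run))
print(shlex.join(real_run))
```

Running the `--check` variant ahead of the maintenance window surfaces most incompatibilities before any downtime is spent.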

External extensions (e.g., postgres_exporter) were removed before the upgrade and re‑installed afterward to ensure binary compatibility.

Regression Testing Benchmark

Before the production upgrade, a regression benchmark using JMeter was run on both the PostgreSQL 9.6 and PostgreSQL 11 clusters, with results stored in pg_stat_statements for performance comparison.
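A comparison of this kind can be reduced to checking, query by query, whether the mean execution time on the new version exceeds the old one by more than some tolerance. The sketch below assumes the per-query mean times have already been exported (for example from `pg_stat_statements`) into dictionaries; the query labels, timings, and the 10% tolerance are hypothetical and not taken from GitLab's benchmark.

```python
from typing import Dict, List, Tuple


def find_regressions(
    baseline_ms: Dict[str, float],
    candidate_ms: Dict[str, float],
    tolerance: float = 0.10,
) -> List[Tuple[str, float, float]]:
    """Return (query, old_ms, new_ms) for queries slower than baseline by more than tolerance."""
    regressions = []
    for query, old in baseline_ms.items():
        new = candidate_ms.get(query)
        if new is None or old <= 0:
            continue  # query missing on the candidate, or no usable baseline
        if new > old * (1 + tolerance):
            regressions.append((query, old, new))
    # Worst relative slowdown first.
    return sorted(regressions, key=lambda r: r[2] / r[1], reverse=True)


# Illustrative mean execution times in milliseconds.
pg96 = {"q1": 2.0, "q2": 10.0, "q3": 5.0}
pg11 = {"q1": 1.5, "q2": 14.0, "q3": 5.2}
slower = find_regressions(pg96, pg11)
for query, old, new in slower:
    print(f"{query}: {old} ms -> {new} ms")
```

With these sample values only `q2` is reported, because `q3` is slower by less than the tolerance.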

Automation and Tooling

All automation was implemented with Ansible 2.9 playbooks, Terraform, and Chef. Two main playbooks were used:

One to control traffic and stop services (Cloudflare maintenance mode, HAProxy, Sidekiq, Workhorse, Web/API).

Another to execute the upgrade steps, coordinate Patroni and Consul, run pg_upgrade, collect statistics, sync configuration, and handle snapshots and possible rollback.

Pre‑Upgrade Preparation

Prior to the upgrade, a subset of the 12‑node cluster was reserved for rollback (four 9.6 nodes). Services dependent on PostgreSQL (PgBouncer, Chef client, Patroni) were stopped, and a consistent GCP snapshot was taken.

Upgrade Execution

Stop all nodes.

Run version checks and ensure no traffic is accepted.

Execute pg_upgrade on the leader, then rsync data to replicas.

Restart Patroni to bring replicas into the new cluster.

Install new binaries via Chef and re‑enable extensions.

Resume traffic, run vacuum, and restart PgBouncer and Chef client.
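For the step that rsyncs data to replicas after a hard‑link upgrade, the PostgreSQL documentation describes an rsync invocation that transfers the hard‑link structure rather than the full data set. The sketch below builds that command; the host name and directory paths are hypothetical, and the article does not state the exact flags GitLab's playbooks used.

```python
from typing import List


def standby_rsync_command(
    old_cluster_dir: str,
    new_cluster_dir: str,
    standby_host: str,
    remote_parent_dir: str,
    *,
    dry_run: bool = False,
) -> List[str]:
    """Build the rsync call from the pg_upgrade documentation's standby recipe."""
    cmd = [
        "rsync",
        "--archive",
        "--delete",
        "--hard-links",        # recreate the hard links on the standby
        "--size-only",         # compare files by size, not timestamp
        "--no-inc-recursive",  # build the full file list before transferring
    ]
    if dry_run:
        cmd.append("--dry-run")  # show what would be transferred, change nothing
    cmd += [old_cluster_dir, new_cluster_dir, f"{standby_host}:{remote_parent_dir}"]
    return cmd


# Hypothetical layout: both versions live under /var/lib/postgresql.
cmd = standby_rsync_command(
    "/var/lib/postgresql/9.6",
    "/var/lib/postgresql/11",
    "replica-01.example.internal",
    "/var/lib/postgresql",
    dry_run=True,
)
print(" ".join(cmd))
```

Both the old and new cluster directories are passed so that rsync can detect files that are hard‑linked between them, which is what keeps the transfer small.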

Migration Day

The upgrade window began at 08:45 UTC on a Sunday, with a maximum of two hours of downtime. The entire process took four hours, including two hours of service interruption. A video of the full upgrade was published on GitLab Unfiltered.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
