GitLab Database Deletion Incident: Lessons on Backup, Operations, and High‑Availability Design

The article recounts a GitLab production database deletion caused by a mistaken command, analyzes why the backup mechanisms failed, and offers technical and cultural recommendations—including automation, proper replication, and transparent post‑mortems—to build more reliable, high‑availability systems.

Qunar Tech Salon

On the fifth day of the Chinese New Year, a GitLab.com operator mistakenly deleted the production PostgreSQL database while trying to fix a staging replica, exposing a cascade of operational failures and data loss.

The post‑mortem reviews the incident timeline: a DDoS‑induced load spike, a lagging staging replica, an erroneous rm command run against the primary cluster, and the subsequent recovery that could only use a six‑hour‑old staging copy.

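Five backup mechanisms were examined and all were found inadequate: a staging sync that did not carry over webhooks, no disk snapshots for the database servers, pg_dump backups that failed because of a client/server version mismatch, S3 backups that turned out to be absent, and brittle, hand-written scripts with poor documentation.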
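The version-mismatch failure is the kind of problem a pre-flight check can catch before a backup job runs. The sketch below is illustrative only and is not taken from GitLab's tooling: the function names and the example version banners are assumptions. It relies on the documented rule that pg_dump cannot dump from a server whose major version is newer than its own.

```python
import re


def major_version(version_text: str) -> tuple:
    """Extract the PostgreSQL major version from a version banner.

    Before PostgreSQL 10 the major version has two parts (for example 9.6);
    from 10 onward it is the first number only.
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    first = int(match.group(1))
    if first >= 10:
        return (first,)
    if match.group(2) is None:
        raise ValueError(f"incomplete pre-10 version in {version_text!r}")
    return (first, int(match.group(2)))


def client_can_dump(client_banner: str, server_banner: str) -> bool:
    """Return True when the pg_dump client is at least as new as the server."""
    return major_version(client_banner) >= major_version(server_banner)


if __name__ == "__main__":
    # Hypothetical banners: an old client pointed at a newer server.
    ok = client_can_dump("pg_dump (PostgreSQL) 9.2.18", "PostgreSQL 9.6.1")
    print("safe to run pg_dump:", ok)
```

A backup job could call `client_can_dump` first and raise an alert instead of silently writing an empty or truncated dump.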

External experts from 2ndQuadrant suggested improvements such as fixing the PostgreSQL 9.6 replication problems, lowering the max_wal_senders and max_connections settings, using pg_basebackup correctly, adopting automation tools such as repmgr and barman, and regularly testing backup-and-restore procedures.
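One way to read "using pg_basebackup correctly" is to stop clearing the target directory by hand. pg_basebackup requires that directory to be empty or absent, so a wrapper can refuse to proceed rather than delete anything. The sketch below is a minimal illustration under that assumption. The function name, host name, and user name are hypothetical, and it only builds the command. The flags used (`-h`, `-U`, `-D`, `-X stream`, `-c fast`, `-P`, `-v`) are standard pg_basebackup options.

```python
from pathlib import Path


def build_basebackup_command(primary_host: str, replication_user: str,
                             target_dir: str) -> list:
    """Build a pg_basebackup command, refusing a non-empty target directory.

    The wrapper never deletes data: if the target already has content,
    a human has to decide what to do with it.
    """
    target = Path(target_dir)
    if target.exists() and any(target.iterdir()):
        raise RuntimeError(f"{target} is not empty; refusing to continue")
    return [
        "pg_basebackup",
        "-h", primary_host,       # primary server to copy from
        "-U", replication_user,   # role with replication privileges
        "-D", str(target),        # directory to receive the base backup
        "-X", "stream",           # stream WAL while the backup runs
        "-c", "fast",             # request an immediate checkpoint
        "-P",                     # show progress
        "-v",                     # verbose output
    ]


if __name__ == "__main__":
    # Hypothetical host, user, and path, for illustration only.
    cmd = build_basebackup_command("db-primary.example", "replicator",
                                   "/tmp/example-empty-target")
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on the replica once the target has been reviewed.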

The author reflects on the broader operational culture, arguing that manual, ad‑hoc commands on production indicate weak operational maturity, and that automation, code‑driven changes, and robust continuous‑delivery pipelines are essential.
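As a small illustration of moving destructive steps out of ad-hoc shell sessions and into code, the sketch below gates an operation on the host it is actually running on. It is not GitLab's tooling: the names `WrongHostError` and `require_expected_host` and the example host names are assumptions made for this example.

```python
import socket


class WrongHostError(RuntimeError):
    """Raised when a destructive step is about to run on an unexpected host."""


def require_expected_host(expected_host: str, typed_confirmation: str,
                          actual_host: str = None) -> None:
    """Abort unless the current host matches both the plan and the operator.

    expected_host      -- host named in the change plan or runbook
    typed_confirmation -- host name the operator typed to confirm
    actual_host        -- injectable for tests; defaults to this machine
    """
    actual = socket.gethostname() if actual_host is None else actual_host
    if actual != expected_host:
        raise WrongHostError(
            f"running on {actual!r}, but the plan targets {expected_host!r}")
    if typed_confirmation != actual:
        raise WrongHostError(
            f"confirmation {typed_confirmation!r} does not match {actual!r}")


if __name__ == "__main__":
    try:
        # Hypothetical scenario: the plan targets the replica, but the
        # operator's terminal is attached to the primary.
        require_expected_host("db-replica.example", "db-replica.example",
                              actual_host="db-primary.example")
    except WrongHostError as err:
        print("blocked:", err)
```

A guard like this only helps when the destructive step lives in a reviewed script or pipeline job; it cannot protect a command typed directly into a production shell.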

Backup systems must be continuously live and part of a distributed, highly available architecture; otherwise, any failure—whether human error, power loss, disk crash, or malware—will still cause data loss.
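A backup that is "continuously live" has to be checked continuously as well. The following sketch shows one possible freshness and size check on a backup directory. The function name, the thresholds, and the directory layout are assumptions, and a real setup would feed the result into whatever alerting system is already in place.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir: str, max_age_seconds: float,
                        min_size_bytes: int, now: float = None) -> list:
    """Return a list of problems with the newest backup file (empty if none).

    Flags three conditions: no backup files at all, a newest backup that is
    older than max_age_seconds, or one smaller than min_size_bytes.
    """
    now = time.time() if now is None else now
    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        return [f"no backup files found in {backup_dir}"]

    latest = max(files, key=lambda p: p.stat().st_mtime)
    info = latest.stat()
    problems = []
    age = now - info.st_mtime
    if age > max_age_seconds:
        problems.append(f"{latest.name} is stale: {age:.0f}s old")
    if info.st_size < min_size_bytes:
        problems.append(f"{latest.name} is too small: {info.st_size} bytes")
    return problems


if __name__ == "__main__":
    # Hypothetical thresholds: at most one day old, at least 1 MiB.
    for problem in check_latest_backup(".", 24 * 3600, 1024 * 1024):
        print("ALERT:", problem)
```

A check like this catches silent failures such as empty dump files, but it does not replace an actual restore test.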

Transparent incident disclosure, such as GitLab’s public blog and live streams, is highlighted as a best practice for building trust and enabling community‑driven improvements.

Tags: Operations, Database, High Availability, gitlab, PostgreSQL, Backup, postmortem
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
