
When 310 GB Vanished: GitLab’s Backup Failure and What It Teaches Us

A GitLab.com database accident caused the loss of 310 GB of data, exposing multiple failed backup mechanisms and prompting a detailed analysis of technical, operational, and managerial lessons for reliable data protection.

dbaplus Community

On January 31, 2017, a GitLab.com engineer ran rm -rf on what he believed was an empty directory, only to realize he was deleting a 310 GB data folder on the production database server. Within seconds, 99% of the data was gone.

None of the existing backup mechanisms could be relied on: the scheduled pg_dump job had been failing silently since the PostgreSQL upgrade from 9.2 to 9.6; the most recent LVM snapshot was about 6 hours old, so it did not contain the latest data; Azure disk snapshots had never been enabled for the database servers; the S3 backups appeared incomplete; and the automatic sync (replication) procedure was unstable.
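
A backup job that fails silently tends to leave either no file at all or a suspiciously small one, so a periodic freshness and size check can surface the problem early. The following is a minimal sketch in Python, not GitLab's tooling: the function name check_latest_backup, the 24-hour age limit, and the 1 MB size floor are all assumptions chosen for illustration.

```python
import time
from pathlib import Path


def check_latest_backup(backup_dir, max_age_hours=24.0, min_bytes=1_000_000, now=None):
    """Return a list of problems with the newest file in backup_dir (empty list = OK).

    The thresholds are illustrative assumptions; tune them to the real dump size
    and schedule.
    """
    now = time.time() if now is None else now
    directory = Path(backup_dir)
    if not directory.is_dir():
        return [f"backup directory missing: {directory}"]

    files = [p for p in directory.iterdir() if p.is_file()]
    if not files:
        return [f"no backup files found in {directory}"]

    # Only the most recent dump matters for this check.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    info = newest.stat()

    problems = []
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{newest.name} is {age_hours:.1f} hours old")
    if info.st_size < min_bytes:
        problems.append(f"{newest.name} is only {info.st_size} bytes")
    return problems
```

Run from a scheduler, a non-empty result would be sent to whatever alerting channel the team already watches. A check like this only proves that a plausible file exists; a periodic test restore is still needed to prove the backup is usable.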

Recovery progress, tracked through GitLab’s public Twitter updates, reached 73% after about an hour. That pace suggests the restore did not rely on the scheduled backups, though the exact technique was not disclosed.

The incident highlighted three layers of lessons:

Technical: The database used Slony replication, an unusual choice for high availability. Mainstream options, such as PostgreSQL native streaming replication or the built‑in replication of MySQL and Oracle, would have offered a more robust HA configuration.

Operational: The pg_dump job stopped working after the major version upgrade and nobody noticed. GitLab’s incident notes attribute the failure to a 9.2 pg_dump binary still being run against the 9.6 server. The case underscores the need to revalidate backup procedures after every upgrade.
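
One post‑upgrade check that is easy to automate is confirming that the pg_dump client still matches the server's major version, since pg_dump refuses to dump from a server newer than itself. The sketch below is an illustration in Python under stated assumptions: the helper names are hypothetical, the server version string is assumed to come from running SHOW server_version, and requiring an exact major‑version match is a deliberately conservative simplification (a newer pg_dump can normally dump an older server).

```python
import re
import subprocess


def major_version(version_text):
    """Extract the PostgreSQL major version from text such as
    'pg_dump (PostgreSQL) 9.6.1' or '14.5'.

    Before PostgreSQL 10 the major version has two components (9.6);
    from 10 onward it is a single number (14).
    """
    match = re.search(r"(\d+)(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"no version number in {version_text!r}")
    first, second = match.group(1), match.group(2)
    if int(first) >= 10:
        return first
    if second is None:
        raise ValueError(f"incomplete pre-10 version in {version_text!r}")
    return f"{first}.{second}"


def client_matches_server(pg_dump_version_output, server_version):
    """True when the pg_dump client and the server share a major version."""
    return major_version(pg_dump_version_output) == major_version(server_version)


def installed_pg_dump_version(pg_dump_path="pg_dump"):
    """Return the output of `pg_dump --version` for the binary on this host."""
    result = subprocess.run(
        [pg_dump_path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

A backup wrapper could call client_matches_server(installed_pg_dump_version(), server_version) before each run and abort loudly on a mismatch instead of writing an empty dump.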

Management: Critical changes should follow a “golden‑hand” policy, better known elsewhere as the two‑person or four‑eyes rule: no single person should execute risky operations alone, especially during off‑hours, and proper shift rotation is essential to avoid fatigue‑related errors.
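
Part of such a policy can be enforced mechanically before a destructive command runs. The sketch below is a hypothetical pre‑flight check in Python, not a description of GitLab's procedures: the function name deletion_blockers and its parameters are assumptions, and the check covers only three conditions (the operator is on the intended host, a different person has approved, and a directory believed to be empty really is empty).

```python
import socket
from pathlib import Path


def deletion_blockers(path, expected_host, operator, approver,
                      expect_empty=True, current_host=None):
    """Return reasons to refuse deleting `path`; an empty list means proceed.

    All names here are illustrative. `current_host` can be passed explicitly
    for testing; by default the local hostname is used.
    """
    current_host = socket.gethostname() if current_host is None else current_host
    blockers = []

    # Guard against running the command on the wrong server.
    if current_host != expected_host:
        blockers.append(f"running on {current_host}, expected {expected_host}")

    # Two-person rule: the approver must exist and differ from the operator.
    if not approver or approver == operator:
        blockers.append("a second person must approve this operation")

    # If the operator believes the directory is empty, verify that belief.
    target = Path(path)
    if expect_empty and target.is_dir() and any(target.iterdir()):
        blockers.append(f"{target} is not empty")

    return blockers
```

A wrapper script would print the blockers and exit without deleting anything when the list is non‑empty. A check like this supplements the policy rather than replacing it, because the approver still has to review what is about to happen.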

Overall, the case demonstrates that having multiple “defense‑in‑depth” layers is meaningless without proper implementation, verification, and continuous improvement of backup and recovery processes.

For further technical details, see the incident report at https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub and the monitoring dashboard at http://monitor.gitlab.net/dashboard/db/postgres-stats?panelId=10&fullscreen&from=now-24h&to=now.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
