When 310 GB Vanished: GitLab’s Backup Failure and What It Teaches Us
A GitLab.com database accident caused the loss of 310 GB of data, exposing multiple failed backup mechanisms and prompting a detailed analysis of technical, operational, and managerial lessons for reliable data protection.
On January 31, 2017, a GitLab.com engineer ran rm -rf on what he believed was an empty directory, then realized he had deleted a 310 GB data directory on the production database server. Within seconds, about 99% of the data was gone.
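One way to reduce the risk of this kind of mistake is to wrap destructive commands in a guard that checks the two assumptions the operator is making: "I am on the host I think I am on" and "this directory is empty". The following is a minimal sketch in Python, not GitLab's tooling; the function name guarded_rmtree, its parameters, and the host name in the usage note are hypothetical.

```python
import shutil
import socket
from pathlib import Path


def guarded_rmtree(path, expected_host, current_host=None, max_entries=0):
    """Delete `path` only if the host and the directory contents match expectations.

    All names here are illustrative assumptions, not part of any real tool.
    """
    # Use the real hostname unless the caller supplies one (useful for testing).
    current_host = current_host or socket.gethostname()
    if current_host != expected_host:
        raise RuntimeError(
            f"refusing to delete: running on {current_host!r}, expected {expected_host!r}"
        )

    target = Path(path)
    if not target.is_dir():
        raise RuntimeError(f"refusing to delete: {target} is not a directory")

    # Count top-level entries; an "empty" directory should have none.
    entries = sum(1 for _ in target.iterdir())
    if entries > max_entries:
        raise RuntimeError(
            f"refusing to delete: {target} has {entries} entries, expected at most {max_entries}"
        )

    shutil.rmtree(target)
```

For example, guarded_rmtree("/var/opt/data", expected_host="db-replica-01") would raise an error instead of deleting anything if it were run on a different machine or if the directory turned out not to be empty. Both the path and the host name are placeholders.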
None of the existing backup mechanisms worked as intended: the scheduled pg_dump script had been failing because its backup directory disappeared after the PostgreSQL upgrade from 9.2 to 9.6; the most recent LVM snapshot was 6 hours old, so it did not capture the data written since; Azure disk snapshots had never been enabled for the database; the S3 backups appeared incomplete; and the automated synchronization process was unreliable.
Recovery progress, tracked through GitLab’s public Twitter updates, reached 73% after about an hour. That pace suggests the restoration did not rely on the regular backups, though the exact technique was not disclosed.
The incident highlighted three layers of lessons:
Technical: The database used Slony replication, an unusual choice for high availability; mainstream options, such as PostgreSQL’s native streaming replication or the built‑in HA features of MySQL and Oracle, would have offered more robust configurations.
Operational: The pg_dump script broke after the major version upgrade because the required backup directory no longer existed, underscoring the need for post‑upgrade validation of backup procedures.
Management: Critical changes should follow a “golden‑hand” policy, under which no single person executes risky operations alone, especially during off‑hours; proper shift rotation is also essential to avoid fatigue‑related errors.
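The operational lesson, validating backups after an upgrade instead of assuming they still run, can be partly automated. The sketch below is one possible check, assuming backups are written as files into a single directory; the function name check_backups, its thresholds, and the example path are hypothetical. It only confirms that a recent, non-empty backup file exists, which is a weaker guarantee than an actual test restore.

```python
import time
from pathlib import Path


def check_backups(backup_dir, max_age_hours=24, min_bytes=1, now=None):
    """Return a list of problems found with the newest file in `backup_dir`.

    An empty list means the basic checks passed. Thresholds are illustrative.
    """
    now = time.time() if now is None else now
    root = Path(backup_dir)

    # A missing directory is the failure mode described above.
    if not root.is_dir():
        return [f"backup directory missing: {root}"]

    files = [p for p in root.iterdir() if p.is_file()]
    if not files:
        return [f"no backup files in {root}"]

    # Inspect only the most recently modified file.
    newest = max(files, key=lambda p: p.stat().st_mtime)
    stat = newest.stat()

    problems = []
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"newest backup {newest.name} is {age_hours:.1f} hours old")
    if stat.st_size < min_bytes:
        problems.append(f"newest backup {newest.name} is only {stat.st_size} bytes")
    return problems
```

Run from a scheduler after every backup window and after any major upgrade, a check like check_backups("/var/backups/postgres") should alert a human whenever the returned list is not empty. A silent failure of the kind described above would then surface within a day rather than during an emergency.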
Overall, the case demonstrates that having multiple “defense‑in‑depth” layers is meaningless without proper implementation, verification, and continuous improvement of backup and recovery processes.
For further technical details, see the incident report at https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub and the monitoring dashboard at http://monitor.gitlab.net/dashboard/db/postgres-stats?panelId=10&fullscreen&from=now-24h&to=now.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
