GitLab.com Database Disaster: How an rm Command on the Wrong Server Wiped 300 GB and What We Learned
GitLab.com suffered a catastrophic database outage on February 1, 2017 when an exhausted operator mistakenly ran a destructive rm command on the wrong server, wiping most production data; the incident’s timeline, root causes, recovery steps, and lessons learned are detailed in this post‑mortem.
On February 1, 2017 GitLab.com suffered a major database incident: an operator who meant to clear the PostgreSQL data directory on a lagging replica ran rm -rf against the same directory on the production server instead, deleting most of the production data.
GitLab is a popular open‑source Git hosting platform; the company also offers a SaaS service at GitLab.com.
Good news: The database was restored at 00:14 UTC on February 2.
Bad news: Six hours of production data (projects, users, issues, merge requests) were lost; code and wiki files stored on the filesystem were unaffected.
Incident timeline – a concise recap
GitLab was attacked via malicious snippets, causing PostgreSQL replication issues.
An exhausted operator ran rm -rf on the PostgreSQL data directory of the healthy production server (db1) instead of the faulty replica (db2), reducing 300 GB to 4.5 GB.
A manual LVM snapshot taken about six hours before the failure provided the only usable recovery point.
Phase 1 – Detection
At 18:00 UTC on 31 Jan, spam‑sending IPs created snippets that destabilised the database. By 21:00 UTC the database could no longer accept writes and went down.
Actions taken
Blocked the spammer’s IP address.
Removed a user account that roughly 47 000 IP addresses were signing in to, which was generating massive database load.
Deleted the user responsible for the malicious snippets.
Phase 2 – Replication lag
At 22:00 UTC the replication lag grew to several gigabytes because a high‑volume write burst was not processed.
Actions taken
Attempted to repair the lagging replica (db2).
Cleared /var/opt/gitlab/postgresql/data on the replica to force a clean sync.
Increased max_wal_senders to 32 and restarted PostgreSQL.
Reduced max_connections from 8000 to 2000 and restarted.
Handled numerous PostgreSQL errors (too many open semaphores, refused connections).
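The replication lag in this phase is the gap between the write-ahead log position on the primary and the position the replica has replayed. The following is a minimal sketch of how that gap could be measured and alerted on, assuming the two positions are available as PostgreSQL LSN strings (for example, what PostgreSQL 9.6 returns from pg_current_xlog_location() on the primary and pg_last_xlog_replay_location() on the replica). The function names and the 1 GiB threshold are illustrative assumptions, not something taken from GitLab's tooling.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte offset.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to replay (clamped at zero)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))


# Hypothetical alert threshold (1 GiB); tune it to the workload.
LAG_ALERT_BYTES = 1024 ** 3


def lag_is_alarming(primary_lsn: str, replica_lsn: str) -> bool:
    """True when the replica is further behind than the alert threshold."""
    return replication_lag_bytes(primary_lsn, replica_lsn) > LAG_ALERT_BYTES
```

Alerting on a number like this before it reaches several gigabytes gives operators time to act while the replica can still catch up on its own.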
Phase 3 – Accidental deletion
Around 23:00 UTC the engineer ran the delete command believing he was on the problematic replica (db2); he was actually on the production server (db1). He noticed the mistake and stopped the operation at 23:27, but only 4.5 GB of the original 300 GB remained.
GitLab.com was taken offline and a status tweet announced “Database emergency maintenance…”.
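The core mistake here was running the right command on the wrong host. One possible guard, sketched below, is to make any destructive helper refuse to run unless the operator retypes the name of the machine they believe they are on. This is an illustration only: guarded_rmtree, confirm_destructive_action and WrongHostError are hypothetical names, and the hostnames db1 and db2 are used simply because they match the servers in this incident.

```python
import shutil
import socket
from typing import Optional


class WrongHostError(RuntimeError):
    """Raised when the typed hostname does not match the machine we are on."""


def confirm_destructive_action(typed_hostname: str,
                               actual_hostname: Optional[str] = None) -> None:
    """Abort unless the operator typed the name of the host they are really on."""
    # actual_hostname can be injected so the check is testable off the real host.
    actual = socket.gethostname() if actual_hostname is None else actual_hostname
    if typed_hostname.strip().lower() != actual.strip().lower():
        raise WrongHostError(
            f"you typed {typed_hostname!r} but this machine is {actual!r}; "
            "nothing was deleted"
        )


def guarded_rmtree(path: str, typed_hostname: str,
                   actual_hostname: Optional[str] = None) -> None:
    """Delete a directory tree only after the hostname confirmation passes."""
    confirm_destructive_action(typed_hostname, actual_hostname)
    shutil.rmtree(path)
```

With a guard like this, an operator who intends to wipe the data directory on db2 and types "db2" while logged in to db1 gets an error instead of an empty directory.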
Problems uncovered
LVM snapshots run every 24 h; a manual snapshot taken six hours earlier saved the recovery point.
Daily backups (gitlab-rake gitlab:backup:create) were misconfigured and did not contain useful data.
pg_dump failed because workers were using PostgreSQL 9.2 binaries while the cluster required 9.6.
Fog gem cleanup removed earlier backups.
Only NFS servers had disk snapshots; DB servers did not.
The script that syncs production data to the staging environment deleted webhooks as part of the sync.
Replication relied on fragile hand‑written shell scripts without proper documentation.
S3 backups also failed.
The five‑level backup/replication strategy proved ineffective; only the manual LVM snapshot allowed recovery to a point six hours before the incident.
pg_basebackup required ~10 minutes to initialise replication, causing further delays.
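The pg_dump failure in the list above came from a client/server version mismatch (9.2 binaries against a 9.6 cluster). A backup job could check for this before it runs and fail loudly. The sketch below is a simplified illustration: it assumes the client version text looks like the output of pg_dump --version (for example "pg_dump (PostgreSQL) 9.6.1") and that the server version is available as a string such as "9.6.1". It uses a strict equality check on the major version, which is stricter than pg_dump itself requires. The function names are hypothetical.

```python
import re
from typing import Tuple

_VERSION_RE = re.compile(r"(\d+)(?:\.(\d+))?")


def major_version(version_text: str) -> Tuple[int, ...]:
    """Extract the major version from text such as 'pg_dump (PostgreSQL) 9.6.1'.

    Before PostgreSQL 10 the major version is the first two numbers (9.2, 9.6);
    from 10 onwards it is the first number alone.
    """
    match = _VERSION_RE.search(version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    first = int(match.group(1))
    if first >= 10 or match.group(2) is None:
        return (first,)
    return (first, int(match.group(2)))


def client_matches_server(client_text: str, server_text: str) -> bool:
    """True when the dump client and the server share the same major version."""
    return major_version(client_text) == major_version(server_text)
```

A backup wrapper could call client_matches_server first and raise an alert, rather than silently producing an empty dump, when it returns False.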
Recovery
Data were restored from the staging environment backup:
00:36 UTC – Backed up db1.staging.gitlab.com.
00:55 UTC – Mounted the backup to db1.cluster.gitlab.com.
Copied data from /var/opt/gitlab/postgresql/data in staging to production.
01:05 UTC – NFS server repurposed as temporary storage.
01:18 UTC – Completed copy of remaining data, including pg_xlog, into 20170131-db-meltodwn-backup.tar.gz.
The restoration was streamed live on YouTube, and engineers sought community help to accelerate the slow disk‑I/O bound process.
https://www.youtube.com/watch?v=nc0hPGerSd4
https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
Key take‑aways
Never operate a database while fatigued or under the influence.
Alias dangerous commands like rm to a safer wrapper.
Test backups regularly by performing full restore drills.
Adopt a blameless post‑mortem culture focused on root‑cause analysis and improvement.
Consider secondary effects of emergency actions; always double‑check before executing.
Maintain a robust incident‑response plan with adequate spare hardware.
Avoid adding extra approval steps during crisis mitigation.
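As a small complement to the backup-testing advice above, the sketch below shows an automated sanity check that would flag a backup file that is missing, suspiciously small, or stale. It is not a substitute for a full restore drill, since a file can be large and recent and still be unrestorable. The function name, parameters and thresholds are hypothetical.

```python
import os
import time
from typing import List, Optional


def backup_problems(path: str, min_bytes: int, max_age_seconds: float,
                    now: Optional[float] = None) -> List[str]:
    """Return a list of human-readable problems with a backup file (empty if none)."""
    if not os.path.isfile(path):
        return [f"{path}: backup file is missing"]

    problems = []
    info = os.stat(path)

    # A dump that is only a few bytes long almost certainly contains no data.
    if info.st_size < min_bytes:
        problems.append(
            f"{path}: only {info.st_size} bytes, expected at least {min_bytes}"
        )

    # "now" can be injected for testing; it defaults to the current time.
    current = time.time() if now is None else now
    age = current - info.st_mtime
    if age > max_age_seconds:
        problems.append(
            f"{path}: last modified {age:.0f} s ago, limit is {max_age_seconds:.0f} s"
        )
    return problems
```

Running a check like this after every backup job, and paging someone when the list is non-empty, would surface a broken backup within a day rather than during an outage.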
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.