GitLab.com Database Disaster: How a Mistyped rm Command Wiped 300GB and What We Learned
GitLab.com suffered a catastrophic database outage on February 1, 2017 when an exhausted operator mistakenly ran a destructive rm command on the wrong server, wiping most production data; the incident’s timeline, root causes, recovery steps, and lessons learned are detailed in this post‑mortem.
On February 1, 2017, GitLab.com suffered a major database incident caused by an operator mistakenly running rm -rf / on the wrong PostgreSQL server, deleting most of the production data.
GitLab is a popular open‑source Git hosting platform; the company also offers a SaaS service at GitLab.com.
Good news: The database was restored at 00:14 UTC on February 2.
Bad news: Six hours of production data (projects, users, issues, merge requests) were lost; code and wiki files stored on the filesystem were unaffected.
Incident timeline – a concise recap
GitLab was attacked via malicious snippets, causing PostgreSQL replication issues.
An exhausted operator executed rm -rf / on a healthy DB server instead of the faulty one, reducing roughly 300 GB of data to 4.5 GB.
A manual LVM snapshot taken shortly before the failure saved a recovery point.
Phase 1 – Detection
At 18:00 UTC on 31 Jan, spam‑sending IPs created snippets that destabilised the database. By 21:00 UTC the database could no longer accept writes and went down.
Actions taken
Blocked the spammer’s IP address.
Removed a user account that was generating massive load (it had connected from roughly 47,000 IP addresses).
Deleted the user responsible for the malicious snippets.
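The mitigation steps above can be sketched as a small runbook script. Everything here is illustrative: the IP is a placeholder, and the gitlab-rails one-liner stands in for however the account was actually removed; a DRY_RUN guard keeps the sketch safe to execute.

```shell
#!/bin/sh
# Hedged sketch of the Phase 1 mitigations. DRY_RUN=1 (the default)
# only prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
SPAM_IP="203.0.113.7"   # placeholder, not the real attacker's address

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Drop all traffic from the spamming IP at the firewall.
run iptables -I INPUT -s "$SPAM_IP" -j DROP

# Remove the abusive account; the exact Rails one-liner is illustrative.
run gitlab-rails runner 'User.find_by(username: "spammer")&.destroy'
```

Printing before destroying is itself one of the lessons of this incident: a dry-run pass gives a second chance to notice the wrong host or the wrong target.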
Phase 2 – Replication lag
At 22:00 UTC the replication lag grew to several gigabytes because a high‑volume write burst was not processed.
Actions taken
Attempted to repair the lagging replica (db2).
Cleared /var/opt/gitlab/postgresql/data on the replica to force a clean sync.
Increased max_wal_senders to 32 and restarted PostgreSQL.
Reduced max_connections from 8000 to 2000 and restarted.
Handled numerous PostgreSQL errors (signal overload, connection refusals).
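The two parameter changes above can be sketched as an idempotent edit to postgresql.conf. The real path would be under /var/opt/gitlab/postgresql/data; for a safe demo this sketch falls back to a local file seeded with the pre-incident value.

```shell
#!/bin/sh
# Sketch of the Phase 2 tuning. CONF would normally point at the real
# postgresql.conf; here we default to a local demo file.
CONF="${CONF:-./postgresql.conf}"
[ -f "$CONF" ] || printf 'max_connections = 8000\n' > "$CONF"

set_setting() {  # set_setting <name> <value>: replace or append a setting
  if grep -q "^[#[:space:]]*$1[[:space:]]*=" "$CONF"; then
    sed -i "s/^[#[:space:]]*$1[[:space:]]*=.*/$1 = $2/" "$CONF"
  else
    printf '%s = %s\n' "$1" "$2" >> "$CONF"
  fi
}

set_setting max_wal_senders 32     # raised so replicas can stream WAL
set_setting max_connections 2000   # reduced from 8000 to ease load
# Both settings take effect only after a full PostgreSQL restart.
```

Note that both parameters require a restart, not just a reload, which is why each change in the timeline was followed by a PostgreSQL restart.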
Phase 3 – Accidental deletion
At 23:00 UTC the engineer realized he had run the delete command on the production server (db1) instead of the problematic replica (db2). The deletion was stopped at 23:27 UTC, but only 4.5 GB of the original ~300 GB remained.
GitLab.com was taken offline and a status tweet announced “Database emergency maintenance…”.
Problems uncovered
LVM snapshots run every 24 h; a manual snapshot taken six hours earlier saved the recovery point.
Daily backups (gitlab‑rake gitlab:backup:create) were mis‑configured and did not contain useful data.
pg_dump failed silently because the backup job invoked PostgreSQL 9.2 binaries while the cluster was running 9.6, producing empty dumps.
Fog gem cleanup removed earlier backups.
Only NFS servers had disk snapshots; DB servers did not.
Sync scripts deleted webhooks during pre‑release data sync.
Replication relied on fragile hand‑written shell scripts without proper documentation.
S3 backups also failed.
The five‑level backup/replication strategy proved ineffective; only the manual LVM snapshot allowed recovery to a point six hours before the incident.
pg_basebackup required ~10 minutes to initialise replication, causing further delays.
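One of the silent failures above, pg_dump built for 9.2 running against a 9.6 cluster, is cheap to guard against. A minimal sketch: compare major versions before trusting a dump. The version strings below are hard-coded for illustration; in production they would come from pg_dump --version and SHOW server_version.

```shell
#!/bin/sh
# Guard sketch for the pg_dump 9.2 vs. server 9.6 mismatch that left
# GitLab with empty backups.

major_of() {  # extract "9.6" from a version string like "9.6.1"
  printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+' | head -n 1
}

check_match() {  # succeed only when client and server majors agree
  [ "$(major_of "$1")" = "$(major_of "$2")" ]
}

# In production the inputs would be live, e.g.:
#   client=$(pg_dump --version)
#   server=$(psql -Atc 'SHOW server_version;')
check_match "pg_dump (PostgreSQL) 9.2.18" "9.6.1" ||
  echo "version mismatch: refuse to run the backup"
```

A check like this turns a backup that quietly produces nothing into one that fails loudly, which is the property every layer of the five-level strategy lacked.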
Recovery
Data were restored from the pre‑release environment backup:
00:36 UTC – Backed up db1.staging.gitlab.com.
00:55 UTC – Mounted the backup on db1.cluster.gitlab.com.
Copied data from /var/opt/gitlab/postgresql/data in staging to production.
01:05 UTC – Repurposed an NFS server as temporary storage.
01:18 UTC – Completed copying the remaining data, including pg_xlog, into 20170131-db-meltodwn-backup.tar.gz.
The restoration was streamed live on YouTube, and engineers sought community help to accelerate the slow disk‑I/O bound process.
Live stream of the recovery: https://www.youtube.com/watch?v=nc0hPGerSd4
GitLab's official incident post: https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/
Key take‑aways
Never operate a database while fatigued or under the influence.
Alias dangerous commands like rm to a safer wrapper.
Test backups regularly by performing full restore drills.
Adopt a blameless post‑mortem culture focused on root‑cause analysis and improvement.
Consider secondary effects of emergency actions; always double‑check before executing.
Maintain a robust incident‑response plan with adequate spare hardware.
Avoid adding extra approval steps during crisis mitigation.
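The "alias rm to a safer wrapper" takeaway can be sketched as a shell function that refuses to touch a short list of critical paths; the protected list here is illustrative and would be tuned per host.

```shell
#!/bin/sh
# Sketch of a guarded rm: refuse a hard-coded list of critical paths
# and fall through to the real /bin/rm otherwise.
safe_rm() {
  for arg in "$@"; do
    case "$arg" in
      /|/etc|/var/opt/gitlab|/var/opt/gitlab/*)
        echo "refusing to rm '$arg' (protected path); use /bin/rm deliberately" >&2
        return 1 ;;
    esac
  done
  /bin/rm "$@"
}

# In an interactive shell, route rm through the wrapper:
alias rm=safe_rm
```

A wrapper like this would not have fixed the replication problem, but it would have turned the fatal keystroke into an error message, buying the operator the double-check the takeaways call for.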
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.