
GitLab.com Database Disaster: How a Mistyped rm Command Wiped 300GB and What We Learned

GitLab.com suffered a catastrophic database outage on February 1, 2017 when an exhausted operator mistakenly ran a destructive rm command on the wrong server, wiping most production data; the incident’s timeline, root causes, recovery steps, and lessons learned are detailed in this post‑mortem.


On February 1, 2017, GitLab.com suffered a major database incident caused by an operator mistakenly running

rm -rf /var/opt/gitlab/postgresql/data

on the wrong PostgreSQL server, deleting most of the production database.

GitLab is a popular open‑source Git hosting platform; the company also offers a SaaS service at GitLab.com.

Good news: The database was restored at 00:14 UTC on February 2.

Bad news: Six hours of production data (projects, users, issues, merge requests) were lost; code and wiki files stored on the filesystem were unaffected.

Incident timeline – a concise recap

GitLab was attacked via malicious snippets, causing PostgreSQL replication issues.

An exhausted operator executed

rm -rf /var/opt/gitlab/postgresql/data

on the healthy production server (db1) instead of the faulty replica (db2), reducing roughly 300 GB of data to about 4.5 GB.

A manual LVM snapshot taken shortly before the failure saved a recovery point.

Phase 1 – Detection

At 18:00 UTC on 31 Jan, spam‑sending IPs created snippets that destabilised the database. By 21:00 UTC the database could no longer accept writes and went down.

Actions taken

Blocked the spammer’s IP address.

Removed a user account that was generating massive load (roughly 47,000 IPs signing in with the same account).

Deleted the user responsible for the malicious snippets.

Phase 2 – Replication lag

At 22:00 UTC the replication lag grew to several gigabytes because a high‑volume write burst was not processed.

Actions taken

Attempted to repair the lagging replica (db2).

Cleared /var/opt/gitlab/postgresql/data on the replica to force a clean sync.

Increased max_wal_senders to 32 and restarted PostgreSQL.

Reduced max_connections from 8000 to 2000 and restarted.

Handled numerous PostgreSQL errors (signal overload, connection refusals).
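The two parameter changes above map to a couple of lines in postgresql.conf; a minimal sketch of the settings as described in this incident (the values are incident-specific, not general recommendations):

```ini
# postgresql.conf – replication-related settings changed during the incident
max_wal_senders = 32     # more WAL sender slots for replicas and base backups
max_connections = 2000   # reduced from 8000 to relieve connection pressure
```

Note that in PostgreSQL 9.6 each WAL sender connection also counts against max_connections, and both settings take effect only after a server restart.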

Phase 3 – Accidental deletion

At 23:00 UTC the engineer realized he had run the delete command on the production server (db1) instead of the problematic replica (db2). The operation stopped at 23:27, but only 4.5 GB of the original 300 GB remained.

GitLab.com was taken offline and a status tweet announced “Database emergency maintenance…”.

Problems uncovered

LVM snapshots run every 24 h; a manual snapshot taken six hours earlier saved the recovery point.

Daily backups (gitlab‑rake gitlab:backup:create) were mis‑configured and did not contain useful data.

pg_dump failed because workers were using PostgreSQL 9.2 binaries while the cluster required 9.6.

Fog gem cleanup removed earlier backups.

Only NFS servers had disk snapshots; DB servers did not.

Sync scripts deleted webhooks during pre‑release data sync.

Replication relied on fragile hand‑written shell scripts without proper documentation.

S3 backups also failed.

The five‑level backup/replication strategy proved ineffective; only the manual LVM snapshot allowed recovery to a point six hours before the incident.

pg_basebackup required ~10 minutes to initialise replication, causing further delays.
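The common thread in the problems above is backups that silently produced nothing useful. A minimal sketch of an automated check that flags a missing, stale, or suspiciously small latest backup (the paths and thresholds are hypothetical, not GitLab's actual tooling):

```python
import time
from pathlib import Path

# Hypothetical values for illustration; adjust to your environment.
BACKUP_DIR = Path("/var/opt/backups")
MAX_AGE_SECONDS = 24 * 3600      # alert if the newest backup is older than a day
MIN_SIZE_BYTES = 1024 * 1024     # alert if it is implausibly small

def check_latest_backup(backup_dir: Path) -> list:
    """Return a list of problems with the newest backup (empty list = OK)."""
    backups = sorted(backup_dir.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
    if not backups:
        return ["no backups found"]
    latest = backups[-1]
    problems = []
    if time.time() - latest.stat().st_mtime > MAX_AGE_SECONDS:
        problems.append(f"{latest.name} is older than 24h")
    if latest.stat().st_size < MIN_SIZE_BYTES:
        problems.append(f"{latest.name} is smaller than 1 MiB")
    return problems

# Example: run from cron and page someone whenever the list is non-empty.
# issues = check_latest_backup(BACKUP_DIR)
```

A size-and-age check like this would not replace full restore drills, but it would have caught empty backup archives long before they were needed.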

Recovery

Data were restored from the pre‑release environment backup:

00:36 UTC – Backed up db1.staging.gitlab.com.

00:55 UTC – Mounted the backup on db1.cluster.gitlab.com and copied data from /var/opt/gitlab/postgresql/data in staging to production.

01:05 UTC – Repurposed an NFS server as temporary storage.

01:18 UTC – Completed copying the remaining data, including pg_xlog, into 20170131-db-meltodwn-backup.tar.gz.

The restoration was streamed live on YouTube, and engineers sought community help to accelerate the slow disk‑I/O bound process.

https://www.youtube.com/watch?v=nc0hPGerSd4
https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

Key take‑aways

Never perform destructive operations on a production database while fatigued or otherwise impaired.

Alias dangerous commands such as rm to a safer wrapper.

Test backups regularly by performing full restore drills.

Adopt a blameless post‑mortem culture focused on root‑cause analysis and improvement.

Consider secondary effects of emergency actions; always double‑check before executing.

Maintain a robust incident‑response plan with adequate spare hardware.

Avoid adding extra approval steps during crisis mitigation.
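The "safer wrapper" advice above can be sketched as a shell function that moves targets into a trash directory instead of unlinking them. This is a hypothetical example (real deployments often use tools like trash-cli), not the wrapper GitLab adopted:

```shell
#!/usr/bin/env bash
# safe_rm: move targets into a trash directory instead of deleting them.
TRASH_DIR="${TRASH_DIR:-$HOME/.local/share/safe-rm-trash}"

safe_rm() {
    mkdir -p "$TRASH_DIR"
    local target
    for target in "$@"; do
        # Skip option flags, so force/recursive switches have no effect.
        [[ "$target" == -* ]] && continue
        if [[ ! -e "$target" ]]; then
            echo "safe_rm: no such file: $target" >&2
            continue
        fi
        # Timestamp suffix avoids collisions when the same name is trashed twice.
        mv -- "$target" "$TRASH_DIR/$(basename -- "$target").$(date +%s%N)"
    done
}

# Shadow rm in interactive shells only; scripts still get the real rm.
alias rm='safe_rm'
```

A wrapper like this buys a recovery window after a mistyped command, at the cost of having to empty the trash directory periodically.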

Tags: operations, DevOps, GitLab, PostgreSQL, backup, postmortem, database incident
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
